
The Chesapeake Digital Preservation Group: "Link Rot" and Legal Resources on the Web, 2013
 

"Link Rot" and Legal Resources on the Web: A 2013 Analysis by the Chesapeake Digital Preservation Group




Contents



Introduction

Data Set 1: Original 2008 Sample Data

Link Rot and Top-Level Domains, 2008-2013

Data Set 2: 2013 Data Sample

Link Rot and Top-Level Domains in 2013

Contact





Introduction



The Chesapeake Digital Preservation Group has completed its sixth annual investigation of link rot among the original URLs for online law- and policy-related materials archived through the group's efforts.


The Chesapeake Digital Preservation Group is a collaborative digital preservation program for legal materials, reports, and documents posted to the web. The group comprises four member libraries: two academic law libraries (the Georgetown Law and Harvard Law School Libraries) and the State Law Libraries of Maryland and Virginia. It is part of the Legal Information Archive.


Access to web-published content can be lost as websites are routinely updated, reorganized, or deleted over time. In the seven years since the program began, the Chesapeake Group has built a digital archive collection comprising more than 8,954 digital items and over 4,000 titles, almost all originally posted to the web but captured and preserved within the group's digital archive.


Every year, the Chesapeake Group investigates whether or not the documents in the archive can still be found at the original web addresses from which they were captured. The group analyzes two samples of web addresses, or URLs, pulled from the archive's records. The first sample includes 579 original URLs for content captured from 2007-2008. This sample is revisited every year to document link rot and explore how it changes over time. The second sample represents the full content of the archive at the time the study is conducted and provides an up-to-date snapshot of link rot among the original URLs for all the content currently in the archive. In 2013, this sample included 842 original URLs for materials captured from 2007-2013.





Data Set 1: Original 2008 Sample Data

44 Percent of URLs from Original Data Set No Longer Work

In 2013, 256 out of 579 URLs in the sample no longer provide access to the content that was originally selected, captured, and archived by the Chesapeake Group. In other words, link rot has increased to 44.2 percent within six years.


In 2008, the sample was analyzed for the first time as part of an evaluation of the archiving program, and link rot was found to be present in 48, or 8.3 percent, of the 579 URLs comprising the sample. At the time, a total of 1,266 web-based titles had been captured and archived. A random sample of 579 titles from the archive was generated for the analysis, ensuring results at a 95 percent confidence level and confidence interval of +/- 3.
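The sample size quoted above is consistent with one common approach, Cochran's formula with a finite population correction. The report does not say which method or tool the group used, so the Python sketch below is only an illustration under that assumption. The function name `sample_size`, the z-score of 1.96 for a 95 percent confidence level, and the conservative proportion p = 0.5 are all assumptions, not details from the report.

```python
def sample_size(population: int, margin: float = 0.03,
                z: float = 1.96, p: float = 0.5) -> float:
    """Unrounded sample size for estimating a proportion in a finite population.

    Assumes Cochran's formula with a finite population correction:
      z      -- z-score for the confidence level (1.96 for 95 percent)
      margin -- half-width of the confidence interval (0.03 for +/- 3)
      p      -- assumed proportion (0.5 is the most conservative choice)
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # size for an unlimited population
    return n0 / (1 + (n0 - 1) / population)     # shrink for a finite population


# 1,266 titles were in the archive when the 2008 sample was drawn.
print(round(sample_size(1266), 1))
```

Under these assumptions the formula gives roughly 579.3 for a population of 1,266 titles, which agrees with the 579 titles sampled. For the 4,008 titles in the 2013 collection it gives roughly 842.9, close to the 842 titles used for the 2013 sample.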


One year later, in 2009, the sample was analyzed a second time. Link rot was found in 83 of the original 579 URLs: within two years of capture, 14.3 percent of the archived titles had disappeared from their original URLs. By the third year, in 2010, link rot had increased to 160 of 579 URLs, or 27.6 percent. Link rot continued to increase in 2011, but more slowly, reaching 30.4 percent by the fourth year. The 2012 data showed an increase of 7.3 percentage points over 2011, and the 2013 data showed a further increase of 6.5 percentage points over 2012. Increases in link rot from 2008 through 2013 are illustrated in Figure 1 and Table 1.
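The yearly percentages and year-over-year changes can be recomputed from the counts of missing URLs. The following Python sketch uses the counts reported in Table 1; the function name `link_rot_table` is hypothetical. It assumes that each change is the difference between the rounded percentages of consecutive years, which is the convention that reproduces the figures in Table 1.

```python
SAMPLE_SIZE = 579

# URLs whose content was missing at each annual check (counts from Table 1).
missing_by_year = {2008: 48, 2009: 83, 2010: 160, 2011: 176, 2012: 218, 2013: 256}


def link_rot_table(missing: dict, total: int) -> list:
    """Return (year, missing, working, percent link rot, change) rows."""
    rows = []
    previous_rate = None
    for year in sorted(missing):
        rate = round(100 * missing[year] / total, 1)
        # Change is in percentage points, taken between rounded rates.
        change = None if previous_rate is None else round(rate - previous_rate, 1)
        rows.append((year, missing[year], total - missing[year], rate, change))
        previous_rate = rate
    return rows


for row in link_rot_table(missing_by_year, SAMPLE_SIZE):
    print(row)
```

Note that the changes are differences in percentage points, not relative growth: the move from 37.7 percent to 44.2 percent is a change of 6.5 points.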


Link Rot and Top-Level Domains, 2008-2013

More than 90 percent of the URLs in the original sample fall under three top-level domains: state-government (state.[state code].us), organization (.org), and government (.gov), representing approximately 41 percent, 32 percent, and 17 percent of the sample, respectively. In 2013, the content at .gov domains showed the largest increase in link rot. More than 50 percent of the materials posted to government domains had disappeared from their original documented web addresses. This is the first time that one of the top three domains has had more than half of its links no longer working.


A list of all top-level domains found in the sample, along with the link rot detected in 2008-2013, is available in Table 2.
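Grouping URLs by top-level domain, as in Table 2, can be automated. The report does not describe how the group assigned URLs to domain categories, so the Python sketch below is only one plausible way to do it. The helper `domain_category`, the `STATE_GOV` pattern, and the example URLs are all hypothetical and do not come from the study.

```python
import ipaddress
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches state-government hosts such as "www.example.state.md.us".
STATE_GOV = re.compile(r"\.state\.[a-z]{2}\.us$")


def domain_category(url: str) -> str:
    """Return a Table 2 style category for a URL (hypothetical helper)."""
    host = (urlsplit(url).hostname or "").lower()
    try:
        ipaddress.ip_address(host)           # bare IP address, no domain name
        return "[IP address]"
    except ValueError:
        pass
    if STATE_GOV.search(host):
        return ".state.__.us"
    return "." + host.rsplit(".", 1)[-1]     # last label, e.g. ".org"


# Hypothetical URLs for illustration only; none come from the study.
urls = [
    "http://www.example.state.md.us/report.pdf",
    "https://www.example.org/policy/brief.html",
    "http://www.example.gov/docs/rule.pdf",
    "http://192.0.2.10/archive/doc.pdf",
]
counts = Counter(domain_category(u) for u in urls)
print(counts)
```

Checking the state-government pattern before falling back to the last label keeps state sites separate from other .us addresses, which Table 2 lists as their own row.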





Figure 1






Table 1: Percentage Link Rot Increase
(Original Sample Data)



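| Year | Content Missing | Working URLs | % Link Rot | Change from Previous Year |
| --- | --- | --- | --- | --- |
| 2008 | 48 | 531 | 8.3% | |
| 2009 | 83 | 496 | 14.3% | 6.0% |
| 2010 | 160 | 419 | 27.6% | 13.3% |
| 2011 | 176 | 403 | 30.4% | 2.8% |
| 2012 | 218 | 361 | 37.7% | 7.3% |
| 2013 | 256 | 323 | 44.2% | 6.5% |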





Table 2: Link Rot Sample by Domain
(Original Sample Data)




| Top-Level Domain | Total in Sample | Link Rot Frequency 2008 | Link Rot Frequency 2009 | Link Rot Frequency 2010 | Link Rot Frequency 2011 | Link Rot Frequency 2012 | Link Rot Frequency 2013 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| .state.__.us | 240 | 26 (10.8%) | 38 (15.8%) | 77 (32.1%) | 73 (30.4%) | 81 (33.8%) | 98 (40.8%) |
| .org | 184 | 7 (3.8%) | 21 (11.4%) | 41 (22.3%) | 57 (31%) | 80 (43.5%) | 83 (45.1%) |
| .gov | 100 | 10 (10%) | 13 (13%) | 25 (25%) | 31 (31%) | 36 (36%) | 51 (51.0%) |
| .edu | 17 | 2 (11.8%) | 6 (35.3%) | 6 (35.3%) | 3 (17.6%) | 7 (41.2%) | 9 (52.9%) |
| .com | 13 | 2 (15.4%) | 2 (15.4%) | 4 (30.8%) | 4 (30.8%) | 5 (38.5%) | 6 (46.2%) |
| .net | 11 | 0 | 1 (9.1%) | 3 (27.3%) | 3 (27.3%) | 4 (36.4%) | 5 (45.5%) |
| .mil | 3 | 0 | 1 (33.3%) | 1 (33.3%) | 1 (33.3%) | 1 (33.3%) | 1 (33.3%) |
| .us | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| .info | 2 | 1 (50%) | 1 (50%) | 1 (50%) | 2 (100%) | 2 (100%) | 2 (100%) |
| .uk | 2 | 0 | 0 | 1 (50%) | 1 (50%) | 1 (50%) | 1 (50%) |
| .au | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| .ca | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| .int | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| [IP address] | 1 | 0 | 0 | 1 (100%) | 1 (100%) | 1 (100%) | 0 |
| TOTAL | 579 | 48 (8.3%) | 83 (14.3%) | 160 (27.6%) | 176 (30.4%) | 218 (37.7%) | 256 (44.2%) |







Data Set 2: 2013 Data Sample

Link Rot in 2013: A Snapshot

For the present analysis, a new, separate sample of URLs was generated. In 2013, the collection included 8,627 digital items and 4,008 titles. To ensure statistically relevant results at a 95 percent confidence level and confidence interval of +/- 3, a random sample of 842 titles was selected for the 2013 study.*

In the 2013 data sample, 36.7 percent of the links, or 309, no longer worked. This represents a 10.8 percentage-point increase in link rot from the previous year. It is also the largest year-over-year change since the first data were analyzed in 2008. Previously, the largest change occurred between 2009 and 2010, when there was an 8.7 percentage-point increase in link rot. See Figure 2 and Table 3.

Almost 85 percent of the URLs in the 2013 sample were state-government (24.3%), organization (37.2%), and government (23.2%) URLs. This year saw a substantial increase in the number of government URLs (.gov) that no longer worked. In the 2012 data set, 23.9 percent of the .gov URLs no longer worked; this year, 43.9 percent no longer worked. The same trend was observed in the original data sample, where link rot among .gov URLs reached 51.0 percent. State-government and organization links saw more modest increases of 9.8 and 7.1 percentage points, respectively. See Table 5.

Overall, six years of systematically checking links have demonstrated that documents posted on websites disappear at an increasing rate over time. See Table 4. The value of harvesting these materials before they are no longer available at their original URLs is demonstrated by the high use of these materials. During March 2013, when the 2013 sample set was taken, over 84,000 items were retrieved. In 2012, 1.5 million items were viewed. It is likely that the value of this project and similar ones will become even more significant in future years.





Figure 2








Table 3: Percentage Link Rot Increase

(2013 Sample Data)

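| Year | Not Working | Working | Total | % Link Rot | Change from Previous Year |
| --- | --- | --- | --- | --- | --- |
| 2008 | 48 | 531 | 579 | 8.3% | |
| 2009 | 93 | 587 | 680 | 13.7% | 5.4% |
| 2010 | 165 | 571 | 736 | 22.4% | 8.7% |
| 2011 | 157 | 646 | 803 | 19.6% | -2.9% |
| 2012 | 215 | 612 | 827 | 25.9% | 6.4% |
| 2013 | 309 | 533 | 842 | 36.7% | 10.8% |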






Table 4: Link Rot by Year of Capture

(2013 Sample Data)

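| Year of Capture | Total | Link Rot | Working URLs | % Link Rot |
| --- | --- | --- | --- | --- |
| 2007 | 228 | 102 | 126 | 44.7% |
| 2008 | 149 | 68 | 81 | 45.6% |
| 2009 | 132 | 59 | 73 | 44.7% |
| 2010 | 183 | 47 | 136 | 25.7% |
| 2011 | 91 | 23 | 68 | 25.3% |
| 2012 | 37 | 9 | 28 | 24.3% |
| 2013 | 6 | 1 | 5 | 16.7% |
| Scanned | 16 | n/a | n/a | n/a |
| TOTAL | 842 | 309 | 517 | |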

 





Table 5: Link Rot by Domain
(2013 Sample)



 

| Top-Level Domain | 2008 Sample Total | 2008 Sample Link Rot | 2009 Sample Total | 2009 Sample Link Rot | 2010 Sample Total | 2010 Sample Link Rot | 2011 Sample Total | 2011 Sample Link Rot | 2012 Sample Total | 2012 Sample Link Rot | 2013 Sample Total | 2013 Sample Link Rot |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| .state.__.us | 240 | 26 (10.8%) | 235 | 37 (15.7%) | 256 | 78 (30.5%) | 224 | 57 (25.4%) | 215 | 70 (32.6%) | 205 | 87 (42.4%) |
| .org | 184 | 7 (3.8%) | 212 | 29 (13.7%) | 224 | 45 (20.1%) | 290 | 45 (15.5%) | 315 | 81 (25.7%) | 314 | 103 (32.8%) |
| .gov | 100 | 10 (10%) | 155 | 17 (11%) | 159 | 25 (15.7%) | 167 | 32 (19.2%) | 188 | 45 (23.9%) | 196 | 86 (43.9%) |
| .edu | 17 | 2 (11.8%) | 23 | 6 (26%) | 28 | 2 (7.1%) | 40 | 7 (17.5%) | 46 | 6 (13%) | 36 | 15 (41.7%) |
| .com | 13 | 2 (15.4%) | 22 | 1 (4.5%) | 28 | 5 (17.9%) | 36 | 6 (16.7%) | 26 | 5 (19.2%) | 38 | 14 (36.8%) |
| .net | 11 | 0 | 12 | 0 | 22 | 3 (13.6%) | 10 | 3 (30%) | 11 | 1 (9.1%) | 23 | 3 (13.0%) |
| .mil | 3 | 0 | 4 | 0 | 5 | 2 (40%) | 2 | 1 (50%) | 2 | 1 (50%) | 1 | 0 |
| .us | 3 | 0 | 5 | 0 | 2 | 0 | 15 | 2 (13.3%) | 5 | 1 (20%) | -- | -- |
| .info | 2 | 1 (50%) | 3 | 2 (66.7%) | 2 | 0 | 5 | 2 (40%) | 3 | 2 (66.7%) | 2 | 0 |
| .uk | 2 | 0 | 3 | 1 (33.3%) | 3 | 2 (66.7%) | 2 | 1 (50%) | 5 | 1 (20%) | 1 | 0 |
| .au | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | -- | -- | -- | -- |
| .af | -- | -- | -- | -- | -- | -- | -- | -- | 1 | 1 (100%) | -- | -- |
| .at | -- | -- | -- | -- | -- | -- | 2 | 0 | 1 | 0 | -- | -- |
| .be | -- | -- | -- | -- | -- | -- | -- | -- | 1 | 0 | -- | -- |
| .ca | 1 | 0 | 1 | 0 | -- | -- | 2 | 0 | 1 | 0 | 2 | 1 (50.0%) |
| .ch | -- | -- | -- | -- | -- | -- | 1 | 0 | -- | -- | 1 | 0 |
| .int | 1 | 0 | 2 | 0 | 2 | 0 | 3 | 0 | 4 | 0 | 3 | 0 |
| .eu | -- | -- | 1 | 0 | 2 | 1 (50%) | 3 | 1 (33.3%) | 3 | 0 | 4 | 1 (25.0%) |
| [IP address] | 1 | 0 | 1 | 0 | 2 | 2 (100%) | -- | -- | -- | -- | -- | -- |
| Scanned | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 16 | -- |
| TOTAL | 579 | 48 (8.3%) | 680 | 93 (13.7%) | 736 | 165 (22.4%) | 803 | 157 (19.6%) | 827 | 214 (25.9%) | 842 | 309 (36.7%) |






Contact

Mary Jo Lazun
Head of Collection Management
Maryland State Law Library
mjlazun@mdcourts.gov



*Sixteen of the titles selected for the sample were directly deposited by the content creators and therefore had no original web addresses; as the Chesapeake Group has increased contact with content producers over the years, a small fraction of the content archived is now deposited by the creators for archiving, rather than posted to the web for capture.