
 

"Link Rot" and Legal Resources on the Web: A 2014 Analysis by the Chesapeake Digital Preservation Group




Contents



Introduction

51 Percent of URLs from Original Data Set No Longer Work

Link Rot and Top-Level Domains, 2008-2014

Data Set 2: 2007 - 2014 Data

Link Rot in 2014

31.03 Percent of URLs No Longer Work

Link Rot and Top-Level Domains in 2014

Contact



Introduction


The Chesapeake Digital Preservation Group has completed its seventh annual investigation of link rot among the original URLs for online law- and policy-related materials archived through the group’s efforts.


The Chesapeake Digital Preservation Group is a collaborative digital preservation program for legal materials, reports, and documents posted to the web. The group is comprised of four member libraries—two academic law libraries, the Georgetown Law and Harvard Law School Libraries, and the State Law Libraries of Maryland and Virginia—and is part of the Legal Information Archive.


Access to web-published content can be lost as websites are routinely updated, reorganized, or deleted over time. In the eight years since the program began, the Chesapeake Group has built a digital archive collection comprising more than 9,600 digital items and over 4,200 titles, almost all originally posted to the web but captured and preserved within the group’s digital archive.


Every year, the Chesapeake Group investigates whether or not the documents in the archive can still be found at the original web addresses from which they were captured. The group analyzes two statistically significant samples of web addresses, or URLs, pulled from the archive’s records.


The first sample includes 579 original URLs for content captured from 2007-2008. This sample is revisited every year to document link rot and explore how it changes over time.


The second data set represents the full content of the archive at the time the study is conducted and provides an up-to-date snapshot of link rot among the original URLs for all the content currently in the archive. This data set includes all original URLs for materials captured from 2007-2014.





51 Percent of URLs from Original Data Set No Longer Work


In 2014, 292 out of 579 URLs in the sample no longer provide access to the content that was originally selected, captured, and archived by the Chesapeake Group. In other words, link rot has increased to 51.12 percent within seven years.


In 2008, the sample was analyzed for the first time as part of an evaluation of the archiving program, and link rot was found to be present in 48, or 8.3 percent, of the 579 URLs comprising the sample. At the time, a total of 1,266 web-based titles had been captured and archived. A random sample of 579 titles from the archive was generated for the analysis, ensuring results at a 95 percent confidence level and confidence interval of +/- 3.
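The report does not say how the sample size of 579 was derived. As a rough sketch, a standard sample-size formula (Cochran's formula with a finite population correction) reproduces it under some assumptions that are not stated in the report: a worst-case proportion of 0.5, a z-score of 1.96 for the 95 percent confidence level, and rounding to the nearest whole title. The function name `sample_size` is illustrative only.

```python
def sample_size(population: int, margin: float, z: float = 1.96, p: float = 0.5) -> int:
    """Cochran's sample size with a finite population correction.

    Assumptions (not taken from the report): p = 0.5 is the most
    conservative proportion, and z = 1.96 corresponds to 95 percent confidence.
    """
    # Sample size for an infinitely large population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Shrink it for a finite population of known size.
    n = n0 / (1 + (n0 - 1) / population)
    # Rounded to the nearest whole title; rounding up would be more conservative.
    return round(n)


# 1,266 archived titles, confidence interval of +/- 3 percent
print(sample_size(1266, 0.03))  # 579
```

The unrounded result is about 579.3, so rounding up instead of rounding to the nearest whole number would give 580.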


One year later, in 2009, the sample was analyzed a second time. Link rot was found to be present in 83 out of the original sample of 579 URLs. Within two years of capture, 14.3 percent of the archived titles had disappeared from their original URLs.


By the third year, in 2010, the prevalence of link rot had increased to 160 out of 579 URLs, or 27.6 percent. Link rot continued to increase in 2011, but by a smaller margin, reaching 30.4 percent by the fourth year. The 2012 data showed an increase of 7.3 percentage points compared to 2011. In 2013 there was a 6.5 percentage point increase from the previous year. This year the increase held steady at 6.1 percentage points. Increases in link rot from 2008 through 2014 are illustrated in Figure 1 and Table 1.




Link Rot, March 2008




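| Year | Content Missing | Working URLs | % Link Rot | Change from previous year |
|------|-----------------|--------------|------------|---------------------------|
| 2008 | 48 | 531 | 8.30% | 0% |
| 2009 | 83 | 496 | 14.30% | 6.00% |
| 2010 | 160 | 419 | 27.60% | 13.30% |
| 2011 | 176 | 403 | 30.40% | 2.80% |
| 2012 | 218 | 361 | 37.70% | 7.30% |
| 2013 | 256 | 323 | 44.20% | 6.50% |
| 2014 | 292 | 287 | 51.12% | 6.92% |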
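The percentage and year-over-year columns can be recomputed from the counts of missing content. The sketch below does this for the 2008 through 2013 rows. It assumes each percentage is rounded to one decimal place and each change is the difference between two rounded percentages, which is what the published figures appear to follow; the function names are illustrative only.

```python
SAMPLE_SIZE = 579

# Count of URLs with missing content, by year of analysis (2008-2013 rows).
missing_by_year = {2008: 48, 2009: 83, 2010: 160, 2011: 176, 2012: 218, 2013: 256}


def link_rot_percent(missing: int, total: int) -> float:
    """Share of URLs that no longer lead to the content, to one decimal place."""
    return round(100 * missing / total, 1)


def yearly_changes(counts: dict, total: int) -> list:
    """Return (year, percent link rot, change in percentage points) rows."""
    rows = []
    previous = None
    for year in sorted(counts):
        percent = link_rot_percent(counts[year], total)
        # The first year is the baseline, so its change is zero.
        change = 0.0 if previous is None else round(percent - previous, 1)
        rows.append((year, percent, change))
        previous = percent
    return rows


for year, percent, change in yearly_changes(missing_by_year, SAMPLE_SIZE):
    print(year, percent, change)
```

The change column is a difference in percentage points, not a relative increase: going from 8.3 percent to 14.3 percent is a 6.0 point change.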








Link Rot and Top-Level Domains, 2008-2014


More than 90 percent of the top-level domains in the original sample are state-government (state.[state code].us), organization (.org), and government (.gov) URLs, representing approximately 41 percent, 32 percent, and 17 percent of the sample, respectively. Other top-level domains, comprising approximately 7 percent of the sample, combined, include .edu, .com, and .net, which respectively represent 2.9, 2.2, and 1.9 percent of the sample. Less than 3 percent of the sample consists of .mil, .us, .info, .uk, .au, .ca, and .int top-level domains. The sample also includes one IP address.
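Grouping URLs into these categories can be sketched as follows. This is an illustration rather than the group's actual procedure, which the report does not describe. It assumes that any host ending in state.[two-letter code].us counts as state government, that bare IP addresses form their own category, and that every other URL is filed under its last host label. The example URLs are hypothetical.

```python
import ipaddress
import re
from collections import Counter
from urllib.parse import urlsplit

# Hosts such as www.courts.state.md.us: "state", a two-letter code, then "us".
STATE_PATTERN = re.compile(r"(^|\.)state\.[a-z]{2}\.us$")


def domain_category(url: str) -> str:
    """Map a URL to a category label like the ones used in the domain tables."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    if not host:
        return "[no host]"
    try:
        ipaddress.ip_address(host)
        return "[IP address]"
    except ValueError:
        pass  # Not an IP address, so treat it as a domain name.
    if STATE_PATTERN.search(host):
        return ".state.__.us"
    return "." + host.rsplit(".", 1)[-1]


# Hypothetical URLs, for illustration only.
urls = [
    "http://www.example.state.md.us/report.pdf",
    "http://www.example.org/policy/brief.pdf",
    "http://192.0.2.10/doc.pdf",
]
print(Counter(domain_category(u) for u in urls))
```

Checking the state pattern before falling back to the last label keeps state-government hosts separate from other .us addresses, which the tables list as distinct rows.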


In 2014, the content at .org domains showed the highest increase in link rot. More than 56 percent of the materials posted to organization domains disappeared from the original documented web addresses. Link rot on government web pages also increased to 55 percent for .gov domains and to 44 percent for .state.[state code].us domains in 2014.

This is the second time that one of the top three domains had more than 50 percent of its links no longer working. For the first time, a second of the top three domains also passed 50 percent, with .org domains at 56 percent link rot. Education domains also showed an increase in link rot, with almost 65 percent of the original URLs no longer working. For the first year, network domain link rot rose to 51.12 percent.

A list of all top-level domains found in the sample, along with link rot detected in 2008-2014, is available in Table 2.



Table 2: Link Rot Sample by Domain (Original Sample Data)

| Top-Level Domain | Total in Sample | Link Rot Frequency 2008 | Link Rot Frequency 2009 | Link Rot Frequency 2010 | Link Rot Frequency 2011 | Link Rot Frequency 2012 | Link Rot Frequency 2013 | Link Rot Frequency 2014 |
|---|---|---|---|---|---|---|---|---|
| .state.__.us | 240 | 26 (10.8%) | 38 (15.8%) | 77 (32.1%) | 73 (30.4%) | 81 (33.8%) | 98 (40.8%) | 108 (44.4%) |
| .org | 184 | 7 (3.8%) | 21 (11.4%) | 41 (22.3%) | 57 (31%) | 80 (43.5%) | 83 (45.1%) | 104 (56.5%) |
| .gov | 100 | 10 (10%) | 13 (13%) | 25 (25%) | 31 (31%) | 36 (36%) | 51 (51%) | 55 (55%) |
| .edu | 17 | 2 (11.8%) | 6 (35.3%) | 6 (35.3%) | 3 (17.6%) | 7 (41.2%) | 9 (52.9%) | 11 (64.7%) |
| .com | 13 | 2 (15.4%) | 2 (15.4%) | 4 (30.8%) | 4 (30.8%) | 5 (38.5%) | 6 (46.1%) | 8 (61.5%) |
| .net | 11 | 0 | 1 (9.1%) | 3 (27.3%) | 3 (27.3%) | 4 (36.4%) | 5 (45.5%) | 5 (45.5%) |
| .mil | 3 | 0 | 1 (33.3%) | 1 (33.3%) | 1 (33.3%) | 1 (33.3%) | 1 (33.3%) | 1 (33.3%) |
| .us | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| .info | 2 | 1 (50%) | 1 (50%) | 1 (50%) | 2 (100%) | 2 (100%) | 2 (100%) | 2 (100%) |
| .uk | 2 | 0 | 0 | 1 (50%) | 1 (50%) | 1 (50%) | 1 (50%) | 1 (50%) |
| .au | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| .ca | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| .int | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| [IP address] | 1 | 0 | 0 | 1 (100%) | 1 (100%) | 1 (100%) | 1 (100%) | 1 (100%) |
| TOTAL | 579 | 48 (8.3%) | 83 (14.3%) | 160 (27.6%) | 176 (30.4%) | 218 (37.7%) | 256 (44.2%) | 296 (51.12%) |





Data Set 2: 2007 - 2014 Data

Link Rot in 2014

For the present analysis, a list of all URLs was generated. In 2014, the collection included more than 9,600 digital items and over 4,200 titles. Some of the titles were directly deposited by the content creators and therefore had no original web addresses; as the Chesapeake Group has increased contact with content producers over the years, a fraction of the content archived is now deposited by the creators for archiving, rather than posted to the web for capture. After removing duplicate titles and items without web addresses, 6,481 URLs were checked for link rot.
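The report does not describe the tooling used for this step, so the following is only a sketch of how such a list could be prepared and checked automatically. The `original_url` field name, the use of HEAD requests, and the rule that any 2xx response counts as working are all assumptions made for illustration. An automated check of this kind finds only hard failures: a page can respond successfully while the archived document is no longer there, so confirming that content is missing still calls for a manual look.

```python
import http.client
import urllib.request


def unique_urls(records):
    """Drop records without a web address and keep the first copy of each URL.

    'original_url' is a hypothetical field name for the captured address.
    """
    seen = set()
    result = []
    for record in records:
        url = (record.get("original_url") or "").strip()
        if url and url not in seen:
            seen.add(url)
            result.append(url)
    return result


def check_url(url, opener=urllib.request.urlopen, timeout=30):
    """Return 'working' if the server answers with a 2xx status, else 'rot'."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return "working" if 200 <= response.status < 300 else "rot"
    except (OSError, ValueError, http.client.HTTPException):
        # Covers HTTP errors such as 404, DNS failures, timeouts, and bad URLs.
        return "rot"


# Hypothetical records; deposited items have no original web address.
records = [
    {"title": "Report A", "original_url": "http://www.example.org/a.pdf"},
    {"title": "Report A (copy)", "original_url": "http://www.example.org/a.pdf"},
    {"title": "Deposited item", "original_url": None},
]
print(unique_urls(records))  # ['http://www.example.org/a.pdf']
```

Passing the opener as a parameter lets the check be exercised without network access, for example by substituting a stub that returns a fixed status.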

 


31.03 Percent of URLs No Longer Work


In 2014, 2,011 out of 6,481 URLs in the total collection no longer provide access to the content that was originally selected, captured, and archived by the Chesapeake Group. In other words, 31.03 percent of the links no longer worked.



Link Rot and Top-Level Domains in 2014


In the total collection, slightly more than 88 percent of top-level domains are state-government (state.[state code].us), organization (.org), and government (.gov) URLs, representing approximately 25.1 percent, 36.7 percent, and 26.2 percent of the collection, respectively. The content at .gov domains showed the highest percentage of link rot. More than 41 percent of the materials posted to government domains disappeared from the original documented web addresses. Link rot on state-government and organization domains was 35.6 and 22.2 percent, respectively, in 2014.

Other top-level domains, comprising slightly over 11 percent of the collection, include .edu, .com, and .net, which respectively represent 4, 4.2, and 1.4 percent of the collection. Barely over 2 percent of the remaining collection consists of .mil, .us, .info, .uk, .au, .ca, .cn, and .int top-level domains. The collection also includes 7 IP addresses.

Overall, the results of seven years of systematically checking links have demonstrated that documents posted on web sites will disappear at an increasing rate over time. See Table 3, from the previous year’s analysis. Table 4 shows link rot by domain for the original sample, the sample from 2013, and 2014’s data set. The value of harvesting these materials before they are no longer available at their original URLs is demonstrated by the high use of these materials. In 2013-2014, over 500,000 items were viewed. It is likely that the value of this project and similar ones will become even more significant in future years.

 
Table 3: Link Rot by Year of Capture
(2013 Sample Data)

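| Year of Capture | Total | Link Rot | Working URLs | % Link Rot |
|---|---|---|---|---|
| 2007 | 228 | 102 | 126 | 44.7% |
| 2008 | 149 | 68 | 81 | 45.6% |
| 2009 | 132 | 59 | 73 | 44.7% |
| 2010 | 183 | 47 | 136 | 25.7% |
| 2011 | 91 | 23 | 68 | 25.3% |
| 2012 | 37 | 9 | 28 | 24.3% |
| 2013 | 6 | 1 | 5 | 16.7% |
| Scanned | 16 | n/a | n/a | n/a |
| TOTAL | 842 | 309 | 517 | |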

 


Table 4: Link Rot by Domain (2014)

 

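| Top-Level Domain | 2008 Sample: Total in Sample | 2008 Sample: Link Rot | 2013 Sample: Total in Sample | 2013 Sample: Link Rot | 2014 Data Set: Total in Sample | 2014 Data Set: Link Rot |
|---|---|---|---|---|---|---|
| .state.__.us | 240 | 26 (10.8%) | 205 | 87 (42.4%) | 1641 | 549 (33.5%) |
| .org | 184 | 7 (3.8%) | 314 | 103 (32.8%) | 2370 | 527 (22.2%) |
| .gov | 100 | 10 (10%) | 196 | 86 (43.9%) | 1689 | 696 (41.2%) |
| .edu | 17 | 2 (11.8%) | 36 | 15 (41.7%) | 260 | 111 (42.7%) |
| .com | 13 | 2 (15.4%) | 38 | 14 (36.8%) | 274 | 55 (20.1%) |
| .net | 11 | 0 | 23 | 3 (13.0%) | 93 | 15 (16.1%) |
| .mil | 3 | 0 | 1 | 0 | 12 | 3 (25%) |
| .us | 3 | 0 | 0 | 0 | 43 | 30 (69.8%) |
| .info | 2 | 1 (50%) | 2 | 0 | 21 | 8 (38.1%) |
| .uk | 2 | 0 | 1 | 0 | 24 | 5 (20.8%) |
| .au | 1 | 0 | 0 | 0 | 3 | 0 |
| .af | 0 | 0 | 0 | 0 | 2 | 1 (50%) |
| .at | 0 | 0 | 0 | 0 | 4 | 0 |
| .be | 0 | 0 | 0 | 0 | 2 | 0 |
| .ca | 1 | 0 | 2 | 1 (50.0%) | 12 | 1 (8.3%) |
| .ch | 0 | 0 | 1 | 0 | 2 | 0 |
| .cn | 0 | 0 | 0 | 0 | 1 | 0 |
| .int | 1 | 0 | 3 | 0 | 11 | 3 (27.3%) |
| .eu | 0 | 0 | 4 | 1 (25.0%) | 9 | 1 (11.1%) |
| [IP address] | 1 | 0 | 0 | 0 | 8 | 6 (75%) |
| Scanned | 0 | 0 | 16 | 0 | 0 | 0 |
| TOTAL | 579 | 48 (8.3%) | 842 | 309 (36.7%) | 6481 | 2011 (31.03%) |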




Contact


Carolyn Cox, Digital Collections Librarian
Georgetown University Law Library
111 G St., NW
Washington, DC 20001
Phone: 202-662-9167
E-mail



