Archiving websites

Test of archiving software / Internet and web-archiving resources

 
 
This web page presents material related to the publication Archiving websites. General considerations and strategies by Niels Brügger (The Centre for Internet Research, Aarhus 2005).
 

This book treats the micro archiving of websites, i.e. archiving by researchers, students or others without special technical knowledge who, using a standard computer, wish to save a website for further study. The phenomenon is discussed from the standpoint that Internet research must be able to stabilise and save the object of its analysis. However, the Internet is endowed with certain fundamental media-specific dynamics that make stabilisation difficult. Based on an account and discussion of these dynamics (linked as they are to sender, text and recipient) the following double conclusion is reached.
 
 
 
 
Firstly, unlike other well-known media, the Internet does not simply exist in a form suited to being archived, but rather is first formed as an object of study in the archiving, and it is formed differently depending on who does the archiving, when, and for what purpose. Secondly, this means that there is an element of subjective creation in the archived material, so that methodical deliberations are necessary — in other words, the answers to why and how the archived material has been created. These conclusions form the starting point for the last section of the book, which, based on comprehensive tests of archiving software, discusses in depth the elements that can be included in an archiving strategy.
 
 

The book is free of charge, and as long as it remains in print, copies may be obtained by contacting cfi@imv.au.dk. Please include your complete postal address. An electronic version can be downloaded from this site (for the purpose of citation, please note that the printed and electronic versions are identical).
 
  
Please address any comments on the book or the website to nb@imv.au.dk.
 
A test of archiving software has been carried out as a supplement to the book. The test was done by graduate student Bo Hovgaard Thomasen, and its premises and main results are explicated in the text Test of software and strategies for micro-archiving websites.

Note: We do not have the resources to offer technical support or other advice on the use of the tested archiving programmes beyond what can be found in the individual tests.
 


Test of archiving software for micro-archiving websites (2004)

General conclusions

The test of 18 software programmes for micro-archiving websites, covering 4 archiving methods, was carried out from July to December 2004 by graduate student Bo Hovgaard Thomasen, against the background of Niels Brügger's book Archiving websites. General considerations and strategies (2005).

As regards software that archives a 'complete website', we can conclude that the programmes in the test that archive most completely are WebHTTrack 3.33-beta-3 and WinHTTrack 3.33-beta-3. DeepVacuum 1.24, wget 1.9, and WebReaper 9.8 can also be used, but their archiving results have more deficiencies than those of the two first-mentioned programmes. There are considerable differences between the programmes, in archiving speed among other things, but they can all archive websites so that they appear more or less as they do online. The exception is that content elements requiring an online connection for viewing cannot be archived using this method. It was found to be advantageous to limit link levels and domains (internal and external) and to apply filtering, in order to ensure that archiving was limited to the desired web pages.

A further strong point of the five above-mentioned programmes was that they can be used free of charge, and that they are continually being developed and updated. The remaining programmes tested for archiving a complete website cannot be recommended, either because they cannot archive a sufficient number of content elements, so that the archived website does not appear acceptably correct, or because they archive in an operating-system- and programme-specific format, or because their cost is prohibitive compared to their archiving capability, especially considering that the first five programmes are free.

All the programmes for archiving individual web pages in a static state, screenshots, and screen recordings can be used to archive. Some of the programmes are more flexible than others, for instance as regards editing options, but this is then usually reflected in their purchase cost. SnagIt 7.1.2 deserves special mention, as the programme can archive all of the above while being fast and easy to use. However, it is one of the most expensive programmes in the test. One programme, Web2Pic 1.1 ('individual web page in a static state'), is unable to archive all types of web pages.


Test results

Software for archiving complete websites

Adobe Acrobat Professional 6.01

Windows / Mac OS X

5187.50 DKK; 30-day test version can be downloaded for Windows

Adobe Acrobat Professional archives complete web pages based on keying in a URL, after which the website is downloaded and saved. The strength of Adobe Acrobat Professional lies in its archiving format, PDF (Portable Document Format), which can be viewed on all platforms and is probably a relatively future-proof format. The archiving attained with this programme is not perfect: many elements of the archived web pages are often missing, and content elements are often incorrectly positioned in the archived version, so that the experience of the web page is far from the original experience in the browser. Although some pages are archived remarkably correctly, the archived material appears relatively chaotic and incomplete. In addition, setting options are extremely limited, so that the programme either easily includes irrelevant web pages or fails to include a sufficient number of pages. A further disadvantage is that the programme first downloads all the archived material to the computer's memory and only then stores it on the hard disk. It should also be noted that Adobe Acrobat lays out the archived web pages as paper pages, in formats such as A4, A3, etc., so that a web page does not appear as it does in the browser. Finally, we must conclude that the cost of procuring Adobe Acrobat Professional is very high compared to its ability to archive web pages.

see test details

DeepVacuum 1.24

Mac OS X

7.00 USD

DeepVacuum (vers. 1.24) uses the wget 1.9 programme to archive the source codes and other content of web pages, and to convert these elements to a navigable offline version. The programme cannot archive content elements requiring an online connection for viewing. The graphic interface makes using command-line based wget less complicated, but at the same time the interface means reduced flexibility and configuration speed. Furthermore, archiving is relatively slow. The archived websites often appear correct, but the programme does not archive as correctly as, for instance, WebHTTrack and WinHTTrack. DeepVacuum allows several websites to be archived simultaneously.

see test details

Microsoft Internet Explorer 5.2.3

Mac OS X

Free

The Microsoft Internet Explorer 5.2.3 browser can archive up to five levels of a website's hyperstructure. The web page is archived in the 'Web Archive' format (optional file name extension .waff), which is both a platform-specific (Mac OS X) and programme-specific format. Web pages are usually archived correctly. It must be seen as a disadvantage that web pages archived with this programme can only be opened in Internet Explorer 5.2.3, and can only be used with Mac OS X. A further complaint is that archiving takes place at a very low speed and occasionally freezes, after which the process must be started again.

see test details

Microsoft Internet Explorer 6.0

Windows

Free

The Microsoft Internet Explorer 6 browser can archive one level of a website, i.e. one web page. The web page is archived in the 'Web Archive, single file' format (filename extension .mht), which is both a platform-specific (Windows) and programme-specific format. Web pages are usually archived correctly. It must be seen as a disadvantage that web pages archived with this programme can only be opened in Internet Explorer 6, and can only be used with Windows.

see test details

MM3-WebAssistant Private 2005

Java

Free; 'Professional' version with additional options costs 29.95 EUR

MM3-WebAssistant saves web pages visited by the browser in a cache for offline use. The archived material often appears extremely correct. However, a disadvantage of this archiving programme is that the archived web pages can only be viewed on a computer with MM3-WebAssistant installed. The programme is most applicable in cases where the archived material consists of personal (working) copies, rather than serving as documentation or appendices. The programme can be used on all platforms if a Java Virtual Machine is installed (which is the case for most newer operating systems).

see test details

WebHTTrack Website Copier 3.33-beta-3

UNIX / Mac OS X

Free

WebHTTrack is a command-line operated offline browser that can archive websites' source code and remaining content, and convert these elements so that the archived versions are navigable. An advantage of archiving with HTTrack is that the material is archived in the format in which it is written, so that the archived pages appear in the browser, and one can work with the archived material just as with the online version. HTTrack cannot archive material requiring an online connection for viewing (typically chat, polls, test-yourself, streamed elements, most games). On the other hand, the programme is capable of converting web pages remarkably well, so that links usually work internally in the archived version, and web pages usually appear as they did online, with the exception of the online elements mentioned. Several archiving processes can be carried out at the same time with this programme, and archiving can be automated via scripts (see an example of an AppleScript that starts 25 simultaneous archiving processes).
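The idea of starting several archiving processes from a script can be sketched as follows. This is not the AppleScript used in the test but a minimal Python illustration. It assumes that the command-line programme is installed under the name httrack, and that the options -O (output directory) and -rN (mirror depth) behave as described in the HTTrack documentation. The function names, the output folder layout, and the default depth and degree of parallelism are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.parse import urlparse


def build_httrack_command(url, output_root, depth=3):
    """Return the argument list for mirroring one site with httrack."""
    # Use the host name as the folder name (a hypothetical layout).
    host = urlparse(url).netloc or "site"
    target = Path(output_root) / host
    # -O sets the output directory; -rN limits the mirror to N link levels.
    return ["httrack", url, "-O", str(target), f"-r{depth}"]


def archive_all(urls, output_root, max_parallel=5, runner=subprocess.run):
    """Start one httrack process per URL, at most max_parallel at a time."""
    commands = [build_httrack_command(url, output_root) for url in urls]
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # Each worker thread waits for one external process to finish.
        return list(pool.map(runner, commands))
```

Calling archive_all(["http://www.example.org/"], "archive") would start the mirror; if httrack is not installed, subprocess.run raises FileNotFoundError. The runner parameter exists only so that the process call can be replaced when trying out the script.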

see test details

WebReaper 9.8

Windows

Free

WebReaper archives websites' source code and other elements, and converts these files so that the archived elements can be used offline. Elements requiring an online connection for viewing cannot be archived with this programme. WebReaper archives rapidly, but individual pages are often missing or appear defective. A further complaint against the programme is that the material to be archived cannot be limited in certain ways; among other things, external web pages (domain boundaries) can only be excluded from archiving with difficulty. In spite of this, it should be noted that the programme is capable of archiving many web pages correctly.

see test details

wget 1.9 (+ wGetGUI 1.05)

UNIX / Windows / Mac OS X

Free

wget archives a copy of web pages' source code and other elements, and converts the pages' links so that they can be used in an offline version. Elements requiring an online connection for viewing cannot be archived with wget. This test covers wget for MS-DOS and the graphic interface wGetGUI (only for Windows). The programme archives relatively correctly; many pages are correctly archived. On the other hand, archiving speed is extremely low, which is a serious deficiency in the programme. Several archiving processes can be carried out at the same time, and archiving can be automated with wget (using scripts or batch files).
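As an illustration of automating wget from a script, the following Python sketch builds a wget call that limits link levels and domains, in line with the general conclusions above. It is not the configuration used in the test. The options shown are standard GNU wget options, but the function names and default values are hypothetical.

```python
import subprocess


def build_wget_command(url, max_depth=2, domains=None, wait_seconds=1):
    """Return a wget argument list for a depth- and domain-limited copy."""
    command = [
        "wget",
        "--recursive",             # follow links from the start page
        f"--level={max_depth}",    # limit the number of link levels
        "--page-requisites",       # also fetch images, stylesheets, etc.
        "--convert-links",         # rewrite links for offline viewing
        "--no-parent",             # never ascend above the start directory
        f"--wait={wait_seconds}",  # pause between requests
    ]
    if domains:
        # Restrict recursion to the listed domains (comma-separated).
        command.append("--domains=" + ",".join(domains))
    command.append(url)
    return command


def archive(url, **options):
    """Run wget for one URL and return its exit status (0 means success)."""
    return subprocess.run(build_wget_command(url, **options)).returncode
```

For example, archive("http://www.example.org/", max_depth=3, domains=["example.org"]) would copy three link levels without leaving the example.org domain. Several such calls can be placed in a loop or a batch file.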

see test details

WinHTTrack Website Copier 3.33-beta-3

Windows

Free

WinHTTrack is an offline browser that can archive websites' source codes and remaining content, as well as converting these elements so that the archived versions are navigable. An advantage of archiving with WinHTTrack is that the archived material is archived in the format in which it is written, so that the archived pages appear in the browser. WinHTTrack cannot archive material requiring an online connection for viewing (typically chat, polls, test-yourself, streamed elements, most games). On the other hand, the programme is capable of converting web pages remarkably well, so that links usually work internally in the archived version, and web pages usually appear as they did online, with the exception of the online elements mentioned. It is possible to carry out several archiving processes at the same time with this programme. A command line version for MS-DOS is included in the WinHTTrack software package, which makes automation of the archiving process possible by the use of batch scripts.

see test details

Software for archiving individual web pages in a static state

The function Save As PDF...

Mac OS X

Integral in Mac OS X

With the function 'Save As PDF', Mac OS X makes it possible to print a web page to a PDF file instead of sending it to the printer. It is advantageous to use the function to archive the visual parts of a web page, which then appear as a static snapshot of the web page. The method cannot be used to archive dynamic elements. The (static) visual parts of the archived web page appear remarkably correct with this archiving function, and the only disadvantage is that the web page is laid out in a printable format (A4, A3, etc.).

see test details

Paparazzi! 0.1.8

Mac OS X

Free

Paparazzi! is a programme that can make a screenshot of a single web page. One advantage of using this programme is that the page visually appears exactly as experienced in the browser, with the exception of dynamic elements, which are not archived by the programme. Another strong point of Paparazzi! is that it is designed for the sole purpose of archiving individual web pages. This solves one of the problems with screenshot programmes, which is that they typically only allow for shots of windows or areas of the screen, not the whole web page. A disadvantage of the programme, and of screenshots/screen recording in general, is that a person must be present during the entire archiving process, as this is done manually, one web page at a time.

see test details

PrimoPDF 1.0

Windows

Free

PrimoPDF installs a printer that permits printing a web page as a PDF file. The programme is suitable for use in archiving the visual parts of a web page as static snapshots. Dynamic elements are not archived in this method. The (static) visual parts of the archived web page appear remarkably correct when archived using PrimoPDF, although a functional disadvantage of the programme is that it is often necessary to manually set the scale to be used in archiving; otherwise the full width of the web page is not included in the archiving. It is a disadvantage that the web page is made up in printable format (A4, A3, etc.).

see test details

SnagIt 7.1.2

Windows

39.95 EUR

SnagIt can make a screenshot of an individual web page so that the archived version visually appears the same as when seen in the browser, except that the archived web page is static. A disadvantage of the programme, and of screenshots/screen recording in general, is that a person must be present during the entire archiving process, as this takes place manually, one web page at a time. A further disadvantage is the absence of sound, video and other dynamics in the archived material. One great advantage of the programme is that it can be integrated with the Internet Explorer 6.0 browser, making archiving of a web page very simple.

see test details

Web2Pic 1.1

Windows

Free

Web2Pic can make a screenshot of an individual web page. The archived material appears as a static snapshot of the web page, and dynamics are thus not archived. The programme is not always able to archive a web page, which then appears severely incomplete (in this test this was the case with, among others, www.dr.dk/skum and tv2.dk). It should be noted, however, that the programme archives a great number of web pages correctly. It is also an advantage that Web2Pic is free.

see test details

Webkit2png 0.4

Mac OS X

Free

Webkit2png can make a screenshot of an individual web page. An advantage in using this programme to archive web pages is that the page visually appears 100% as experienced in the browser. However, the programme does not archive dynamic elements. A strong point in Webkit2png is that the programme is operated from the command line, which enables automatic archiving with the aid of scripts. See an example of a simple UNIX-script for archiving 17 websites or the script photourl.sh that archives the URLs specified in a text file using the webkit2png programme.
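The principle of archiving the URLs listed in a text file can be sketched as follows. This is not the photourl.sh script itself but a minimal Python illustration. It assumes that the programme can be called as webkit2png followed by a URL, and that it writes its image files to the current working directory. The function names and the file format (one URL per line, lines starting with # ignored) are hypothetical.

```python
import subprocess
from pathlib import Path


def read_urls(list_file):
    """Return the URLs in list_file, skipping blank and comment lines."""
    lines = Path(list_file).read_text(encoding="utf-8").splitlines()
    stripped = (line.strip() for line in lines)
    return [line for line in stripped if line and not line.startswith("#")]


def snapshot_all(list_file, output_dir, runner=subprocess.run):
    """Call webkit2png once per URL and return the URLs that failed."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for url in read_urls(list_file):
        # Assumption: webkit2png saves its images in the working directory,
        # so each process is started inside output_dir.
        result = runner(["webkit2png", url], cwd=out)
        if getattr(result, "returncode", 0) != 0:
            failed.append(url)
    return failed
```

With a file urls.txt containing one address per line, snapshot_all("urls.txt", "screenshots") would process the addresses one after another and return those for which the programme reported an error.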

see test details

Software for archiving screenshots

The Print Screen utility

Windows

Integral in Windows

Windows has a built-in utility that can make a screenshot of the screen image when the Print Screen hotkey is activated. The utility archives satisfactorily, but is inflexible: it is only possible to capture the entire screen image, so the screenshot cannot be limited to regions or windows. It is also a disadvantage that two work processes are necessary before the screenshot is archived on the hard disk (first the screenshot, then archiving with the aid of a photo-editing programme).

see test details

Grab 1.2

Mac OS X

Included with Mac OS X

Grab (Dk.: 'Skærmbillede') can photograph the entire screen image, regions or windows. It is also possible to photograph with a time delay. The utility archives satisfactorily, although it is not possible to choose the image format to archive in; archiving is done in the uncompressed TIFF format.

see test details

SnagIt 7.1.2

Windows

39.95 EUR

SnagIt can make a screenshot of the full screen image, objects on the screen (such as windows), or areas of the screen image. The programme offers a high degree of flexibility, with editing options and a choice of archiving format as well as defining keyboard shortcuts for various archiving methods. The archived web pages (of which parts are visible on the screen at the time of archiving) appear correctly. The programme also includes an integrated image browser in which the archived screen shots can be reviewed.

see test details

Snapz Pro X 2.0

Mac OS X

69.00 USD

Snapz Pro X can make a screenshot of the entire screen image, objects on the screen such as windows, or areas of the screen image. The programme offers a high degree of flexibility, among other things the option of choosing archiving format, scale, frame, colour palette, and a preview of the screenshot before archiving. The archived web pages (of which parts are visible on the screen during archiving) are viewed correctly.

see test details

Software for archiving screen recordings

SnagIt 7.1.2

Windows

39.95 EUR

SnagIt can be used to record what is happening in an area of the screen image, an object on the screen, such as a window, or the entire screen image, as well as recording any sound occurring while recording is taking place. The programme offers a high degree of flexibility, with among other things, the option of choosing archiving format and quality as well as defining keyboard shortcuts for various archiving methods. The archived web pages and content elements appear correctly. One of the programme's strong points is that it compresses the archived material very rapidly.

see test details

Snapz Pro X 2.0

Mac OS X

69.00 USD

Snapz Pro X can be used to record what is taking place in an area of the screen image, an object on the screen such as a window, or the entire screen image, as well as recording any sound occurring while recording is taking place. The programme offers a high degree of flexibility, among other things the option of choosing archiving format and quality. The archived web pages or content elements appear correctly; all types of content elements are correctly archived. A disadvantage of the programme is that it stores and compresses the recorded content elements or web pages very slowly; in one case in this test, it took approx. 45 min. to compress a 15 min. recording (30 fps, 22500 KHz mono sound). New recordings cannot be made during the compression and storage process.

see test details

 

Internet and web-archiving resources

Micro-archiving websites

Complete websites

Adobe Acrobat Professional 6.01
DeepVacuum 1.24
Microsoft Internet Explorer 6.0
Microsoft Internet Explorer 5.2.3
MM3-WebAssistant Professional 2005
WebHTTrack Website Copier 3.33
WebReaper 9.8
wget 1.9
WinHTTrack Website Copier 3.33

Individual web pages in a static state

The function 'Save As PDF...'
Paparazzi! 0.1.8
PrimoPDF 1.0
SnagIt 7.1.2
Web2Pic 1.1

Screenshots

The 'Print Screen' utility
Grab 1.2
SnagIt 7.1.2
Snapz Pro X 2.0

Screen recording

SnagIt 7.1.2
Snapz Pro X 2.0


Where to look for new software for internet archiving

Search engines

Download.com
Google.com
Macupdate.com
Snapfiles.com
Tucows.com
Versiontracker.com

Search keys

offline browser
offline browsing
screen film
save website (/webpage)
web archiving
copy website (/webpage)
copier website (/webpage)
web snapshot
web screendump

 

Internet Archives

The Internet Archive
Archive-It (part of The Internet Archive, launched November 2005)
Minerva, Mapping the Internet Electronic Resources Virtual Archive (USA)
Our Digital Island, access to Tasmanian Web sites
Pandora, Australia's Web Archive
UKWAC, UK Web Archiving Consortium

Other relevant websites

Web Archiving - Bibliography, The AOLA Austria On-Line Archive Project
IIPC, International Internet Preservation Consortium
PADI, Preserving Access to Digital Information
nestor, Kompetenznetzwerk Langzeitarchivierung
netarchive.dk

Other tests of archiving software

Evaluation on Open Source Spidering Technology, by David Kellogg, 2004 (PDF-document)

Overview of archiving software

List of archiving software projects, The AOLA Austria On-Line Archive Project