Type | Name | Platform | Version | Price | URL | Remarks |
Complete websites | WebHTTrack Website Copier | UNIX/Mac OS X | 3.33-beta-3 | Free | | An X Window version for Mac OS X of the open-source programme HTTrack Website Copier. This version of the programme is operated from the command line. A Windows version (with a graphical interface) can be downloaded from the programme website. (See the test of the Windows version, WinHTTrack.) |
Conclusion |
WebHTTrack is a command-line offline browser that can archive a website's source code and other content, and convert these elements so that the archived version is navigable. An advantage of archiving with HTTrack is that the material is stored in the format in which it was written, so the archived pages open in a browser and can be worked with just like the online version. HTTrack cannot archive material that requires an online connection for viewing (typically chat, polls, test-yourself features, streamed elements and most games). On the other hand, the programme converts web pages remarkably well: links usually work internally in the archived version, and pages usually appear as they did online, apart from the online elements mentioned. Several archiving processes can run at the same time, and archiving can be automated via scripts (see an example of an AppleScript that starts 25 simultaneous archiving processes). |
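The AppleScript mentioned above is not reproduced here. As a rough substitute, the following shell sketch starts one archiving process per URL in the background and waits for all of them to finish. It is not the script used in the test: the function names `site_dir` and `archive_all`, the variables `HTTRACK` and `OUT_ROOT`, and the directory naming scheme are assumptions made for illustration. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Dry run by default: each job only prints the command it would execute.
# Set HTTRACK=/usr/local/bin/httrack to start real archiving processes.
HTTRACK=${HTTRACK:-"echo /usr/local/bin/httrack"}
OUT_ROOT=${OUT_ROOT:-"$HOME/websites"}   # hypothetical output root

# Derive a directory name from a URL,
# e.g. http://www.dr.dk/skum becomes www.dr.dk_skum
site_dir() {
    printf '%s\n' "$1" | sed -e 's|^[a-z]*://||' -e 's|/*$||' -e 's|/|_|g'
}

# Start one archiving process per URL in the background, then wait for all.
archive_all() {
    for url in "$@"; do
        # $HTTRACK is deliberately unquoted so "echo <path>" splits into words.
        $HTTRACK "$url" -O "$OUT_ROOT/$(site_dir "$url")" -w &
    done
    wait
}

archive_all "http://www.dr.dk/kroniken" "http://www.dr.dk/nyheder" "http://www.dr.dk/skum"
```

The number of URLs passed to `archive_all` determines how many processes run at once, so it is worth keeping that number modest out of consideration for the website being archived.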
Recommended settings |
A number of criteria must be fulfilled before WebHTTrack can be used on Mac OS X. The following parameters are recommended as a minimum for archiving:
To indicate where the archived material is to be stored, use the parameter '-O' or '--path', followed by the desired path (such as '/usr/local/bin/httrack "http://www.dr.dk" -O ~/websites/dr.dk' plus any further parameters). When archiving, we usually want to copy the website to a local computer. To achieve this, the parameter '-w' or '--mirror' is used (the programme automatically assumes this parameter if neither '-i' nor '-W' is specified). Alternatively, it is possible to use the parameter '-W' or '--mirror-wizard', with which WebHTTrack presents a dialogue during archiving whenever a link leads to a new domain.
The most efficient way of limiting the archiving to the desired web pages is to specify which web addresses the programme is allowed and not allowed to archive. This is done by adding URLs prefixed with either a plus or a minus (e.g. '/usr/local/bin/httrack www.dr.dk/nyheder/ -O ~/websites/dr.dk "-http://www.dr.dk/*" "+http://www.dr.dk/nyheder/*"'; note the use of wildcards). HTTrack gives priority to filters that appear later on the command line, so the general exclusion is listed before the inclusion that overrides it.
Additionally, the archiving of huge websites can be delimited by archiving (typically) a maximum of 4 levels down from the chosen URL (i.e. following all links on the first page and then three further levels of underlying pages, including the links on those pages). This is done with the parameter '-r4' or '--depth=4'. Links that lead outside the archived domain can, if necessary, be followed to a limited number of levels, using the parameter '-%e1' or '--ext-depth=1' for one level. These limitations are necessary when only part of a complex website is to be archived: if they are omitted, the archive of a complex website easily becomes impossibly extensive, comprising most of the website or, worse, the entire Internet. The appropriate limitation will of course depend on the purpose of a given archiving process.
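The scoping parameters described above can be combined as in the following shell sketch, which limits an archive to one section of a website. The function name `archive_section`, the `HTTRACK` variable and the default values are assumptions for illustration. The sketch also assumes an `http://` start URL and that HTTrack lets later filters override earlier ones, which is why the exclusion of the whole host comes before the inclusion of the section. By default it only prints the command.

```shell
#!/bin/sh
# Dry run by default (prints the command).
# Set HTTRACK=/usr/local/bin/httrack to run it for real.
HTTRACK=${HTTRACK:-"echo httrack"}

# usage: archive_section START_URL OUTPUT_DIR [DEPTH] [EXT_DEPTH]
archive_section() {
    url=$1
    out=$2
    depth=${3:-4}        # -r4: follow links at most 4 levels down
    ext_depth=${4:-1}    # -%e1: follow links to other domains one level

    # Host part of the start URL, e.g. www.dr.dk
    host=$(printf '%s\n' "$url" | sed -e 's|^[a-z]*://||' -e 's|/.*$||')

    # Assumption: later filters take priority, so exclude the whole host
    # first and then re-allow the chosen section.
    $HTTRACK "$url" -O "$out" -w "-r$depth" "-%e$ext_depth" \
        "-http://$host/*" "+${url%/}/*"
}

archive_section "http://www.dr.dk/nyheder/" "$HOME/websites/dr.dk"
```

The filters are quoted so that the shell passes the wildcards to the programme unchanged instead of trying to expand them against local file names.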
Flow control consists of a number of further parameters. The timeout should be set to 5 seconds ('-T5' or '--timeout=5') and the number of attempts to 3 ('-R3' or '--retries=3'), because WebHTTrack otherwise tends to 'freeze' at elements that no longer exist or that cannot be archived for some other reason. Flow control is also used so as not to overload the website being archived; please keep the number of sockets at a reasonable level. Another reason for WebHTTrack occasionally freezing is that too many URLs are being downloaded simultaneously. For this reason the programme should be limited to a maximum of 4 concurrent 'threads' with the parameter '-c4' or '--sockets=4'. If the archiving process freezes, you can attempt to restart it using '-i' or '--continue' instead of '-w'/'--mirror' or '-W'/'--mirror-wizard'.
The last recommended parameters are '-n' or '--near', which tells WebHTTrack to include all content elements used for viewing a web page, and '-j', which helps archive pages that use JavaScript. There are many other settings, but those mentioned above are the most necessary.
To archive the website of the Centre for Internet Research using the recommended parameters, use one of the following command strings: './httrack "http://cfi.imv.au.dk" -O "/netsteder/cfi.imv.au.dk" -w -r4 -%e1 -T5 -R3 -c4 -n -j' or './httrack "http://cfi.imv.au.dk" --path "/netsteder/cfi.imv.au.dk" --mirror --depth=4 --ext-depth=1 --timeout=5 --retries=3 --sockets=4 --near -j'. For further information on the programme and its remaining parameters, see: http://www.httrack.com/html/fcguide.html |
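As a minimal sketch, the recommended command string can be wrapped in a small shell function so that the same settings are used both for a new archive and for resuming a frozen one. The function name `archive_site`, its `mirror`/`continue` argument and the `HTTRACK` variable are assumptions for illustration; the parameters themselves are the ones listed above. By default the function only prints the command.

```shell
#!/bin/sh
# Dry run by default (prints the command); set HTTRACK=./httrack to run it.
HTTRACK=${HTTRACK:-"echo ./httrack"}

# usage: archive_site URL OUTPUT_DIR [MODE]
# MODE is "mirror" for a new archive (default) or "continue" to resume one.
archive_site() {
    url=$1
    out=$2
    case ${3:-mirror} in
        mirror)   mode=--mirror ;;
        continue) mode=--continue ;;
        *) echo "unknown mode: $3" >&2; return 2 ;;
    esac
    # Depth limits, flow control, --near and -j as recommended in the text.
    $HTTRACK "$url" --path "$out" "$mode" --depth=4 --ext-depth=1 \
        --timeout=5 --retries=3 --sockets=4 --near -j
}

# New archive of the Centre for Internet Research website.
archive_site "http://cfi.imv.au.dk" "/netsteder/cfi.imv.au.dk"
# After a freeze or interruption, resume into the same directory.
archive_site "http://cfi.imv.au.dk" "/netsteder/cfi.imv.au.dk" continue
```

Keeping the parameters in one place makes it less likely that a resumed run uses different limits from the original run.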
Archiving speed | |||
Archiving time (min) | File size (MB) | Archiving speed (MB/min) | Degree of presence required |
34.5 | 90.4 | 2.62* | Low |
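The archiving speed in the table is the file size divided by the archiving time. The following shell sketch reproduces that calculation with the figures from the table; the function name `archiving_speed` is hypothetical.

```shell
#!/bin/sh
# usage: archiving_speed SIZE_MB TIME_MIN
# Prints the archiving speed in MB per minute with two decimals.
archiving_speed() {
    # LC_ALL=C keeps the decimal point independent of the user's locale.
    LC_ALL=C awk -v size="$1" -v minutes="$2" \
        'BEGIN { printf "%.2f\n", size / minutes }'
}

archiving_speed 90.4 34.5   # prints 2.62
```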
Test details |
Test date and time: Friday October 8 2004, 10 a.m. to 6 p.m.
Tested by: Bo Hovgaard Thomasen
Tested by archiving: http://www.dr.dk/kroniken , http://www.dr.dk/nyheder , http://www.dr.dk/skum
Speed test carried out by archiving: http://www.dr.dk/nyheder/html/nyheder/baggrund/tema2003/krise/index.jhtml
* Note that the measured archiving speed is affected by a deliberate limit in WinHTTrack, which by default limits the download rate to 25 kb/s. This default limit can be changed. |
The elements below have been rated on the following scale for the number of archived elements: 0=none, 1=few, 2=average, 3=most, 4=all
Structure | |||
| 3 | | |
Cascading Style Sheets | 4 | The archived material appears as defined in CSS. | |
Page composition | 3 | Elements are correctly positioned on almost all archived web pages. | |
Background | 4 | All backgrounds are archived | |
Pop-up windows | 2 | Many pop-up windows are not active in the archived version, and elements such as photos are sometimes missing from the archived pop-up windows. | |
Archiving of all the desired web pages | 3 | Some errors appear, causing some of the underlying pages to be downloaded incorrectly or not downloaded at all, even though WebHTTrack is capable of archiving the type of element of which the pages are composed. This may be because the pages are slow and the archiving programme interrupts archiving of pages or elements after some seconds of inactivity (timeout), or that the structure of the pages is more complicated than what the programme is capable of archiving. | |
Movement between elements in the structure | |||
Link | 3 | | |
Print/writing | Textual link | 3 | All textual links are archived. However, textual links referring to JavaScript are usually not active (although some links to less complicated JavaScript routines do work). This has consequences for such things as the fact boxes that are often used on web pages like www.dr.dk/nyheder. |
Pull-down menu | 2 | Pull-down menus are archived, but only act as links to some degree in the archived version. | |
Forms such as login | 0 | WebHTTrack does not allow archiving of active elements for logging in or filling in forms (online elements are not archived). | |
Image | Animation | 4 | Animation (such as Macromedia Flash) acts as a link in the archived version. |
Graphics | 3 | All graphics links are archived and active (except some JavaScript links) | |
Photo | 3 | All photo links are archived and active (except some JavaScript links) | |
Moving images | - | Not tested | |
Link target | 2 | | |
Print/writing | Text | 4 | All text on the archived pages is included in the archived material. |
Image | Animation | 2 | Only animation not requiring an online connection is archived. |
Graphics | 3 | Graphics are usually archived. | |
Photo | 3 | Most photos are archived. | |
Moving images | 2 | Only moving images not requiring an online connection are archived | |
Other | - | ||
Sound | 2 | Only sound not requiring an online connection is archived | |
Automation | 4 | | |
Automatic redirection | 4 | Automatic redirection is active. | |
Movement in elements in the structure | |||
Automatic + inherent | 3 | | |
Print/writing | 4 | All movable text is archived. | |
Image | Animation | 4 | Flash and Shockwave elements are archived perfectly. |
Moving images | 3 | Moving images usually archived correctly | |
Banner ads | 3 | Banner ads usually archived correctly. | |
Sound | Background sound | 3 | Background sound usually archived correctly |
Banner ads | 4 | Sound in banner ads archived correctly | |
Automatic + online | 0 | | |
Print/writing | Chat as reader | 1 | Elements requiring online connection cannot be archived using WebHTTrack. |
Image | Moving images | 0 | Elements requiring online connection cannot be archived using WebHTTrack. |
Sound | 0 | Elements requiring online connection cannot be archived using WebHTTrack. | |
User intervention + inherent | 3 | | |
Print/writing | Archived chat | - | Not tested |
Mouse-over | 4 | Mouse-over text is archived and active. | |
Quizzes | - | Not tested | |
Clickable maps | 4 | Clickable maps (such as Macromedia Flash) are archived and functional. | |
Image | Non-streamed image (such as slide show, clickable map) | 3 | Usually functional in the archived version |
Games | 1 | Games are archived poorly, because they are usually constructed with online elements (reporting high scores to the website, etc.). However, some games are correctly archived. | |
Quizzes | 1 | Quizzes are archived poorly, because they are usually constructed with online elements (reporting high scores to the website, etc.). However, some quizzes are correctly archived. | |
Clickable maps (w. zoom or activation) | 4 | Clickable maps (Macromedia Flash) are correctly archived and functional | |
Mouse-over | 4 | Mouse-over images are archived and active | |
Sound | Non-streamed sound (e.g. activated in games, quizzes, etc.) | 3 | Sound is archived and is usually functional in the archived version. |
Mouse-over | 3 | Sound is archived and is usually functional in the archived version. | |
User intervention + online | 0 | | |
Print/writing | Chat (as participant) | 0 | Elements requiring online connection cannot be archived using WebHTTrack. |
Polls | 0 | Elements requiring online connection cannot be archived using WebHTTrack. | |
Test-yourself | 0 | Elements requiring online connection cannot be archived using WebHTTrack. | |
Image | Streamed images | 0 | Elements requiring online connection cannot be archived using WebHTTrack. |
Games | 0 | Elements requiring online connection cannot be archived using WebHTTrack. | |
Sound | Streaming (both archived and live) | 0 | Elements requiring online connection cannot be archived using WebHTTrack. |
Non-movable elements | |||
| 3 | | |
Print/writing | 4 | All print/writing is correctly archived. | |
Image | 3 | All images are correctly archived. | |
Sound | 0 | Sound is not archived. |
The test was carried out by graduate student Bo Hovgaard Thomasen during the period from July to December 2004, and its premises and main results are set out in the text Test of software and strategies for micro-archiving websites.
Note: We do not have the resources to offer technical support or other advice on the use of the tested archiving programme beyond what can be found on this web page.