Aarhus University Seal / Aarhus Universitets segl

Test Of Archiving Software - WinHTTrack

 

Type: Complete websites

Name: WinHTTrack Website Copier

Platform: Windows

Version: 3.33-beta-3

Price: Free

URL: www.httrack.com

Remarks: A Windows version of the open-source programme HTTrack Website Copier. Various UNIX versions can be downloaded from the programme's website (see the test of the X-windows version for Mac OS X: WebHTTrack).


Conclusion

WinHTTrack is an offline browser that can archive a website's source code and other content, and convert these elements so that the archived version is navigable. An advantage of archiving with WinHTTrack is that the material is stored in the format in which it was written, so that the archived pages can be viewed directly in a browser. WinHTTrack cannot archive material that requires an online connection for viewing (typically chat, polls, test-yourself, streamed elements and most games). On the other hand, the programme converts web pages remarkably well, so that links usually work internally in the archived version, and web pages usually appear as they did online, with the exception of the online elements mentioned. Several archiving processes can be run at the same time. A command line version for MS-DOS is included in the WinHTTrack software package, which makes it possible to automate the archiving process with batch scripts.


Recommended settings


Graphical User Interface version (Windows)

WinHTTrack has a large number of options for settings, but a GUI (Graphical User Interface) simplifies configuring the archiving process. This is done via a guide, where the archiving process is first given a name, and the option 'transfer website' is chosen (next page). Next, the URL to be archived is entered. The remaining settings are chosen with the aid of the button 'define settings', where the following settings are recommended:

The filter tab: The most efficient way of limiting the archiving to the desired web pages is to specify which web addresses the programme is allowed and not allowed to archive. This is done by adding URLs prefixed with either a plus or a minus (e.g. "+http://www.dr.dk/nyheder/* -http://www.dr.dk/*" – note the use of wildcards). Limitations are necessary when only part of a complex website is to be archived – if there are no limitations, archiving complex websites will easily become impossibly extensive, and include most of the website. This type of limitation is of course dependent on the purpose of the given archiving process.

The 'limitations' tab: A further way to limit archiving is to set the number of downward levels in the chosen URL to (typically) max. 4 (i.e. follow all links on the first page and then 3 underlying pages – incl. any links that might be found on these underlying pages, etc.). A limit can also be set on the extent to which archiving is to include external websites (outside the chosen domain).

The 'Flow control' tab: Timeout is set to 5 seconds and the number of attempts to 3. This is done because WinHTTrack otherwise has a tendency to 'freeze' at elements which, for instance, no longer exist or cannot be archived for some other reason.


Command line version (MS-DOS)

WinHTTrack includes a command line version for MS-DOS (called HTTrack). To use the command line version, open an MS-DOS prompt and type 'cd C:\Program Files\WinHTTrack'. Then type 'httrack', optionally followed by a number of parameters that refine the archiving.

To archive using the MS-DOS version of HTTrack, proceed as follows:
HTTrack has a large number of setting options, which are accessed by adding parameters to the command 'httrack'. The possible parameters are listed by the '--help' parameter (as in 'httrack --help') or at www.httrack.com/html/fcguide.html. When the programme is run without parameters, a number of dialogues appear, in which the archiving process can be named and defined.

The following parameters are recommended as a minimum for archiving:
To indicate where the archived material is to be stored, it is recommended to use the parameter '-O' or '--path', followed by the desired path (such as 'httrack www.dr.dk -O C:\webpages\dr.dk' plus any further parameters).

When archiving, we usually want to copy the website to a local computer, i.e. to mirror the website locally. To achieve this, the parameter '-w' or '--mirror' is used. Alternatively, it is possible to use the parameter '-W' or '--mirror-wizard', where HTTrack offers dialogues during archiving whenever a link leads to a new domain. In order to delimit archiving, it is a good idea to archive (typically) max. 4 levels down in the chosen URL (i.e. to follow all links on the first page, and thereafter 3 more underlying pages – incl. any links on these underlying pages, etc.). This is done with the parameter '-r4' or '--depth=4'. Links that are not part of the archived domain can, if necessary, be limited to a number of levels, using the parameter '-%e1' or '--ext-depth=1' for one level. These limitations are necessary when only part of a complex website is to be archived – if the limitation is omitted, archiving complex websites will easily become impossibly extensive, comprising most of the website, or worse, the entire Internet. The limitation will of course depend on the purpose of a given archiving process.
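What a depth limit does can be illustrated with a small Python sketch. This is not HTTrack's own code: the toy link graph and the function name are hypothetical, and counting the start page as level 1 is an assumption that should be checked against HTTrack's documentation.

```python
from collections import deque


def pages_within_depth(links, start, max_depth):
    """Return {page: level} for every page reachable within max_depth levels.

    Assumption: the start page counts as level 1.
    """
    levels = {start: 1}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        level = levels[page]
        if level >= max_depth:
            continue  # links on the deepest allowed level are not followed
        for target in links.get(page, []):
            if target not in levels:
                levels[target] = level + 1
                queue.append(target)
    return levels


# Hypothetical site: a chain of five pages, each linking to the next one.
toy_site = {"p1": ["p2"], "p2": ["p3"], "p3": ["p4"], "p4": ["p5"], "p5": []}
archived = pages_within_depth(toy_site, "p1", max_depth=4)
print(sorted(archived))
```

With a depth of 4, the fifth page in the chain is left out, which is the kind of cut-off that keeps a mirror of a large website manageable.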

Flow control consists of a number of other parameters: Timeout should be set to 5 seconds ('-T5' or '--timeout=5') and the number of attempts to 3 ('-R3' or '--retries=3'). This is done because HTTrack otherwise has a tendency to 'freeze' at elements that no longer exist, or for some other reason cannot be archived. Another reason for HTTrack's occasionally freezing is that too many URLs are being downloaded simultaneously. For this reason the programme should be limited to archiving with max. 4 concurrent 'threads', using the parameter '-c4' or '--sockets=4'. If the archiving process freezes, you can attempt to restart the process, using '-i' or '--continue' instead of '-w'/'-W'/'--mirror'/'--mirror-wizard'. A last recommended parameter is '-n' or '--near', which tells HTTrack that all content elements used for viewing a web page are to be included in the archiving. There are many other options for settings, but the above-mentioned are the most essential.
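When archiving is automated, the choice between starting a new mirror and continuing an interrupted one could be made by a script. The Python sketch below is only an illustration: the function name is hypothetical, and it assumes that a previous run has left a cache folder named 'hts-cache' in the output folder.

```python
import os


def choose_start_option(output_dir):
    """Pick '--continue' if an earlier run left a cache, else '--mirror'.

    Assumption: HTTrack keeps its cache in a sub-folder called 'hts-cache'.
    """
    cache_dir = os.path.join(output_dir, "hts-cache")
    return "--continue" if os.path.isdir(cache_dir) else "--mirror"
```

The returned option would then be placed in the command string where '-w' or '-i' would otherwise be typed by hand.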

To archive the website for the Centre for Internet Research, using the recommended parameters, use one of the following command strings: 'httrack cfi.imv.au.dk -O C:\webpages\cfi.imv.au.dk -w -r4 -%e1 -T5 -R3 -c4 -n' or 'httrack cfi.imv.au.dk --path C:\webpages\cfi.imv.au.dk --mirror --depth=4 --ext-depth=1 --timeout=5 --retries=3 --sockets=4 --near'.
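As an alternative to a batch script, the long form of the command string above could be assembled in Python. This is a minimal sketch: the function name and its default values are illustrative, it only uses the parameters described on this page, and it assumes that 'httrack' can be found on the PATH when the command is actually run.

```python
import subprocess


def build_httrack_command(url, output_dir, depth=4, ext_depth=1,
                          timeout=5, retries=3, sockets=4):
    """Return the httrack command as a list of arguments."""
    return [
        "httrack", url,
        "--path", output_dir,
        "--mirror",
        f"--depth={depth}",
        f"--ext-depth={ext_depth}",
        f"--timeout={timeout}",
        f"--retries={retries}",
        f"--sockets={sockets}",
        "--near",
    ]


command = build_httrack_command("cfi.imv.au.dk", r"C:\webpages\cfi.imv.au.dk")
print(" ".join(command))

# To start the archiving, uncomment the next line (requires httrack on the PATH):
# subprocess.run(command, check=True)
```

Keeping the recommended values as default arguments makes it easy to archive several websites with the same settings and to change a single value, such as the depth, for one of them.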

For further information on the programme and its (remaining) parameters, see: www.httrack.com/html/fcguide.html.


Archiving speed

Archiving time (min): 47.5

File size (MB): 90.7

Archiving speed (MB/min): 1.91*

Degree of presence required: Low
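The archiving speed follows from the other two figures. A short Python check of the arithmetic (the variable names are illustrative):

```python
archiving_time_min = 47.5  # archiving time in minutes
file_size_mb = 90.7        # size of the archived material in MB

# Speed is simply size divided by time.
speed_mb_per_min = file_size_mb / archiving_time_min
print(f"{speed_mb_per_min:.2f} MB/min")
```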


Test details


Test date and time: Friday October 8 2004, 10 a.m. – 6 p.m.

Tested by: Bo Hovgaard Thomasen

Tested by archiving: www.dr.dk/kroniken, www.dr.dk/nyheder, www.dr.dk/skum, www.dr.dk/skum/boogie

Speed test carried out by archiving: www.dr.dk/nyheder/html/nyheder/baggrund/tema2003/krise/index.jhtml

* Note that the measured archiving speed is affected by a deliberate limit in WinHTTrack, which restricts the download rate to 25 kb/s. This default limit may be altered.

 

Test results

The elements below have been evaluated on the following scale for the number of archived elements: 0=none, 1=few, 2=average, 3=most, 4=all

Structure: 3
  Cascading Style Sheets: 4. The archived material appears as defined in CSS.
  Page composition: 3. Elements are correctly positioned on almost all archived web pages.
  Background: 4. All backgrounds are archived.
  Pop-up windows: 2. Many pop-up windows are not active in the archived version, and elements such as photos are sometimes missing in the archived pop-up windows.
  Archiving of all the desired web pages: 2. Some errors appear, causing some of the underlying pages to be downloaded incorrectly or not downloaded at all, even though WinHTTrack is capable of archiving the type of element of which the pages are composed. This may be because the pages are slow and the archiving programme interrupts archiving of pages or elements after some seconds of inactivity (timeout), or because the structure of the pages is more complicated than what the programme is capable of archiving.

Movement between elements in the structure
  Link: 3
    Print/writing
      Textual link: 4. All textual links are archived. However, some textual links referring to JavaScript are not active (although some links to less complicated JavaScript routines do work). This has consequences for such things as fact boxes (which are often used) on web pages like www.dr.dk/nyheder.
      Pull-down menu: 2. Pull-down menus are archived, but act only to some degree as links in the archived version.
      Forms such as login: 0. WinHTTrack does not allow archiving of active elements for login or filling in forms (online elements are not archived).
    Image
      Animation: 4. Animation (such as Macromedia Flash) acts as a link in the archived version.
      Graphics: 3. All graphics links are archived and active (except some JavaScript links).
      Photo: 3. All photo links are archived and active (except some JavaScript links).
      Moving images: - (not tested)
  Link target: 2
    Print/writing
      Text: 4. All text on the archived pages is included in the archived material.
    Image
      Animation: 2. Only animation not requiring an online connection is archived.
      Graphics: 3. Graphics are usually archived.
      Photo: 3. Most photos are archived.
      Moving images: 2. Only moving images not requiring an online connection are archived.
    Other: -
    Sound: 2. Only sound not requiring an online connection is archived.
  Automation: 4
    Automatic redirection: 4. Automatic redirection is active.

Movement in elements in the structure
  Automatic + inherent: 3
    Print/writing: 4. All text is archived.
    Image
      Animation: 4. Flash and Shockwave elements are archived perfectly.
      Moving images: 3. Moving images are usually archived correctly.
      Banner ads: 3. Banner ads are usually archived correctly.
    Sound
      Background sound: 3. Background sound is usually archived correctly.
      Banner ads: 4. Sound in banner ads is archived correctly.
  Automatic + online: 0
    Print/writing
      Chat as reader: 1. Elements requiring an online connection cannot be archived using WinHTTrack.
    Image
      Moving images: 0. Elements requiring an online connection cannot be archived using WinHTTrack.
    Sound: 0. Elements requiring an online connection cannot be archived using WinHTTrack.
  User intervention + inherent: 3
    Print/writing
      Archived chat: - (not tested)
      Mouse-over: 4. Mouse-over images are correctly archived and functional.
      Quizzes: - (not tested)
      Clickable maps: 4. Clickable maps (such as Macromedia Flash) are archived and functional.
    Image
      Non-streamed image (such as slide show, clickable map): 3. Usually functional in the archived version.
      Games: 1. Games are archived poorly, because they are usually constructed with online elements (reporting high scores to the website, etc.). However, some games are correctly archived.
      Quizzes: 1. Quizzes are archived poorly, because they are usually constructed with online elements (reporting high scores to the website, etc.). However, some quizzes are correctly archived.
      Clickable maps (w. zoom or activation): 4. Clickable maps (Macromedia Flash) are correctly archived and functional.
      Mouse-over: 4. Mouse-over images are correctly archived and functional.
    Sound
      Non-streamed sound (e.g. activated in games, quizzes, etc.): 3. Sound is archived and is usually functional in the archived version.
      Mouse-over: 3. Sound is archived and is usually functional in the archived version.
  User intervention + online: 0
    Print/writing
      Chat (as participant): 0. Elements requiring an online connection cannot be archived using WinHTTrack.
      Polls: 0. Elements requiring an online connection cannot be archived using WinHTTrack.
      Test-yourself: 0. Elements requiring an online connection cannot be archived using WinHTTrack.
    Image
      Streamed images: 0. Elements requiring an online connection cannot be archived using WinHTTrack.
      Games: 0. Elements requiring an online connection cannot be archived using WinHTTrack.
    Sound
      Streaming (both archived and live): 0. Elements requiring an online connection cannot be archived using WinHTTrack.

Non-movable elements: 3
  Print/writing: 4. All print/writing is correctly archived.
  Image: 3. All images are correctly archived.
  Sound: 0. Sound is not archived.


The test was carried out by graduate student Bo Hovgaard Thomasen during the period from July to December 2004, and its premises and main results are explicated in the text Test of software and strategies for micro-archiving websites.

Note: We do not have the resources to offer technical support or other advice on the use of the tested archiving programme beyond what can be found on this web page.