Test of Archiving Software - WebHTTrack

Type: Complete websites

Name: WebHTTrack Website Copier

Platform: UNIX/Mac OS X

Version: 3.33-beta-3

Price: Free

URL: http://www.httrack.com/

Remarks: An X Window (X11) version for Mac OS X of the open-source programme HTTrack Website Copier. This version of the programme is operated from the command line. A Windows version (with a graphical interface) can be downloaded from the programme website (see the test of the Windows version, WinHTTrack).

 

Conclusion

WebHTTrack is a command-line offline browser that can archive a website's source code and other content, and convert these elements so that the archived version is navigable. An advantage of archiving with HTTrack is that the material is archived in the format in which it was written, so the archived pages display in a browser and can be worked with just like the online version. HTTrack cannot archive material that requires an online connection for viewing (typically chat, polls, test-yourself, streamed elements and most games). On the other hand, the programme converts web pages remarkably well: links usually work internally in the archived version, and web pages usually appear as they did online, with the exception of the online elements mentioned. Several archiving processes can be run at the same time with this programme, and archiving can be automated via scripts (see an example of an AppleScript that starts 25 simultaneous archiving processes).

 

Recommended settings

A number of requirements must be met before WebHTTrack can be used on Mac OS X:
X Window (X11) must be installed on the computer (this can be done from Mac OS X 10.3). X11 can be downloaded from Apple's website at the following URL (44 MB):
http://www.apple.com/macosx/features/x11/download/ . A C compiler must also be installed; the compiler from Apple's developer package, 'Xcode Tools', was used for this test. The package is included with Mac OS X after version 10.3 (search for the file 'Developer.mpkg' or 'XcodeTools.mpkg', which will install the necessary packages. Note: 1.5 GB of free hard disk space is required). 'Xcode Tools' can also be downloaded from Apple (372 MB) after registering as a developer at: http://developer.apple.com/tools/download/ .
After X11 and Xcode Tools have been installed, WebHTTrack's source code must be compiled.


This is done as follows:
(1) Copy the WebHTTrack files, i.e. the HTTrack-3.33.02 folder (note that the folder name varies with programme versions), to your home folder (in my case 'bht').
(2) Start the Terminal programme (from the folder Applications/Utilities).
(3) Type the command 'cd HTTrack-3.33.02'.
(4) Type the command './configure' (configuration can take several minutes).
(5) Type the command 'make' (can take several minutes).
(6) Type the command 'sudo make install' (requires a root level password, can take several minutes).
The programme is now ready for use and can be found in the folder '/usr/local/bin'.
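The build steps above can be collected in a small shell script. This is only a sketch and not part of the HTTrack distribution: the names 'SRC_DIR', 'DRY_RUN', 'run' and 'build_httrack' are introduced here for illustration, and the folder name is the one used in this test. With 'DRY_RUN=1' (the default) the script only prints the commands; set 'DRY_RUN=0' to actually execute them.

```shell
#!/bin/sh
# Sketch of build steps (3)-(6). SRC_DIR, DRY_RUN, run and build_httrack
# are hypothetical names introduced for this example.
SRC_DIR="${SRC_DIR:-$HOME/HTTrack-3.33.02}"  # folder name varies with version
DRY_RUN="${DRY_RUN:-1}"                      # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

build_httrack() {
    run cd "$SRC_DIR"     || return 1   # step 3
    run ./configure       || return 1   # step 4, can take several minutes
    run make              || return 1   # step 5, can take several minutes
    run sudo make install               # step 6, asks for a password
}

build_httrack
```

Stopping at the first failing step (the '|| return 1' parts) avoids installing a half-built programme.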

To archive using WebHTTrack, proceed as follows:
WebHTTrack has a large number of settings, which are chosen by adding parameters to the command './httrack' or '/usr/local/bin/httrack'. The available parameters can be listed with the '--help' parameter (as in './httrack --help'), and are also described in the documentation file 'httrack-doc.html' in the HTTrack-3.33.02 folder and at
http://www.httrack.com/html/fcguide.html . When the programme is run without parameters, a number of dialogues appear in which the archiving process can be named and defined.


The following parameters are recommended as a minimum for archiving:

To indicate where the archived material is to be stored, use the parameter '-O' or '--path', followed by the desired path (such as '/usr/local/bin/httrack "http://www.dr.dk" -O ~/websites/dr.dk' plus any further parameters).
When archiving, we usually want to copy the website to a local computer. This is done with the parameter '-w' or '--mirror' (the programme automatically assumes this parameter unless '-i' or '-W' is specified). Alternatively, the parameter '-W' or '--mirror-wizard' can be used, in which case WebHTTrack presents dialogues during archiving whenever a link leads to a new domain.

The most efficient way of limiting the archiving to the desired web pages is to specify which web addresses the programme is allowed and not allowed to archive. This is done by adding URL patterns prefixed with either a plus or a minus (e.g. '/usr/local/bin/httrack www.dr.dk/nyheder/ -O ~/websites/dr.dk -http://www.dr.dk/* +http://www.dr.dk/nyheder/*'; note the use of wildcards). HTTrack gives filters listed later priority over earlier ones, so the general exclusion is placed first and the exception after it.
Additionally, the archiving of huge websites can be delimited by archiving to a maximum of (typically) 4 levels down from the chosen URL (i.e. following all links on the first page, and thereafter 3 further levels of underlying pages, including any links on these underlying pages). This is done with the parameter '-r4' or '--depth=4'. Links that lead outside the archived domain can, if necessary, be followed to a set number of levels, using the parameter '-%e1' or '--ext-depth=1' for one level. These limitations are necessary when only part of a complex website is to be archived; if they are omitted, archiving a complex website will easily become impossibly extensive, comprising most of the website or, worse, the entire Internet. The appropriate limits will of course depend on the purpose of a given archiving process.
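As a sketch of how the URL filters and the depth limits combine, the following script assembles the command line from its parts and prints it, so that it can be checked before anything is downloaded. The function 'build_scope_cmd' and the variable 'HTTRACK' are hypothetical names introduced here; only the httrack parameters come from the text above. The exclusion is passed before the inclusion because HTTrack's filter documentation states that later filters take precedence.

```shell
#!/bin/sh
# Sketch: limit an archive to one section of a site.
# build_scope_cmd and HTTRACK are illustrative names, not part of HTTrack.
HTTRACK="${HTTRACK:-/usr/local/bin/httrack}"

# Usage: build_scope_cmd START_URL TARGET_DIR FILTER...
# Prints the command; filters are kept in the order given.
build_scope_cmd() {
    url=$1
    dir=$2
    shift 2
    cmd="$HTTRACK \"$url\" -O \"$dir\""
    for filter in "$@"; do
        # Quote each pattern so the shell does not expand '*' itself.
        cmd="$cmd \"$filter\""
    done
    echo "$cmd --depth=4 --ext-depth=1"
}

build_scope_cmd "http://www.dr.dk/nyheder/" "$HOME/websites/dr.dk" \
    "-http://www.dr.dk/*" "+http://www.dr.dk/nyheder/*"
```

The printed line can be pasted into the Terminal once it looks right.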

Flow control consists of a number of further parameters. The timeout should be set to 5 seconds ('-T5' or '--timeout=5') and the number of attempts to 3 ('-R3' or '--retries=3'), because WebHTTrack otherwise has a tendency to 'freeze' at elements that no longer exist or that for some other reason cannot be archived. Flow control also serves to avoid overloading the website being archived; please keep the number of sockets at a reasonable level. Another reason why WebHTTrack occasionally freezes is that too many URLs are being downloaded simultaneously. For this reason the programme should be limited to a maximum of 4 concurrent 'threads' with the parameter '-c4' or '--sockets=4'.
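One way to keep these flow-control settings in a single place is sketched below. The variable names and the 'flow_opts' function are assumptions made for this example; the limit of 4 sockets is simply the maximum recommended above, and the check refuses higher values so that a typing error does not overload the archived website.

```shell
#!/bin/sh
# Sketch: flow-control settings kept together. All names are illustrative.
TIMEOUT=5       # seconds of inactivity before an element is given up
RETRIES=3       # number of attempts per element
SOCKETS=4       # simultaneous connections
MAX_SOCKETS=4   # upper limit recommended in this test

# Print the flow-control parameters, or fail if SOCKETS is too high.
flow_opts() {
    if [ "$SOCKETS" -gt "$MAX_SOCKETS" ]; then
        echo "SOCKETS=$SOCKETS exceeds the limit of $MAX_SOCKETS" >&2
        return 1
    fi
    echo "--timeout=$TIMEOUT --retries=$RETRIES --sockets=$SOCKETS"
}

# Show the resulting command without running it.
echo "/usr/local/bin/httrack \"http://www.dr.dk\" -O \"$HOME/websites/dr.dk\" $(flow_opts)"
```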

If the archiving process freezes, you can attempt to restart the process using the parameter '-i' or '--continue' instead of '-w'/'-W'/'--mirror'/'--mirror-wizard'. The last recommended parameters are '-n' or '--near', which tells WebHTTrack to include all content elements used for viewing a web page in the archive, and '-j', which helps archive pages that use JavaScript. There are many other settings, but those mentioned above are the most necessary.
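The choice between starting and continuing an archiving process can be sketched as follows. The script assumes that an interrupted run leaves a cache folder named 'hts-cache' inside the target folder, which is how HTTrack normally stores its cache; check this against your own version. The function names and the 'HTTRACK' variable are hypothetical.

```shell
#!/bin/sh
# Sketch: use --continue if an earlier run left its cache behind,
# otherwise start a new mirror. Names are illustrative assumptions.
HTTRACK="${HTTRACK:-/usr/local/bin/httrack}"

# Print --continue if TARGET_DIR contains an hts-cache folder, else --mirror.
choose_mode() {
    if [ -d "$1/hts-cache" ]; then
        echo "--continue"
    else
        echo "--mirror"
    fi
}

# Usage: archive_or_resume URL TARGET_DIR
archive_or_resume() {
    mode=$(choose_mode "$2")
    echo "running httrack with $mode" >&2
    "$HTTRACK" "$1" -O "$2" "$mode" --near
}

# Example (not run here):
# archive_or_resume "http://www.dr.dk/nyheder" "$HOME/websites/dr.dk"
```

A frozen process still has to be stopped by hand (for instance with Ctrl-C) before the script is run again.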

To archive the website for the Centre for Internet Research, using the recommended parameters, use one of the following command strings:
./httrack "http://cfi.imv.au.dk" -O "/netsteder/cfi.imv.au.dk" -w -r4 -%e1 -T5 -R3 -c4 -n -j
or
./httrack "http://cfi.imv.au.dk" --path "/netsteder/cfi.imv.au.dk" --mirror --depth=4 --ext-depth=1 --timeout=5 --retries=3 --sockets=4 --near -j
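Several websites can be archived with the same recommended parameters by starting one httrack process per website from a shell script. The following is a minimal sketch under stated assumptions: 'ARCHIVE_ROOT', 'site_dir' and 'archive_all' are names made up for this example, and the folder naming scheme is only one possible choice.

```shell
#!/bin/sh
# Sketch: archive several sites at once, one httrack process per site.
# ARCHIVE_ROOT, site_dir and archive_all are illustrative names.
HTTRACK="${HTTRACK:-/usr/local/bin/httrack}"
ARCHIVE_ROOT="${ARCHIVE_ROOT:-$HOME/websites}"

# Turn a URL into a folder name: drop the scheme and any trailing
# slashes, then replace the remaining '/' with '_'.
site_dir() {
    echo "$1" | sed -e 's|^[a-z]*://||' -e 's|/*$||' -e 's|/|_|g'
}

# Usage: archive_all URL...
archive_all() {
    for url in "$@"; do
        dir="$ARCHIVE_ROOT/$(site_dir "$url")"
        # '&' starts the process in the background, so the loop
        # moves on to the next site immediately.
        "$HTTRACK" "$url" -O "$dir" -w -r4 -%e1 -T5 -R3 -c4 -n -j &
    done
    wait    # return only when every background process has finished
}

# Example (not run here):
# archive_all "http://www.dr.dk/nyheder" "http://www.enmandsavisen.dk/"
```

Each process opens up to 4 connections of its own, so the number of websites archived at once should be kept moderate, especially when they are on the same server.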

For further information on the programme and its (remaining) parameters, see:
http://www.httrack.com/html/fcguide.html .

 

Archiving speed

Archiving time (min): 34.5

File size (MB): 90.4

Archiving speed (MB/min): 2.62*

Degree of presence required: Low
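The archiving speed is the file size divided by the archiving time. As a small worked check of the figures in the table (the function name 'speed_mb_per_min' is made up for this example):

```shell
#!/bin/sh
# Archiving speed = file size (MB) / archiving time (min).
# LC_ALL=C keeps the decimal point independent of the system language.
speed_mb_per_min() {
    LC_ALL=C awk -v size_mb="$1" -v minutes="$2" \
        'BEGIN { printf "%.2f\n", size_mb / minutes }'
}

speed_mb_per_min 90.4 34.5    # prints 2.62
```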

 

Test details

Test date and time: Friday, October 8, 2004, 10 a.m. to 6 p.m.

Tested by: Bo Hovgaard Thomasen

Tested by archiving: http://www.dr.dk/kroniken , http://www.dr.dk/nyheder , http://www.dr.dk/skum ,
http://www.enmandsavisen.dk/ , http://www.dr.dk/spil , nyhederne.tv2.dk

Speed test carried out by archiving: http://www.dr.dk/nyheder/html/nyheder/baggrund/tema2003/krise/index.jhtml

* Note that the measured archiving speed is affected by a deliberate limit in WinHTTrack, which by default limits the download rate to 25 kb/s. This default limit may be altered.

 

Test results

The elements below have been evaluated on the following scale for the number of archived elements: 0=none, 1=few, 2=average, 3=most, 4=all

Structure

3

Cascading Style Sheets

4

The archived material appears as defined in CSS.

Page composition

3

Elements are correctly positioned on almost all archived web pages.

Background

4

All backgrounds are archived

Pop-up-windows

2

Many pop-up windows are not active in the archived version and elements such as photos are sometimes missing in the archived pop-up windows.

Archiving of all the desired web pages

3

Some errors appear, causing some of the underlying pages to be downloaded incorrectly or not downloaded at all, even though WebHTTrack is capable of archiving the type of element of which the pages are composed. This may be because the pages are slow and the archiving programme interrupts archiving of pages or elements after some seconds of inactivity (timeout), or that the structure of the pages is more complicated than what the programme is capable of archiving.

Movement between elements in the structure

Link

3

Print/writing

Textual link

3

All textual links are archived. However, textual links referring to JavaScript are usually not active (although some links to less complicated JavaScript routines do work). This has consequences for such things as fact boxes (which are often used) on web pages like www.dr.dk/nyheder.

Pull-down menu

2

Pull-down menus are archived, but only act as links to some degree in the archived version.

Forms such as login

0

WebHTTrack does not allow archiving of active elements for logging in or filling in forms (online elements are not archived).

Image

Animation

4

Animation (such as Macromedia Flash) acts as a link in the archived version.

Graphics

3

All graphics links are archived and active (except some JavaScript links)

Photo

3

All photo links are archived and active (except some JavaScript links)

Moving images

-

Not tested

Link target

2

Print/writing

Text

4

All text on the archived pages is included in the archived material.

Image

Animation

2

Only animation not requiring an online connection is archived.

Graphics

3

Graphics are usually archived.

Photo

3

Most photos are archived.

Moving images

2

Only moving images not requiring an online connection are archived

Other

-

Sound

2

Only sound not requiring an online connection is archived

Automation

4

Automatic redirection

4

Automatic redirection is active.

Movement in elements in the structure

Automatic + inherent

3

Print/writing

4

All movable text is archived.

Image

Animation

4

Flash- and Shockwave-elements are archived perfectly

Moving images

3

Moving images usually archived correctly

Banner ads

3

Banner ads usually archived correctly.

Sound

Background sound

3

Background sound usually archived correctly

Banner ads

4

Sound in banner ads archived correctly

Automatic + online

0

Print/writing

Chat as reader

1

Elements requiring online connection cannot be archived using WebHTTrack.

Image

Moving images

0

Elements requiring online connection cannot be archived using WebHTTrack.

Sound

0

Elements requiring online connection cannot be archived using WebHTTrack.

User intervention + inherent

3

Print/writing

Archived chat

-

Not tested

Mouse-over

4

Mouse-over text is archived and active.

Quizzes

-

Not tested

Clickable maps

4

Clickable maps (such as Macromedia Flash) are archived and functional.

Image

Non-streamed image (such as slide show, clickable map)

3

Usually functional in the archived version

Games

1

Games are archived poorly, because they are usually constructed with online elements (reporting high scores to the website, etc.). However, some games are correctly archived.

Quizzes

1

Quizzes are archived poorly, because they are usually constructed with online elements (reporting high scores to the website, etc.). However, some quizzes are correctly archived.

Clickable maps (w. zoom or activation)

4

Clickable maps (Macromedia Flash) are correctly archived and functional

Mouse-over

4

Mouse-over images are archived and active

Sound

Non-streamed sound (e.g. activated in games, quizzes, etc.)

3

Sound is archived and is usually functional in the archived version.

Mouse-over

3

Sound is archived and is usually functional in the archived version.

User intervention + online

0

Print/writing

Chat (as participant)

0

Elements requiring online connection cannot be archived using WebHTTrack.

Polls

0

Elements requiring online connection cannot be archived using WebHTTrack.

Test-yourself

0

Elements requiring online connection cannot be archived using WebHTTrack.

Image

Streamed images

0

Elements requiring online connection cannot be archived using WebHTTrack.

Games

0

Elements requiring online connection cannot be archived using WebHTTrack.

Sound

Streaming (both archived and live)

0

Elements requiring online connection cannot be archived using WebHTTrack.

Non-movable elements

3

Print/writing

4

All print/writing is correctly archived.

Image

3

All images are correctly archived.

Sound

0

Sound is not archived.


The test was carried out by graduate student Bo Hovgaard Thomasen during the period from July to December 2004, and its premises and main results are explained in the text Test of software and strategies for micro-archiving websites.

Note: We do not have the resources to offer technical support or other advice on the use of the tested archiving programme beyond what can be found on this web page.