Save Web pages and archive them with ArchiveBox: here is your Internet Archive

The Web offers a boundless volume of information useful for your business or profession. The problem is preserving this data and keeping track of it so that it is not lost. Content published online is dynamic by definition: authors can modify or remove it as they wish, and it can also disappear, for example because the hosting provider is no longer being paid. The Internet Archive has been the memory of the Web for almost three decades: it is an initiative that archives a wide range of information published online and offers access to it.

As we explained in an article dedicated to what the Internet Archive is, the service saves web pages over time, generally keeping multiple copies. This way, for example, you can see how a single piece of content or an entire website changes over time. The copies of websites kept by the Internet Archive are usually also "navigable": you can move between the pages of a website (even one that no longer exists), as it appeared some time ago, by clicking on the links (hyperlinks) on the page.

This specific service is called the Wayback Machine, but the Internet Archive hosts much more. For example, it allows you to download ISO images of Windows and other software. When we were no longer able to find the ISO of the old Windows 7 Starter, we managed to locate it there, paying attention to the origin of the file and its digital signature.

Build your own Internet Archive: how to save Web pages with ArchiveBox

Much like the Internet Archive, which still preserves Google's home page as it looked at the end of 1998, ArchiveBox is a solution that lets you save web pages by creating a copy of them. The software stores the HTML page together with all the elements that make it up on storage you control, for example on-premises or in the cloud.

Until some time ago, many users relied on online tools to archive websites and make copies of pages. Last year, however, some companies began to crack down. To take Reddit alone, the well-known social platform has begun to prevent third parties from archiving its pages. Sites that let you save content published on Reddit have closed their doors, and the information users had stored there has disappeared.

ArchiveBox: what it is and how it works

Why trust an online storage service when you can use free software to store the same data, make it conveniently searchable and avoid losing it?

If you have ever bookmarked an important resource published on the web only to discover later that it is no longer available, you know how frustrating this can be.

The Internet Archive is an excellent service but, obviously, it cannot keep track of every page published on the Web. Furthermore, to give just one example, it is unable to capture Facebook. You can manually ask it to start storing a set of contents; the procedure fails, however, if the site's robots.txt file blocks web scraping.

The growing use of JavaScript and embedded video content also makes capturing and storing pages more difficult. Just look at how much of the original functionality is missing from sites archived on the Internet Archive.

ArchiveBox is an open source tool designed to work as a personal web archiving system. Users can save a static copy of a web page and all its associated content. The application lets you build your own personal archive, which can be consulted later even if the original content becomes inaccessible or is removed.

ArchiveBox is installed on your own local system, on a NAS or in the cloud, for example on a virtual machine rented for this purpose. Users can configure ArchiveBox by specifying parameters such as archive directories, include and exclude filters, and other details.
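As a rough illustration of the kind of parameters involved, the sketch below uses ArchiveBox's `config` subcommand. The option names (`TIMEOUT`, `SAVE_MEDIA`, `URL_DENYLIST`) and the example values are assumptions based on the 0.7-series documentation and may differ in other versions; the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: apply a few example settings to the current archive.
# Assumes the `archivebox` CLI is installed and that this is run from inside
# the archive's data folder.
set_archive_options() {
    # Allow slow pages up to two minutes per archiving step (example value).
    archivebox config --set TIMEOUT=120
    # Skip audio/video downloads to save disk space.
    archivebox config --set SAVE_MEDIA=False
    # Exclude URLs matching a regular expression (here: common binary downloads).
    archivebox config --set URL_DENYLIST='\.(zip|exe|dmg)$'
}

# Print the value currently stored for one setting.
show_timeout() {
    archivebox config --get TIMEOUT
}
```

Nothing runs until `set_archive_options` is called, so the values can be adjusted first.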

ArchiveBox downloads the web page and all its contents (HTML, CSS, JavaScript, images, …) and saves them locally. This process creates a static copy of the web page as it was at the time of archiving. An archive update can be run periodically to check that web pages are still accessible and to capture their current state. The application manages updates and the removal of obsolete content on its own.

Where and how to install ArchiveBox

One of the main advantages of ArchiveBox is that it can be installed on a large number of platforms: it can be managed through the package manager on all the main Linux distributions. It can therefore also be installed on a NAS, whose operating system is usually based on the Linux kernel.

Alternatively, regardless of the platform you use (Linux, Windows, macOS), you can use Docker Compose to load and run ArchiveBox in containerized form.

Docker Compose is a tool that makes it easy to define and run multi-container Docker applications. Thanks to Docker, as with other applications, ArchiveBox is "packaged", together with all its components and dependencies, in an environment isolated from the rest of the system (the container, precisely).

Installation with Docker Compose assumes that you have both Docker and Docker Compose installed on your system. You can then start the installation by running the following commands in a terminal window:

mkdir archivebox && \
cd archivebox && \
curl -O 'https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/dev/docker-compose.yml' && \
docker compose run archivebox init --setup

The first two commands create a folder called archivebox and enter it; the third downloads the docker-compose.yml configuration file. The last command initializes ArchiveBox and sets up the working environment: it creates files, configures environment variables, and performs the other steps needed to prepare the application.
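Before running the commands above, it can help to confirm that both prerequisites are actually present. The following is a minimal sketch, not part of the ArchiveBox instructions; the function names `have_cmd` and `check_prereqs` are hypothetical.

```shell
#!/bin/sh
# Report whether a command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Check for Docker and for the Docker Compose plugin.
check_prereqs() {
    if ! have_cmd docker; then
        echo "docker not found: install Docker first"
        return 1
    fi
    # "docker compose version" exits non-zero when the Compose plugin is missing.
    if ! docker compose version >/dev/null 2>&1; then
        echo "docker compose not found: install the Compose plugin"
        return 1
    fi
    echo "docker and docker compose are available"
}

# Run the check; "|| true" keeps the script going if something is missing.
check_prereqs || true
```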

The next command, docker compose up, starts the services defined in docker-compose.yml: it creates and starts the containers for all the services specified in the Compose file. In the case of ArchiveBox, you then simply open a web browser on your local system and type 127.0.0.1:8000 in the address bar to start working with the web page archiving system.
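One possible way to script this step is to start the services in the background and wait until the web interface answers. This is only a sketch: it assumes `curl` is installed and that it is run from the folder containing docker-compose.yml; the function names and the retry values are hypothetical.

```shell
#!/bin/sh
# Start the services defined in docker-compose.yml in the background
# (-d means "detached").
start_archivebox() {
    docker compose up -d
}

# Poll the web interface until it answers or the attempts run out.
# $1 = URL to probe, $2 = number of attempts (2 seconds apart).
wait_for_ui() {
    url=${1:-http://127.0.0.1:8000}
    tries=${2:-30}
    while [ "$tries" -gt 0 ]; do
        if curl -fsS -o /dev/null "$url"; then
            echo "ArchiveBox is answering at $url"
            return 0
        fi
        tries=$((tries - 1))
        sleep 2
    done
    echo "no answer from $url"
    return 1
}
```

Calling `start_archivebox && wait_for_ui` would then report when the page at 127.0.0.1:8000 is ready to open in the browser.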

Alternatively, you can click on the various sections of the Quickstart guide to obtain specific instructions for installing ArchiveBox in your environment. For example, the guide explains how to install ArchiveBox using a package manager.
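As one example of the package manager route, ArchiveBox is published as a Python package, so an installation with pip might look like the sketch below. The data folder path and the function names are assumptions chosen for illustration; the Quickstart remains the reference for your platform.

```shell
#!/bin/sh
# Hypothetical helper: install ArchiveBox with pip and initialize an archive.
# $1 = folder that will hold the archive (defaults to an example path).
install_with_pip() {
    data_dir=${1:-"$HOME/archivebox/data"}
    python3 -m pip install --upgrade archivebox || return 1
    mkdir -p "$data_dir"
    cd "$data_dir" || return 1
    # Same initialization step used in the Docker Compose procedure.
    archivebox init --setup
}

# Serve the web interface on port 8000 (run from inside the data folder).
serve_archive() {
    archivebox server 0.0.0.0:8000
}
```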

How to save web pages and make them searchable locally

Once ArchiveBox is installed, the Add button in the top bar of the web interface lets you add the list of web pages to save locally.

The application lets you specify the URL of a single page to store or enter several addresses, one per line. ArchiveBox then creates a copy of all the specified pages. With the depth set to 0, ArchiveBox captures only the contents of each URL indicated; otherwise you can ask the application to also follow all the links on the page (limited to a single level).
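The same operation can be sketched from the terminal with the `add` subcommand and its `--depth` option. The example.com addresses are placeholders and the function names are hypothetical; the commands assume the Docker Compose setup described earlier and must be run from the folder containing docker-compose.yml.

```shell
#!/bin/sh
# Write the pages to archive into a text file, one URL per line.
cat > urls.txt <<'EOF'
https://example.com/
https://example.com/about
EOF

# Archive only the listed pages (depth 0).
# -T disables the pseudo-terminal so the file can be fed on standard input.
add_pages() {
    docker compose run -T archivebox add --depth=0 < "$1"
}

# Archive one page plus the pages it links to (depth 1).
add_with_links() {
    docker compose run archivebox add --depth=1 "$1"
}
```

For instance, `add_pages urls.txt` would submit both placeholder addresses, while `add_with_links 'https://example.com/'` would also follow the links found on that page.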

For each URL, ArchiveBox downloads all the content of the page and lets you open the locally stored copy with a single click. The information is available in its original format and as a PDF; you can also obtain just the list of multimedia files or view the page source without external references. A PNG screenshot of each page is saved as well.

Furthermore, you can take advantage of the single-file HTML format: a package that keeps all the elements of the page (HTML, CSS, JavaScript, multimedia files, …) in a single container. Images are automatically encoded in Base64 so that they can be embedded in the same file.
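To make the Base64 idea concrete, the sketch below embeds a file in an HTML tag as a `data:` URI using standard shell tools. It illustrates the general technique only and is not ArchiveBox's own code; the file names are hypothetical and the five placeholder bytes are not a valid PNG.

```shell
#!/bin/sh
# Create a tiny stand-in for an image file (placeholder bytes, not a real PNG).
printf 'hello' > sample.bin

# Encode the bytes as Base64 and strip the line breaks some tools insert.
encoded=$(base64 < sample.bin | tr -d '\n')

# Embed the encoded data directly in the tag as a data: URI, so the page
# no longer needs a separate image file.
printf '<img src="data:image/png;base64,%s">\n' "$encoded" > single.html

cat single.html
# prints: <img src="data:image/png;base64,aGVsbG8=">
```

The trade-off of this approach is size: Base64 text is roughly a third larger than the original bytes, in exchange for having everything in one file.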

ArchiveBox also includes everything needed to save videos together with their descriptions and key metadata. Where other applications fail, therefore, ArchiveBox proves to be a particularly effective archivist.

Still from the web interface, ArchiveBox lets users search the archive and find the content they need. It is also possible to export archives so that you can share them and make them accessible on other devices.

The application also offers a CLI (command line interface), which lets you manage the information in the archive and add new pages from the command line.
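A few CLI operations might look like the following sketch. The `list` flags shown (`--json`, `--csv`, `--with-headers`) are taken from the 0.7-series documentation and should be treated as assumptions for other versions; the function names are hypothetical, and the commands assume the `archivebox` CLI is run from inside the archive's data folder.

```shell
#!/bin/sh
# Add a single page to the archive from the command line.
add_page() {
    archivebox add "$1"
}

# Export the archive index so it can be shared or read on another device.
# $1 = destination folder (defaults to the current one).
export_index() {
    out_dir=${1:-.}
    # Full index as JSON, including header metadata.
    archivebox list --json --with-headers > "$out_dir/index.json"
    # Compact CSV with just the capture time and the URL of each snapshot.
    archivebox list --csv=timestamp,url > "$out_dir/index.csv"
}
```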

Opening image credit: iStock.com – D3Damon
