Public data backups

These notes grew out of those on personal data storage, which cover the technical means. I have long kept a local music collection, starting before broadband and unmetered connectivity were available around here, and generally preferred to avoid relying on online services, particularly commercial ones, since those tend to let users down. As local censorship advanced, complete with a partial Internet blackout and the threat of a complete one, while inexpensive storage devices grew in capacity, I started storing more public data in addition to my private data backups.

Apart from Internet blackouts or individual resource blocking by a government, usual data sources may become unavailable because of a technical issue (along with the rest of the Internet if the issue is near the user), or due to the publisher changing their policies. These notes include suggestions on the kinds of public data to back up, along with links to some of them and their size estimates.

Texts

Written works tend to be the most information-dense, making it easy to collect and store much more of those than one could hope to read in a lifetime.

Kiwix (with its OpenZIM archives) is a nice project. Its primary viewer may seem awkward for use in normal circumstances, but apparently it aims to be useful to the general public and in bad circumstances: it provides archives as packages, while the viewer, with versions for every common OS, can also serve those to others in a local network via a web browser. library.kiwix.org provides, among others, indexed archives of Project Gutenberg (about 75,000 public domain books by 2026), Wikipedia, Wikisource, Wikibooks, Wikiversity, Wiktionary, ready.gov, WikiHow, various StackExchange projects, Khan Academy, and many smaller bits like ArchWiki, RationalWiki, Explain XKCD (contains the comics).

textfiles.com provides archives of files grouped by category, which are well-compressed, curious, and entertaining. RFC Editor bulk retrieval ceased to serve readily available archives by 2026, but one can rsync it, optionally archiving and compressing afterwards, e.g.:

rsync -avz --delete rsync.rfc-editor.org::rfcs-text-only/ rfcs-text-only/
tar --group=nogroup --owner=nobody -czf rfcs-text-only.tgz rfcs-text-only/
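
Since such archives may sit on a disk for years, it can help to store a checksum next to each one, so that silent corruption is detectable later. A minimal sketch, assuming GNU tar and a sha256sum utility; the function names archive_dir and verify_archive are made up for illustration, and a numeric owner is used here instead of the names above only so that the result does not depend on the local user database:

```shell
# archive_dir DIR: write DIR.tgz next to DIR, plus DIR.tgz.sha256.
archive_dir() {
    dir=${1%/}
    parent=$(dirname "$dir")
    name=$(basename "$dir")
    (
        cd "$parent" || exit 1
        # Numeric owner 0 avoids depending on the local user database.
        tar --owner=0 --group=0 --numeric-owner -czf "$name.tgz" "$name" &&
            sha256sum "$name.tgz" > "$name.tgz.sha256"
    )
}

# verify_archive FILE.tgz: succeed only if FILE.tgz matches its checksum.
verify_archive() {
    (
        cd "$(dirname "$1")" || exit 1
        sha256sum -c "$(basename "$1").sha256" > /dev/null 2>&1
    )
}
```

With these defined, archive_dir rfcs-text-only/ would produce rfcs-text-only.tgz and rfcs-text-only.tgz.sha256, and verify_archive rfcs-text-only.tgz can be rerun whenever the backup is checked.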

The POSIX (SUS) specification is useful to have at hand: POSIX.1-2024 is available as an archive (see "Downloads"). Along those lines, there are programming language specifications (reports), and other relevant specifications and references: ISO C, Haskell Language Report, Scheme Reports, Python documentation downloads, RISC-V specification, Intel 64 and IA-32 Architectures Software Developer's Manual, AMD64 Architecture Programmer's Manual, Linux Foundation Referenced Specifications, USB specifications, Bluetooth specifications, ACPI and UEFI specifications, PostgreSQL manual, XMPP Extension Protocols, etc.

Then there are copyright-infringing but much larger libraries like Library Genesis (a trimmed down, txt-only version used to be available at offlineos.com, but apparently not anymore), the-eye.eu books, Anna's Archive, Z-library. The Pirate Bay or similar torrent trackers may help to find book collections, including MIT mathematics and physics books, Cambridge Histories and philosophy companion books, Oxford "Very Short Introductions", and Routledge books, as well as works grouped by author (e.g., Gardner, Feynman). Other topics on which to consider acquiring modern (text)books: major philosophy works, electronics and radio, engineering, sociology, economics, computing, cooking, physical exercises, survival, fiction, medicine (e.g., the Merck manual), other sciences, and any topics of interest, along with individual books on physics, mathematics, and history. Consider the list of books complementary to Wikisource and PG (about 30 GB). Literary awards and charts can be handy for finding books: Pulitzer, Nebula, Locus, Bentley, Booker, Nature's analysis of the 100 most cited papers, The Guardian's top 100 books of all time, The Guardian's 100 best novels written in English, The NYT's 100 Best Books of the 21st Century, Discover magazine's 25 Greatest Science Books of All Time, and similar lists. UN and other organizations' reports may also be of interest.

OpenStax provides good and freely available textbooks under the CC BY license, available for download in PDF. See OpenStax GitHub repositories for their CNXML sources and related tools, though in 2024 I found it tricky to build HTML out of those, and then it still was not good enough for printing. LibreTexts is supposed to be similar, though the licensing information is unclear in some cases, some links lead to HTTP 404 errors, and some of the books are quite messy (attempting to embed YouTube videos into PDFs, having every other page filled with listings of undeclared licenses, or with "welcome" messages). While its subdomains (math, phys, etc) geo-block direct requests from Russia, the books are available without proxying via commons.libretexts.org. One can also search for libre book sources on platforms like GitHub, possibly querying for TeX sources: there are occasional seemingly decent and not well-known textbooks, like Introductory Physics: Building Models to Describe Our World, An Infinitely Large Napkin.

As of 2026, all those (Wikipedia, Wikisource, Wiktionary, Project Gutenberg, OpenStax and other complementary books) would take just 400 to 500 GB, even with images and some non-English versions added. Much of programming documentation, particularly manuals, library references, and sources, is meanwhile available from system repositories.

Book compression

EPUBs (basically ZIP archives with HTML and images) can be compressed by compressing individual images within those. Sometimes files can be removed from an EPUB archive, and it can be trimmed down by passing through pandoc (which would remove included fonts, for instance).

One can try to reduce PDF size (compress the images) with Ghostscript or ImageMagick, among others, sometimes reducing the size by an order of magnitude: see "Efficient PDF optimization with Ghostscript CLI". For instance: gs -q -sDEVICE=pdfwrite -dPDFSETTINGS=/screen -dCompatibilityLevel=1.4 -o out.pdf in.pdf (possibly with -dCompressFonts=true and other options). Its -dFirstPage=$START -dLastPage=$END options are also handy sometimes, to extract pages of interest (including cases when some crackpottery is attached to books: that is one of the ways in which the crackpots try to promote it).
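
Recompression does not always help: for PDFs that are already well-optimized, the output may come out larger than the input. A small sketch of a wrapper around the Ghostscript invocation above that keeps whichever file is smaller; the function name shrink_pdf and the GS override variable are assumptions made for this example:

```shell
# shrink_pdf IN OUT: recompress IN with Ghostscript into OUT, but fall
# back to a plain copy when the result is not smaller than the original.
# Set GS to point at another Ghostscript binary; it defaults to "gs".
shrink_pdf() {
    in=$1
    out=$2
    "${GS:-gs}" -q -sDEVICE=pdfwrite -dPDFSETTINGS=/screen \
        -dCompatibilityLevel=1.4 -o "$out" "$in" || return 1
    # $(( ... )) normalizes any whitespace that wc may print.
    old=$(( $(wc -c < "$in") ))
    new=$(( $(wc -c < "$out") ))
    if [ "$new" -ge "$old" ]; then
        cp "$in" "$out"
        new=$old
    fi
    echo "$in: $old -> $new bytes"
}
```

It would be called as shrink_pdf in.pdf out.pdf, and it is easy to run over a whole directory from a loop.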

Then there are PDFs of scanned books, most of the information in which is graphical details of book pages, often in color. Considerable space can be saved by converting those into black and white, getting rid of the page backgrounds. For instance:

# Convert a page at a time, otherwise ImageMagick
# hogs too much space in $TMP. Page indices in
# in.pdf[N] are zero-based.
mkdir -p out
for p in $(seq 4 435); do
  convert -density 150 "in.pdf[$p]" -colorspace Gray -fuzz 20% \
    -opaque white -monochrome "out/$(printf "%03d" "$p").png"
done
# Combine the images and compress the result
mutool draw -o out.pdf out/*.png
mutool clean -gggg -i -f out.pdf out_clean.pdf
# Run an OCR to recreate a text layer
ocrmypdf -j 2 -l eng --title 'book title here' out_clean.pdf out_ocr.pdf

Software

Apart from censoring books and the Internet, dictatorships like to issue "national operating systems" and mandate their spyware, or simply disrupt connections to system repositories as collateral damage, so backing up software can also be useful.

Software sources are particularly useful to back up for potential isolated usage, ensuring the ability to study and customize those, but one needs some binaries to bootstrap a system. Some of the options to consider are (with size estimates from January of 2026):

Debian archive mirroring: about 230 GB when done with debmirror, for amd64 trixie (13.3) with sources, while the Mirror Size page lists numbers for mirroring all suites. One may also consider using a caching proxy server, apt-cacher-ng, and Modifying Debian CD. Unlike most others, Debian repositories contain all the source packages, which include upstream sources.
Slackware downloads: a whole mirror (for a single version) is under 20 GB, but it has few packages, and rather dated ones at that. It does seem to be one of the few distributions with complete sources, though.
Gentoo source mirrors, particularly distfiles, almost 600 GB. Those include multiple versions of the same programs.
Arch Linux Mirrors take a little over 110 GB for packages ("pool"), and 31 GB for sources (though the wiki claims it is 80 GB and 110 GB, respectively; also most mirrors do not seem to host sources); apparently sources for many packages are not present.
Fedora mirroring: about 356 GB for "Everything" x86_64 packages, 123 GB for source ones.
OpenBSD mirrors: the sources may be in distfiles directories (as used by OpenBSD ports), but I have not found mirrors with such directories available via rsync.
NetBSD mirrors: about 200 GB in distfiles, under 70 GB for precompiled amd64 packages. Those include multiple versions of the same programs.
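
With mirror sizes in the hundreds of gigabytes, it may be worth checking the free space on the target filesystem before starting a long transfer. A rough sketch using POSIX df output; the function names free_gib and has_room are hypothetical, and the sizes are treated as GiB, which is close enough for estimates like the ones above:

```shell
# free_gib DIR: print the free space on DIR's filesystem in whole GiB.
free_gib() {
    # df -Pk reports 1024-byte blocks; column 4 is the available space.
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1048576) }'
}

# has_room DIR NEEDED_GIB: succeed if at least NEEDED_GIB GiB are free.
has_room() {
    [ "$(free_gib "$1")" -ge "$2" ]
}
```

For the Debian numbers above, a check like has_room /mnt/backup 230 can be chained with && before the mirroring command, so that the transfer only starts when there is enough room.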

Debian, in addition to being an all-around good system, seems to be a good option for such mirroring as well. The mirroring itself is done rather easily:

sudo apt install debmirror debian-keyring
gpg --no-default-keyring --keyring trustedkeys.gpg --import /usr/share/keyrings/debian-archive-keyring.gpg
gpg --list-keys --keyring trustedkeys.gpg
debmirror -v -d trixie -a amd64 --source -h mirror.mephi.ru --method=rsync /mnt/backup/debian/mirror/

An up-to-date live Debian CD/USB image is useful to store along with it, and perhaps a Debian wiki dump, as well as any additional firmware needed for one's hardware, and possibly firmware for devices other than regular computers, such as OpenWRT images for routers, GrapheneOS or LineageOS images for phones and tablets (along with individual program distributions, APKs; some software I use is listed in the note on mobile computing), KOReader for e-readers. Consider F-Droid mirroring and OpenWRT source code saving, or backups of individual packages.

Audio

As mentioned in the introduction, I always kept a music collection, and probably this is quite common. While musical records may seem less important than books and other written works, they still have cultural value and provide entertainment. My music discovery note seems relevant here.

Audiobooks (including BBC radio collections) may also be useful to collect, even if one does not listen to those normally.

Video

Also for cultural and entertainment purposes, there are movies, and particularly long TV series may be suitable for hoarding; out of nice sci-fi ones, there are Doctor Who, Star Trek, Red Dwarf, Farscape, Lexx, Firefly, Defiance, Battlestar Galactica, Babylon 5, The X-Files, First Wave; plenty more can be found on Wikipedia; for humorous ones, see Black Books, The IT Crowd, Taskmaster, plenty of sitcoms.

Music videos are nice to have around, for the same reasons.

Lectures and educational videos on varied subjects can both be useful in addition to books (like 3blue1brown, providing useful illustrations and intuitive explanations, or various arts and crafts, or exercises, demonstrating how to do something) and work as book substitutes, to share with those who do not read much, or in case there are not many books on a given topic (say, recent local legal practices). Unfortunately those are often hosted on YouTube, which, in addition to being blocked here (and in other places, see censorship of YouTube), tries to prevent downloads itself, but there is yt-dlp, which may work. I usually download videos for archival at 480p if the visual details matter (perhaps 2 to 5 MB per minute), or even 360p if it is mostly a speaker standing and talking for the whole long video (under 2 MB per minute), which is done with the -S "res:480" option (or -S "res:360", respectively). I have collected some video links, including interesting YouTube channels. One may consider relatively information-dense ones (lectures, online lessons) first, possibly followed by entertainment-education, pop-sci, and documentaries.
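
When archiving whole channels or playlists, it helps to make the download resumable. The following is only a sketch: the function name fetch_video, the YTDLP override variable, the archive.txt file name, and the output template are choices made for this example, while -S, --download-archive, and -o are regular yt-dlp options:

```shell
# fetch_video URL [HEIGHT]: download at up to HEIGHT pixels (default 480),
# recording finished downloads in archive.txt so that reruns skip them.
# Set YTDLP to use another binary; it defaults to "yt-dlp".
fetch_video() {
    url=$1
    height=${2:-480}
    "${YTDLP:-yt-dlp}" -S "res:$height" \
        --download-archive archive.txt \
        -o '%(uploader)s/%(title)s [%(id)s].%(ext)s' \
        "$url"
}
```

A talking-head lecture would then be fetched with fetch_video followed by its URL and 360, and anything else with the URL alone. Interrupted runs can simply be restarted.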

Other

Other large and legal archives to consider for backing up: Wikimedia Downloads, Complete OSM Data, arXiv and other Open Access sources. If one gets into tape storage, Common Crawl can be considered. For select website downloads, I use wget --mirror --page-requisites --convert-links --no-parent --continue --adjust-extension https://example.com/~foo/, occasionally adding something like --exclude-directories=photos,pictures or just listing URLs manually (since it can be hard to separate heavy bits of little interest from the others otherwise), and sometimes having to add --compression=gzip if wget gets confused otherwise, or --max-redirect=0 if there are redirects to semi-blocked websites with freezing connections (and while trying to download those directly, given that wget does not support SOCKS proxies). But some websites make archives available (as mine does, see ../files/archive.tgz), or they are hosted at GitHub/Codeberg/Tilde/etc "pages", making the archive available for download (also as mine does, see codeberg.org/defanor/pages). Some wiki-based websites also provide data dumps, static HTML or database ones.
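
The wget invocation above is long enough to be worth wrapping. A minimal sketch, where the function name mirror_site and the WGET override variable are made up for illustration, and the optional second argument is passed to --exclude-directories as is:

```shell
# mirror_site URL [EXCLUDED_DIRS]: mirror URL for offline viewing.
# EXCLUDED_DIRS is an optional comma-separated list of directories to skip.
# Set WGET to use another binary; it defaults to "wget".
mirror_site() {
    url=$1
    excluded=${2:-}
    # Build the argument list in the positional parameters.
    set -- --mirror --page-requisites --convert-links --no-parent \
        --continue --adjust-extension
    if [ -n "$excluded" ]; then
        set -- "$@" "--exclude-directories=$excluded"
    fi
    "${WGET:-wget}" "$@" "$url"
}
```

For example, mirror_site https://example.com/~foo/ photos,pictures would skip the two heavy directories, while the remaining occasional options (--compression=gzip, --max-redirect=0) are left out, since they are only needed for particular websites.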

Statistical ("ML", "AI") models for LLMs (llama.cpp) and speech recognition (whisper.cpp) may be useful to collect as well. LLMs in particular, while they do hallucinate, also contain plenty of information, and in a way that may make it easier to retrieve in some cases.