These notes grew out of those on personal data storage, which cover the technical means. I have kept a local music collection since before broadband and unmetered connectivity arrived around here, and have generally preferred to avoid relying on online services, particularly commercial ones, since those tend to let users down. As local censorship advanced, with a partial Internet blackout already in place and a complete one threatened, and as inexpensive storage devices grew in capacity, I started storing more public data in addition to my private data backups.
Apart from Internet blackouts or individual resource blocking by a government, usual data sources may become unavailable because of a technical issue (along with the rest of the Internet if the issue is near the user), or due to the publisher changing their policies. These notes include suggestions on the kinds of public data to back up, along with links to some of them and their size estimates.
Written works tend to be the most information-dense, making it easy to collect and store much more of those than one could hope to read in a lifetime.
Kiwix (with its OpenZIM archives) is a nice project. Its primary viewer may seem awkward for use in normal circumstances, but apparently it aims to be useful to the general public and in bad circumstances: it provides archives as packages, while the viewer (with versions for every common OS) can also serve those to others on a local network via a web browser. library.kiwix.org provides, among others, indexed archives of Project Gutenberg (about 75,000 public domain books by 2026), Wikipedia, Wikisource, Wikibooks, Wikiversity, Wiktionary, ready.gov, WikiHow, various StackExchange projects, Khan Academy, and many smaller bits like ArchWiki, RationalWiki, Explain XKCD (contains the comics).
textfiles.com provides archives of files grouped by category, which are well-compressed, curious, and entertaining. RFC Editor bulk retrieval ceased to serve readily available archives by 2026, but one can rsync it, optionally archiving and compressing afterwards, e.g.:
rsync -avz --delete rsync.rfc-editor.org::rfcs-text-only/ rfcs-text-only/
tar --group=nogroup --owner=nobody -czf rfcs-text-only.tgz rfcs-text-only/
The POSIX (SUS) specification is useful to have at hand: POSIX.1-2024 is available as an archive (see "Downloads"). Along those lines, there are programming language specifications (reports), and other relevant specifications and references: ISO C, Haskell Language Report, Scheme Reports, Python documentation downloads, RISC-V specification, Intel 64 and IA-32 Architectures Software Developer's Manual, AMD64 Architecture Programmer's Manual, Linux Foundation Referenced Specifications, USB specifications, Bluetooth specifications, ACPI and UEFI specifications, PostgreSQL manual, XMPP Extension Protocols, etc.
Then there are copyright-infringing but much larger libraries like Library Genesis (a trimmed down, txt-only version used to be available at offlineos.com, but apparently not anymore), the-eye.eu books, Anna's Archive, Z-library. The Pirate Bay or similar torrent trackers may help to find book collections, including MIT mathematics and physics books, Cambridge Histories and philosophy companion books, Oxford "Very Short Introductions", and Routledge books, as well as works grouped by author (e.g., Gardner, Feynman). Other topics on which to consider acquiring modern (text)books: major philosophy works, electronics and radio, engineering, sociology, economics, computing, cooking, physical exercises, survival, fiction, medicine (e.g., the Merck manual), other sciences, and any topics of interest, along with other individual books on physics, mathematics, and history. Consider the list of books complementary to Wikisource and PG (about 30 GB). Literary awards and charts can be handy for finding books: Pulitzer, Nebula, Locus, Bentley, Booker, Nature's analysis of the 100 most cited papers, The Guardian's top 100 books of all time, The Guardian's 100 best novels written in English, The NYT's 100 Best Books of the 21st Century, and similar lists. UN and other organizations' reports may also be of interest.
One can reduce PDF size (compress the images) with Ghostscript or ImageMagick, among others, sometimes by an order of magnitude: see "Efficient PDF optimization with Ghostscript CLI". For instance: gs -q -sDEVICE=pdfwrite -dPDFSETTINGS=/screen -dCompatibilityLevel=1.4 -o out.pdf in.pdf (possibly with -dCompressFonts=true and other options). Its -dFirstPage=$START -dLastPage=$END options are also handy sometimes, to extract pages of interest (including cases when some crackpottery is attached to books: that is one of the ways in which crackpots try to promote it). EPUBs (basically ZIP archives with HTML and images) can be shrunk by compressing the individual images within them. Sometimes files can be removed from an EPUB archive, and it can be trimmed down by passing it through pandoc (which would remove included fonts, for instance).
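The Ghostscript invocation above can be wrapped so that the recompressed file is kept only when it actually came out smaller, which is not guaranteed for PDFs that are mostly text. This is a minimal sketch, assuming a POSIX shell with mktemp available; the function names size_of and shrink_pdf are made up for illustration, and the options are just the ones quoted above:

```shell
#!/bin/sh
# size_of FILE: print the file size in bytes.
size_of() {
    wc -c < "$1" | tr -d ' '
}

# shrink_pdf SRC DST: recompress SRC with Ghostscript into DST.
# If gs fails or the result is not smaller, DST becomes a plain copy of SRC.
# DST must be a different file than SRC.
shrink_pdf() {
    src=$1
    dst=$2
    tmp=$(mktemp) || return 1
    # /screen downsamples images aggressively; pick a milder preset if the
    # figures have to stay readable.
    if gs -q -sDEVICE=pdfwrite -dPDFSETTINGS=/screen \
          -dCompatibilityLevel=1.4 -o "$tmp" "$src" \
       && [ "$(size_of "$tmp")" -lt "$(size_of "$src")" ]; then
        cat "$tmp" > "$dst"
    else
        cat "$src" > "$dst"
    fi
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

It can then be applied to a whole directory, for instance with: for f in books/*.pdf; do shrink_pdf "$f" "small/${f##*/}"; done (both directory names are placeholders).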
OpenStax provides good and freely available textbooks under the CC BY license, available for download in PDF. See OpenStax GitHub repositories for their CNXML sources and related tools, though in 2024 I found it tricky to build HTML out of those, and then it still was not good enough for printing. LibreTexts is supposed to be similar, though the licensing information is unclear in some cases, some links lead to HTTP 404 errors, and some of the books are quite messy (attempting to embed YouTube videos into PDFs, having every other page filled with listings of undeclared licenses, or with "welcome" messages). While its subdomains (math, phys, etc) geo-block direct requests from Russia, the books are available without proxying via commons.libretexts.org. One can also search for libre book sources on platforms like GitHub, possibly querying for TeX sources: there are occasional seemingly decent and not well-known textbooks, like Introductory Physics: Building Models to Describe Our World, An Infinitely Large Napkin.
As of 2026, all those (Wikipedia, Wikisource, Wiktionary, Project Gutenberg, OpenStax and other complementary books) would take just 400 to 500 GB, even with images and some non-English versions added. Much of programming documentation, particularly manuals, library references, and sources, is available from system repositories.
Apart from censoring books and the Internet, dictatorships like to issue "national operating systems" and mandate their spyware, or simply disrupt connections to system repositories as collateral damage, so backing up software can also be useful.
Software sources are particularly useful to back up for potential isolated usage, ensuring the ability to study and customize those, but one needs some binaries to bootstrap a system. Some of the options to consider are (with size estimates from January of 2026):
debmirror, for amd64 trixie (13.3) with sources, while the Mirror Size page lists numbers for mirroring all suites. One may also consider using a caching proxy server, apt-cacher-ng, and Modifying Debian CD. Unlike most others, Debian repositories contain all the source packages, which include upstream sources. Debian, in addition to being an all-around good system, seems to be a good option for such mirroring as well. The mirroring itself is done rather easily:
sudo apt install debmirror debian-keyring
gpg --no-default-keyring --keyring trustedkeys.gpg --import /usr/share/keyrings/debian-archive-keyring.gpg
gpg --list-keys --keyring trustedkeys.gpg
debmirror -v -d trixie -a amd64 --source -h mirror.mephi.ru --method=rsync /mnt/backup/debian/mirror/
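To actually install from such a mirror while offline, apt has to be pointed at the local directory. The following is a rough sketch, assuming the mirror has the usual dists/ and pool/ layout under the target directory; the helper name local_mirror_sources and the output file name are made up, and listing only the "main" component is an assumption (add whichever components were mirrored):

```shell
#!/bin/sh
# local_mirror_sources DIR SUITE COMPONENT...: print apt source lines
# (binary and source) for a mirror stored in the local directory DIR.
local_mirror_sources() {
    dir=${1%/}   # drop a trailing slash, if any
    suite=$2
    shift 2
    printf 'deb file:%s %s %s\n' "$dir" "$suite" "$*"
    printf 'deb-src file:%s %s %s\n' "$dir" "$suite" "$*"
}
```

For example, local_mirror_sources /mnt/backup/debian/mirror/ trixie main | sudo tee /etc/apt/sources.list.d/local-mirror.list, followed by sudo apt update.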
An up-to-date live Debian CD/USB image is useful to store along with it, and perhaps a Debian wiki dump. The same goes for necessary additional firmware for one's hardware, and possibly firmware for devices other than regular computers, such as OpenWrt images for routers, GrapheneOS or LineageOS images for phones and tablets (along with individual program distributions, APKs; some software I use is listed in the note on mobile computing), KOReader for e-readers. Consider F-Droid mirroring and OpenWrt source code saving, or backups of individual packages.
As mentioned in the introduction, I always kept a music collection, and probably this is quite common. While musical records may seem less important than books and other written works, they still have cultural value and provide entertainment. My music discovery note seems relevant here.
Audiobooks (including BBC radio collections) may also be useful to collect, even if one does not listen to those normally.
Also for cultural and entertainment purposes, there are movies, and particularly long TV series may be suitable for hoarding; out of nice sci-fi ones, there are Doctor Who, Star Trek, Red Dwarf, Farscape, Lexx, Firefly, Defiance, Battlestar Galactica, Babylon 5, The X-Files, First Wave; plenty more can be found on Wikipedia; for humorous ones, see Black Books, The IT Crowd, Taskmaster, plenty of sitcoms.
Music videos are nice to have around, for the same reasons.
Lectures and educational videos on varied subjects can be both useful in addition to books (such as 3blue1brown, providing useful illustrations and intuitive explanations, or videos on various arts and crafts, or exercises, demonstrating how to do something), and work as book substitutes, to share with those who do not read much, or in case there are not many books on a given topic (say, recent local legal practices). Unfortunately those are often hosted on YouTube, which, in addition to being blocked here (and in other places, see censorship of YouTube), tries to prevent downloads itself, but there is yt-dlp, which may work. I usually download videos for archival at 480p if the visual details matter (perhaps 2 to 5 MB per minute), or even 360p if it is mostly a speaker standing and talking for the whole long video (under 2 MB per minute); the resolution cap is set with the -S "res:480" option (or "res:360", respectively). I have collected some video links, including interesting YouTube channels. One may consider relatively information-dense ones (lectures, online lessons) first, possibly followed by entertainment-education, pop-sci, and documentaries.
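The per-minute figures above make it easy to budget storage before downloading, and yt-dlp can work through a list of links while remembering what it has already fetched. This is only a sketch: the function names, the downloaded.txt file, and the output template are my illustrative choices, while --batch-file, --download-archive, and --output are regular yt-dlp options:

```shell
#!/bin/sh
# estimate_mb MINUTES MB_PER_MINUTE: rough storage estimate in megabytes.
estimate_mb() {
    echo $(( $1 * $2 ))
}

# archive_videos LIST DIR: download every URL listed in LIST (one per line)
# into DIR, capped at 480p.
archive_videos() {
    list=$1
    dir=$2
    # --download-archive records finished video IDs, so that a rerun
    # (after a dropped connection, or with new links added) skips them.
    yt-dlp -S "res:480" \
        --batch-file "$list" \
        --download-archive "$dir/downloaded.txt" \
        --output "$dir/%(uploader)s/%(title)s [%(id)s].%(ext)s"
}
```

For instance, 100 hours of lectures at the upper 480p estimate: estimate_mb 6000 5 prints 30000, that is, about 30 GB. The download itself would be started with something like archive_videos links.txt /mnt/backup/video (both names are placeholders).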
Other large and legal archives to consider for backing up: Wikimedia Downloads, Complete OSM Data, arXiv and other Open Access sources. If one gets into tape storage, Common Crawl can be considered. For select website downloads, I use wget --mirror --page-requisites --convert-links --no-parent --continue --adjust-extension https://example.com/~foo/, occasionally adding something like --exclude-directories=photos,pictures or just listing URLs manually (since it can be hard to separate heavy bits of little interest from the others otherwise), and sometimes having to add --compression=gzip if wget gets confused otherwise, or --max-redirect=0 if there are redirects to semi-blocked websites with freezing connections (when downloading directly, since wget does not support SOCKS proxies). But some websites make archives available (as mine does, see ../files/archive.tgz), or they are hosted at GitHub/Codeberg/Tilde/etc "pages", making the archive available for download (also as mine does, see codeberg.org/defanor/pages). Some wiki-based websites also provide data dumps, static HTML or database ones.
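When there are several websites to mirror, the wget invocation above can be run over a list of URLs, reporting the ones that need another look. This is a minimal sketch, assuming a POSIX shell; the function name mirror_sites and the list file format (one URL per line, with "#" comments) are made up for illustration:

```shell
#!/bin/sh
# mirror_sites LIST: mirror every URL in LIST into the current directory.
# Blank lines and lines starting with '#' are skipped.
# Returns non-zero if wget reported a failure for any of the URLs.
mirror_sites() {
    list=$1
    failed=0
    while IFS= read -r url || [ -n "$url" ]; do
        case $url in
            ''|'#'*) continue ;;
        esac
        # wget also exits non-zero when only some pages failed (a single
        # 404, for instance), so a "failed" site may be mostly complete.
        if ! wget --mirror --page-requisites --convert-links --no-parent \
                  --continue --adjust-extension "$url" < /dev/null; then
            echo "failed: $url" >&2
            failed=$((failed + 1))
        fi
    done < "$list"
    [ "$failed" -eq 0 ]
}
```

Running mirror_sites sites.txt 2> failed.log (file names are placeholders) then leaves a list of URLs to retry, possibly with extra options like --exclude-directories.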
Statistical ("ML", "AI") models, such as LLMs (for llama.cpp) and speech recognition models (for whisper.cpp), may be useful to collect as well. LLMs in particular, while they do hallucinate, also contain plenty of information, stored in a way that may make it easier to retrieve in some cases.