Making an offline copy of Project Gutenberg

Almost a year ago, I got a bee in my bonnet to download all the English-language, text-file books in Project Gutenberg. I came up with a whole pipeline using wget to download a subset of their catalog and then ed scripts to shave everything down to the proper URLs, which were dumped into a file and then wget-ted in a second round to get the documents themselves. I had vague ideas about somehow diff-ing the current catalog and the one I first downloaded for periodic updates, but never got around to actually implementing it.
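The diffing idea can be sketched with standard tools. This is a minimal sketch and not the pipeline described above: it assumes each catalog snapshot has already been boiled down to a plain list with one URL per line, and the function name and file names are hypothetical.

```shell
# new_urls OLD NEW: print the lines of NEW that do not appear in OLD.
# OLD and NEW are hypothetical files holding one URL per line, in any order.
new_urls() {
    old_sorted=$(mktemp)
    new_sorted=$(mktemp)
    # comm needs sorted input; -u also drops duplicate lines.
    sort -u "$1" > "$old_sorted"
    sort -u "$2" > "$new_sorted"
    # -1 hides lines only in OLD, -3 hides lines in both,
    # which leaves only the lines that are new in NEW.
    comm -13 "$old_sorted" "$new_sorted"
    rm -f "$old_sorted" "$new_sorted"
}

# Possible use (file names are made up):
#   new_urls urls-2025.txt urls-2026.txt > to-fetch.txt
#   wget -c -i to-fetch.txt
```

Only URLs that are new in the second list come out, so a second wget round would fetch just the additions.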

Cut to three days ago, when I went back to the Project Gutenberg website to re-read about ways to get robot access to the site and discovered on the offline catalogs page that they now put out a zipped archive of ALL Project Gutenberg text-file books, updated weekly. This reduced my workload to:

wget -c "https://gutenberg.org/cache/epub/feeds/txt-files.tar.zip"
wget -c "https://gutenberg.org/dirs/GUTINDEX.zip"

The text-files downloaded in about a half hour, the index in seconds. It took far longer to unzip and untar the text-files than it did to download them. But when I was done, I had my own offline copy of Project Gutenberg. And whenever I feel I need an update, I'll just download it all again.
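The whole download-and-unpack round could be wrapped in one small function. This is only a sketch under stated assumptions: the target directory and the DRY_RUN switch are hypothetical conveniences, and the tarball inside the zip is assumed to be named txt-files.tar, which the post does not confirm.

```shell
# refresh_gutenberg DIR: download both archives into DIR and unpack them.
# The body runs in a subshell, so "set -eu" does not leak into the caller.
refresh_gutenberg() (
    set -eu
    dest=${1:-gutenberg-offline}   # hypothetical default directory

    # With DRY_RUN set, print each command instead of running it.
    run() {
        if [ -n "${DRY_RUN:-}" ]; then
            printf '%s\n' "$*"
        else
            "$@"
        fi
    }

    run mkdir -p "$dest"
    run wget -c -P "$dest" "https://gutenberg.org/cache/epub/feeds/txt-files.tar.zip"
    run wget -c -P "$dest" "https://gutenberg.org/dirs/GUTINDEX.zip"
    run unzip -o -q "$dest/txt-files.tar.zip" -d "$dest"
    run unzip -o -q "$dest/GUTINDEX.zip" -d "$dest"
    # Assumption: the zip contains a single tarball named txt-files.tar.
    run tar -xf "$dest/txt-files.tar" -C "$dest"
)

# Possible use:
#   DRY_RUN=1 refresh_gutenberg ~/gutenberg   # show the commands only
#   refresh_gutenberg ~/gutenberg             # actually download and unpack
```

For a later update it is probably safest to point this at a fresh directory, since wget -c would try to resume onto an older copy of the archive instead of replacing it.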

~~~~~~~~~~~~~~~~~~~~~~~~~

~Thumos thumos@tilde.club February 2026