-
Anna's Archive scraped WorldCat
https://annas-blog.org/worldcat-scrape.html
"220GB compressed, 2.2TB uncompressed. 1.3 billion unique IDs (1,348,336,870), covered by 1.8 billion records (1,888,381,236), so 540 million duplicates (29%). 600 million are redirects or 404s, so 700 million unique actual records."
Torrent here:
https://annas-archive.org/torrents#worldcat -
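For anyone who wants to check the quoted arithmetic, here is a quick sketch in Python. The two exact counts come from the quote; the 600 million redirects/404s figure is rounded in the source, so the last number is only approximate.

```python
# Exact counts as quoted from the blog post.
total_records = 1_888_381_236
unique_ids = 1_348_336_870

# Records beyond the first one seen for each ID.
duplicates = total_records - unique_ids
duplicate_share = duplicates / total_records

print(f"duplicates: {duplicates:,}")
print(f"share of all records: {duplicate_share:.1%}")

# Rounded figure from the quote, so treat the result as approximate.
redirects_or_404 = 600_000_000
actual_records = unique_ids - redirects_or_404
print(f"approx. actual records: {actual_records:,}")
```

The duplicate share comes out just under 29% of all records, which matches the rounded figure in the quote.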
@jorol yeah, very likely. This analysis is interesting; I'm curious about the duplicate numbers in other catalogs (like the Italian one).
If I find the disk space I'll download their scrape. I wish they had used Parquet instead of a single zstd-compressed JSONL file. -
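A minimal sketch of how duplicate IDs could be counted from a JSONL stream, one record per line, without loading the whole file. The key name `id` and the sample records are hypothetical; I have not checked the actual field layout of the dump. The function takes any iterable of text lines, so the real file would first need to be decompressed as a stream (for example by piping the output of `zstd -dc` into a script that reads `sys.stdin`).

```python
import json
from typing import Iterable


def count_duplicate_ids(lines: Iterable[str], id_key: str = "id") -> tuple[int, int, int]:
    """Return (records, unique_ids, duplicates) for a stream of JSONL lines.

    `id_key` is a hypothetical field name, not the dump's real schema.
    """
    seen = set()
    records = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        seen.add(record[id_key])
        records += 1
    unique = len(seen)
    return records, unique, records - unique


# Made-up sample standing in for the decompressed stream.
sample = [
    '{"id": 1, "title": "A"}',
    '{"id": 2, "title": "B"}',
    '{"id": 1, "title": "A (again)"}',
]
records, unique, duplicates = count_duplicate_ids(sample)
print(records, unique, duplicates)
```

Keeping every ID in an in-memory set is fine for a sample but would likely need a lot of RAM at 1.3 billion IDs, so a real run would probably want sorting on disk or a database instead.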