Update: Downloading all archive.org metadata

BermudaHighball@lemmy.dbzer0.com · 4 months ago

thingsiplay@beehaw.org · 4 months ago

Thanks for the update. In my recent research and backups, I found a lot of content that is no longer available. The entries are still there; only the files are missing. I think some files reappeared later. My assumption is that most of the data is slowly being restored from a backup, if they have one. Torrent and metadata files are generated automatically, so if they were deleted, I assume they would be rebuilt after some time. There is so much data that I have no idea how long this would take…

I will keep checking some files again and again to see if they come back. Otherwise we have lost a lot of data and history.

BermudaHighball@lemmy.dbzer0.com · edit-2 4 months ago

Yes, exactly why I wanted to start this project. It’s nice to have the Internet Archive but we cannot trust that content won’t be taken down eventually. Even just storage costs might become an issue in the future for data that gets maybe 30 total views over many years. But it is nice to hear some of the data you were looking at is coming back.

Long term, it would be nice for a community of users to create a decentralized index of Internet Archive metadata so that it cannot be taken down. The index would include the torrent files for the content, so people can share it and help seed the content they care about. The Internet Archive might cooperate to make this easier, for example by using BitTorrent v2, which would help us detect duplicate files and avoid padding files, since all files are aligned to pieces in v2.
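To illustrate the padding point, here is a minimal sketch of the arithmetic, not how any particular client does it. In a v1 torrent the files are concatenated into one byte stream, so a file only starts on a piece boundary if the previous file is followed by enough padding. The function names `padding_needed` and `layout_with_padding` are made up for this example, and the piece length is an arbitrary assumption.

```python
from typing import List, Tuple


def padding_needed(file_size: int, piece_length: int) -> int:
    """Bytes of padding required after a file so the next one starts on a piece boundary."""
    if piece_length <= 0:
        raise ValueError("piece_length must be positive")
    # Distance from file_size up to the next multiple of piece_length (0 if already aligned).
    return (-file_size) % piece_length


def layout_with_padding(file_sizes: List[int], piece_length: int) -> List[Tuple[int, int, int]]:
    """Return (offset, size, pad) per file for a v1-style concatenated layout.

    Hypothetical helper: the last file gets no padding because nothing follows it.
    """
    layout = []
    offset = 0
    for i, size in enumerate(file_sizes):
        is_last = i == len(file_sizes) - 1
        pad = 0 if is_last else padding_needed(size, piece_length)
        layout.append((offset, size, pad))
        offset += size + pad
    return layout
```

With v2 each file is hashed on its own, so this padding step is not needed, and two torrents containing the same file end up with the same hash for it, which is what makes duplicate detection possible.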

Currently there is little incentive for people to seed the Internet Archive content but no doubt it will become more important to do that in the future.

kabi@lemm.ee · 4 months ago

If you store it in compressed chunks (or on a file system that supports compression, I guess), that should be a great deal smaller, being text only.
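A rough sketch of what the compressed-chunk idea could look like, assuming each item's metadata is a JSON-serializable dict. The file naming, the chunk size and the helper names (`write_chunks`, `read_chunk`) are all invented for illustration; only the standard library `gzip` and `json` modules are used.

```python
import gzip
import json
from pathlib import Path


def _flush(chunk, out_dir, index):
    """Write one chunk of records as gzip-compressed JSON Lines and return its path."""
    path = out_dir / f"metadata-{index:06d}.jsonl.gz"
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in chunk:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def write_chunks(records, out_dir, chunk_size=10_000):
    """Split an iterable of metadata dicts into compressed chunk files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) >= chunk_size:
            paths.append(_flush(chunk, out_dir, len(paths)))
            chunk = []
    if chunk:  # write whatever is left over
        paths.append(_flush(chunk, out_dir, len(paths)))
    return paths


def read_chunk(path):
    """Load all records back from one compressed chunk file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```

Chunking keeps each file small enough to read on its own, so you do not have to decompress the whole index to look up a few items.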

They supposedly never delete good stuff, just make it unavailable, as you said. Maybe we’ll get them in a hundred or so years!

ArchiveTeam’s uploads also don’t have torrent files (anymore). Their wiki says that they disabled them to lighten the load on IA’s servers, as creating a .torrent file for a 10+ GB upload takes considerable time, especially if it has to be redone.