itch-dl
Site Archive Inconsistencies
I'm backing up games on Itch and I've noticed multiple inconsistencies with the archived pages generated by itch-dl.
- Cyrillic (All unicode?) is garbled.
- 'Updated' and 'Published' dates are missing under 'More information'
- Some images like the cover and screenshots seem to be archived but not linked to the local mirror, and other images are missing like all store banners and user avatars.
- Developer log posts are not saved, and if I understand correctly the devlogs can also include past versions of software that itch-dl does not download. That would be super slick if itch-dl could back up all the past versions as well with an argument like --devlog.
- Useful information in community posts is also not archived.
I tried with and without using --mirror-web but there was not much of a difference. Screenshots are saved when specified but I did not note any additional benefit.
- Looks like sites primarily in non-Latin scripts got their encoding guessed incorrectly and the output ended up garbled - I've released 0.3.3 with a quick fix, check if it resolves your issues. I've tested it on Cyrillic and CJK pages which now look correct.
- Looks like published/updated/etc dates don't show up for some games (not all...?) if the webpage request is unauthenticated (requires proper session cookies, not just the API key). Need to investigate.
- Itch stores older versions separately on another API endpoint, they're not connected with devlogs directly. Either way, the downloader currently fetches just the latest version, but yup, it would be nice to be able to grab those as well. I've added #9 to track this in a separate issue.
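For anyone curious about the first point, here is a minimal sketch of how a wrong encoding guess garbles text. It assumes the page bytes are UTF-8 and that they were decoded as ISO-8859-1. That pairing is only an illustration; the exact guess that went wrong inside itch-dl may have been different, and the sample string is made up.

```python
# Illustrative only: this is not itch-dl's code, just a demonstration of the failure mode.
text = "Привет, мир"                 # made-up Cyrillic sample
raw = text.encode("utf-8")           # bytes as a server might send them

# Wrong guess: ISO-8859-1 maps every single byte to one character,
# so each two-byte Cyrillic letter turns into two unrelated symbols.
garbled = raw.decode("iso-8859-1")

# Right guess: the original text comes back unchanged.
correct = raw.decode("utf-8")

print(garbled)
print(correct)

# ISO-8859-1 decoding never fails and loses no bytes, so the damage
# can be undone by reversing the wrong step and decoding properly.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
```

The last step only works if the garbled text has not been modified or re-saved in yet another encoding since it was written.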
But in general, yeah, the webpage mirroring feature is very barebones - the intended use case was to scrape just the front page and screenshots attached there, as that often includes instructions not included with games themselves. Getting image links correct, all the devlogs/comments, etc would require a lot more postprocessing.
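To give a rough idea of the kind of postprocessing meant here, below is a small sketch that points image tags at local copies. The function name `rewrite_image_links`, the `local_paths` mapping, and the example URLs are all hypothetical and not part of itch-dl; a real implementation would more likely use a proper HTML parser and also handle `srcset`, CSS backgrounds, and single-quoted attributes.

```python
import re

def rewrite_image_links(html: str, local_paths: dict[str, str]) -> str:
    """Swap remote <img src="..."> URLs for local paths when a local copy is known.

    Hypothetical helper: `local_paths` maps remote URL -> path of the downloaded file.
    URLs without an entry are left untouched.
    """
    def swap(match: re.Match) -> str:
        url = match.group(2)
        return f"{match.group(1)}{local_paths.get(url, url)}{match.group(3)}"

    # Only handles double-quoted src attributes inside <img> tags.
    return re.sub(r'(<img\b[^>]*?\bsrc=")([^"]*)(")', swap, html)


page = (
    '<img class="screenshot" src="https://img.example/a.png">'
    '<img src="https://img.example/b.png">'
)
mirrored = rewrite_image_links(page, {"https://img.example/a.png": "files/a.png"})
print(mirrored)
```

Only the first image is rewritten in this example, because the mapping has no entry for the second one.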
I'll try to find some time to fix up dates and at least partial site parsing in the coming weeks, but I've got a lot on my plate right now until end of February, so can't say when :/
Thanks for your careful consideration.
Conceptually, I think I'm so keen on accurate mirroring because I can imagine a future where itch doesn't exist anymore, and this tool was used to back up a bunch of games and post them on archive.org. It would be a shame if anything was lost.