itch-dl

Site Archive Inconsistencies

Open JoshuaFern opened this issue 2 years ago • 3 comments

I'm backing up games on Itch and I've noticed multiple inconsistencies with the archived pages generated by itch-dl.

  • Cyrillic (all Unicode?) text is garbled.
  • 'Updated' and 'Published' dates are missing under 'More information'
  • Some images, like the cover and screenshots, seem to be archived but are not linked from the local mirror. Other images, like all store banners and user avatars, are missing entirely.
  • Developer log posts are not saved, and if I understand correctly the devlogs can also include past versions of software that itch-dl does not download. It would be super slick if itch-dl could back up all the past versions as well, with an argument like --devlog.
  • Useful information in community posts is also not archived.

I tried with and without using --mirror-web but there was not much of a difference. Screenshots are saved when specified but I did not note any additional benefit.

Jan 29 '23 01:01 JoshuaFern

  • Looks like sites primarily in non-Latin scripts got their encoding guessed incorrectly and the output ended up garbled - I've released 0.3.3 with a quick fix, check if it resolves your issues. I've tested it on Cyrillic and CJK pages, which now look correct.
  • Looks like published/updated/etc dates don't show up for some games (not all...?) if the webpage request is unauthenticated (requires proper session cookies, not just the API key). Need to investigate.
  • Itch stores older versions separately on another API endpoint, they're not connected with devlogs directly. Either way, the downloader currently fetches just the latest version, but yup, it would be nice to be able to grab those as well. I've added #9 to track this in a separate issue.
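For anyone wondering how a wrongly guessed encoding produces this kind of garbling, here is a minimal sketch. It is not the actual change made in 0.3.3; `decode_html` is a hypothetical helper, and the assumption is simply that page bodies are UTF-8 unless a charset is explicitly declared. Decoding UTF-8 bytes as Latin-1 never raises an error, which is why the garbled output goes unnoticed until someone looks at the page.

```python
from typing import Optional


def decode_html(body: bytes, declared: Optional[str] = None) -> str:
    """Decode a page body (hypothetical helper, not part of itch-dl).

    Tries the explicitly declared charset first, then UTF-8, and only
    falls back to Latin-1 when both fail.
    """
    for encoding in (declared, "utf-8"):
        if not encoding:
            continue
        try:
            return body.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset name: try the next candidate.
            continue
    # Latin-1 maps every byte to a character, so this cannot fail,
    # but it may produce mojibake for non-Latin text.
    return body.decode("latin-1")


original = "Привет, мир"
page = original.encode("utf-8")

garbled = page.decode("latin-1")  # what a wrong guess looks like
fixed = decode_html(page)         # UTF-8 is tried before Latin-1
```

Here `garbled` is a run of accented Latin characters, while `fixed` round-trips to the original Cyrillic string.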

But in general, yeah, the webpage mirroring feature is very barebones - the intended use case was to scrape just the front page and screenshots attached there, as that often includes instructions not included with games themselves. Getting image links correct, all the devlogs/comments, etc would require a lot more postprocessing.
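As a rough illustration of the postprocessing involved for image links alone, the sketch below rewrites `<img src="...">` attributes to point at local copies. It is an assumption-laden example, not itch-dl code: `relink_images` and the `local_paths` mapping (remote URL to saved file path) are hypothetical, and a regex only handles plain double-quoted `src` attributes, so `srcset`, CSS backgrounds, and lazy-loading attributes would still need separate handling.

```python
import re
from typing import Dict

# Matches an <img> tag up to and including the opening quote of its src
# attribute, then the URL, then the closing quote.
IMG_SRC = re.compile(r'(<img\b[^>]*?\ssrc=")([^"]+)(")', re.IGNORECASE)


def relink_images(html: str, local_paths: Dict[str, str]) -> str:
    """Point <img> tags at local files when a local copy is known.

    URLs without an entry in local_paths are left untouched.
    """
    def swap(match: "re.Match[str]") -> str:
        url = match.group(2)
        return match.group(1) + local_paths.get(url, url) + match.group(3)

    return IMG_SRC.sub(swap, html)


page = (
    '<p><img class="shot" src="https://img.example/a.png">'
    '<img src="https://img.example/b.png"></p>'
)
# Only a.png was downloaded in this made-up example.
mirrored = relink_images(page, {"https://img.example/a.png": "screenshots/a.png"})
```

A real implementation would more likely walk the parsed document with an HTML parser instead of a regex, which is part of why this is more work than it first appears.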

I'll try to find some time to fix up dates and at least partial site parsing in the coming weeks, but I've got a lot on my plate right now until the end of February, so I can't say when :/

Jan 29 '23 14:01 DragoonAethis

Thanks for your careful consideration.

Conceptually, I think I'm so keen on accurate mirroring because I can imagine a future where itch doesn't exist anymore, and this tool was used to back up a bunch of games and post them on archive.org. It would be a shame if anything was lost.

Jan 29 '23 19:01 JoshuaFern