#2572 Make ugoira a bit less fucking retarded file format
This merge request is fueled by INDESCRIBABLE RAGE towards whoever the fuck at pixiv decided that it would be an awesome idea to take a perfectly fine pixel animation in lossless gif format, split it into frames, reencode each frame as q=90 jpeg, and then put them into a zip archive many times larger than the original file, as well as towards "people" at danbooru dot donmai dot us (whether they are builders, contributors, approvers, admins, or anyone of even higher role who chooses to post post #3638951 over asset #22382079) who are either too fucking dumb to know the difference or just too lazy to go beyond pixiv/twitter and do three clicks at TPT to grab the original file from fantia/nijie/fanbox/misskey/whatever.
Anyways, now that I vented:
The idea here is very simple - grab the original-ish ugoiraN.ext frames ourselves and pack them, also ourselves, into an archive that will later be fed to ZipImagePlayer.
To pack the files, the fairly low-level libarchive interface is used, so there are no new dependencies.
The resulting archive is very minimal - only the required file headers are written and no file attributes are copied - but it preserves the files within (down to md5 match), produces the same archive each time (also down to md5 match), and works with the above-mentioned ZipImagePlayer.
The caveat here is that while the last part of https://github.com/danbooru/danbooru/issues/2572#issuecomment-175119259 is irrelevant within a single implementation, later changes to the archiving implementation still may/will break md5 checks.
That being said, for cases when ugoira is used the way it was meant to be used (almost never), it actually provides much higher quality than could be achieved with gifs or videos.
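A minimal sketch of the properties listed above, using Python's `zipfile` rather than the libarchive interface this PR actually uses. The function name `pack_frames`, the fixed timestamp, and the sorted-name ordering are assumptions made for illustration, not what the PR does byte for byte.

```python
import hashlib
import io
import zipfile
from typing import Dict

# Assumption: a constant timestamp, so the bytes never depend on build time.
# 1980-01-01 is the earliest date the ZIP format can represent.
FIXED_DATE = (1980, 1, 1, 0, 0, 0)


def pack_frames(frames: Dict[str, bytes]) -> bytes:
    """Pack frames into an uncompressed zip with fixed, minimal headers."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        # Zero-padded names like 000000.jpg sort into playback order.
        for name in sorted(frames):
            info = zipfile.ZipInfo(filename=name, date_time=FIXED_DATE)
            info.external_attr = 0  # no file attributes copied
            info.create_system = 0  # same value regardless of host OS
            zf.writestr(info, frames[name])
    return buf.getvalue()


if __name__ == "__main__":
    archive = pack_frames({"000000.jpg": b"frame0", "000001.jpg": b"frame1"})
    print(hashlib.md5(archive).hexdigest())
```

Packing the same frames twice gives byte-identical output, and the frames read back unchanged, which is all the md5 checks need.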
An extra note (1):
I think, it should also be possible to save each frame as a separate media asset and use the zip player in an "unpacked mode" - that will potentially help avoid duplicates since there is no custom file format involved, and save space in case of repeated frames or revisions with only some frames or only frame delays changed, but I'm too ~~incompetent~~ dumb to consider how to save the frame data, what the "download" button should do in that case, and how to preserve backwards compatibility with existing .zip assets.
Alternatively, going the "create our own archive" way, I personally would want to go even further by doing what Nandaka/PixivUtil2 does - use the custom .ugoira file extension for the resulting zip archive and save the frame delays as animation.json within it (see https://github.com/Nandaka/PixivUtil2/issues/69 for more details). This potentially would also allow uploading ugoiras from disk (~~I think there was an issue for that, but I couldn't find it~~ #5247).
An extra note (2): Ugoiras with png frames actually can preserve transparency. Even more so, VP8 and VP9 codecs, used for webm samples, include alpha channel support and also preserve transparency by default. I can't even properly express how FUCKING HAPPY that does make me.
A few SFW transparent ugoira examples:
- pixiv #112298777 .zip .webm
- pixiv #101003492 .zip .webm
- pixiv #118216840 .zip .webm
Doing further experiments and investigations on ugoira, I realized that it is possible to break the md5 checks in the opposite direction - an artist can upload a new revision with the exact same frames, but change the frame delays, and danbooru will incorrectly assume that those revisions are identical. This is not an issue with .zip archives generated by pixiv, since it always regenerates them on revision in any case, but deterministic .zip files produced in this PR will have the exact same md5, since the metadata is stored outside the archive. To fix that, I decided to do what I said in https://github.com/danbooru/danbooru/pull/5793#issuecomment-2249836810 and just include the metadata inside the archive.
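To make the collision concrete, here is a rough sketch that packs the same frames with two different delay lists, once without and once with the metadata inside the archive. `pack_ugoira` and its `include_metadata` flag are hypothetical names, and the JSON layout follows the `frames` array described in this thread rather than the exact bytes this PR writes.

```python
import hashlib
import io
import json
import zipfile
from typing import List, Tuple

FIXED_DATE = (1980, 1, 1, 0, 0, 0)  # assumption: constant timestamp


def pack_ugoira(frames: List[Tuple[str, bytes]], delays: List[int],
                include_metadata: bool = True) -> bytes:
    """Pack (name, data) frames; optionally append animation.json last."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, data in frames:
            zf.writestr(zipfile.ZipInfo(name, FIXED_DATE), data)
        if include_metadata:
            meta = {"frames": [{"file": name, "delay": delay}
                               for (name, _), delay in zip(frames, delays)]}
            payload = json.dumps(meta, separators=(",", ":")).encode("utf-8")
            zf.writestr(zipfile.ZipInfo("animation.json", FIXED_DATE), payload)
    return buf.getvalue()


def md5(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


if __name__ == "__main__":
    frames = [("000000.jpg", b"a"), ("000001.jpg", b"b")]
    # Same frames, different delays: only the second pair can tell them apart.
    print(md5(pack_ugoira(frames, [100, 100], False)) ==
          md5(pack_ugoira(frames, [100, 250], False)))
    print(md5(pack_ugoira(frames, [100, 100])) ==
          md5(pack_ugoira(frames, [100, 250])))
```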
Another thing to be aware of here is that to my understanding ZipImagePlayer reads the first frames.length files from the archive in the order they are written in it. This isn't a problem for .zip-files produced by pixiv or danbooru, but may break some things if danbooru was to allow unprocessed user-uploaded ugoiras.
Just in case, I went ahead and used the 51635a6 implementation to back up everything I could find posted across several public danbooru sites, plus my personal archive.
So far, it is 14792 posts worth 190 GB of disk space, with 1870 unsuccessful (i.e. bad_id) attempts.
Even if the final implementation ends up being different, it would be possible to readjust for it, as all frames, delays, and source urls are preserved.
- e71ad8d: Undid my attempts at speeding up the frame download process as it, despite actually working, kinda sucked anyway. Just keep in mind that it might need to be fixed later.
- `gallery-dl` also did archive re-zipping, but a bit differently (https://github.com/mikf/gallery-dl/discussions/6147, https://github.com/mikf/gallery-dl/commit/ff07aef7768a2b25c6e24af89231c78e74785cdb, https://github.com/mikf/gallery-dl/commit/319116c92306e5cc684d9d72776b0dd95644b324). First, the files' `mtime` is set to the timestamp in the illustration url instead of just being zeroed. Not sure if this is really needed, as it may break duplicate checks. Second, the metadata is also stored inside the archive in an `animation.json` file, but in a different format - the same as in my/pixiv's `frames`, but at the top level instead of nested in an object. For comparison, PixivUtil2 saves it the same way as I do here.
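A small sketch of the `mtime` difference, not gallery-dl's actual code. It assumes the timestamp is the `/img/YYYY/MM/DD/hh/mm/ss/` part of a pximg URL (as in the example URL used for the test below), ignores time zones, and uses 1980-01-01 as the "zeroed" value because Python's `zipfile` cannot store anything earlier. `mtime_from_url` and `pack_one` are hypothetical helpers.

```python
import hashlib
import io
import re
import zipfile
from typing import Optional, Tuple

# Assumption: the upload time is encoded as /img/YYYY/MM/DD/hh/mm/ss/.
URL_TIME = re.compile(r"/img/(\d{4})/(\d{2})/(\d{2})/(\d{2})/(\d{2})/(\d{2})/")

ZEROED = (1980, 1, 1, 0, 0, 0)  # earliest timestamp a zip entry can carry


def mtime_from_url(url: str) -> Optional[Tuple[int, ...]]:
    """Return (year, month, day, hour, minute, second) or None."""
    match = URL_TIME.search(url)
    return tuple(int(part) for part in match.groups()) if match else None


def pack_one(name: str, data: bytes, date_time: Tuple[int, ...]) -> bytes:
    """Write a single stored entry with the given modification time."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr(zipfile.ZipInfo(name, date_time), data)
    return buf.getvalue()
```

Identical frame bytes packed with different timestamps produce archives with different md5 sums, which is the duplicate-check concern raised above.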
Regarding the animation.json file:
- Our new ugoira player will need some slight changes to handle this, because it assumes that the zip file doesn't contain any extra files and that filenames are exactly 10 bytes long (it does this to make parsing the zip file take one less HTTP request).
- It would be nice to include some extra metadata in the `animation.json` file, such as which Pixiv post the ugoira came from. I don't know if this would be compatible with other ugoira players though (what others exist?).
- It would be better to make `animation.json` the first file in the zip file instead of the last, and for it to include the offsets of the other files in the zip file. That way we could read the zip file from the start in one pass and find out where all the files are, instead of having to read the central directory from the end of the file to find out where the files are, then go back and read them from the start. I think this would break Pixiv's ugoira player, but I don't know about other players.
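For a sense of what "the offsets of the other files" would be, here is a sketch that walks the local file headers from byte 0 and reports where each entry's data starts, without touching the central directory. It is not how the danbooru player works; it only handles archives without data descriptors or zip64, and `scan_local_headers` is a hypothetical name. A packer could run something like this once and store the result in `animation.json`.

```python
import io
import struct
import zipfile
from typing import Iterator, Tuple

# Fixed 30-byte part of a zip local file header (little-endian).
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")
LOCAL_SIGNATURE = b"PK\x03\x04"


def scan_local_headers(data: bytes) -> Iterator[Tuple[str, int, int]]:
    """Yield (name, data_offset, stored_size) for each entry, front to back."""
    pos = 0
    while data[pos:pos + 4] == LOCAL_SIGNATURE:
        (_sig, _version, flags, _method, _time, _date, _crc,
         comp_size, _size, name_len, extra_len) = LOCAL_HEADER.unpack_from(data, pos)
        if flags & 0x08:
            # Sizes live in a trailing data descriptor; a forward scan can't skip.
            raise ValueError("archive uses data descriptors")
        name_start = pos + LOCAL_HEADER.size
        name = data[name_start:name_start + name_len].decode("utf-8")
        data_offset = name_start + name_len + extra_len
        yield name, data_offset, comp_size
        pos = data_offset + comp_size  # next local header, or central directory


if __name__ == "__main__":
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("000000.jpg", b"first frame")
        zf.writestr("000001.jpg", b"second")
    for entry in scan_local_headers(buf.getvalue()):
        print(entry)
```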
> I think, it should also be possible to save each frame as a separate media asset and use the zip player in an "unpacked mode"
This would be possible, but we have 2.7 million ugoira frames currently, so it would add 2.7 million media assets (plus 135 million AI tags at 50 AI tags per asset). The database bloat would outweigh the savings from deduping frames.
I do think it wouldn't be a bad idea to store keyframes (say 1 frame per second, or a frame after every significant scene change). This could allow for better reverse image searches.
> Ugoiras with png frames actually can preserve transparency.
This is annoying for us because these are probably intended to be played on a white background like on Pixiv, so we have to make our ugoira player background white, even though a black background looks better while the video is loading.
> It would be nice to include some extra metadata in the `animation.json` file
The only needed field for playback is frames array. Otherwise, it can contain any other data.
> I think this would break Pixiv's ugoira player

Are we really concerned about it? Pixiv's ZIP player most certainly will break, as it expects the archive to contain only frames, in the correct order. IIRC, filenames have no importance for it at all. The only other player I am aware of is HoneyView, already mentioned above.
> these are probably intended to be played on a white background like on Pixiv
Not right. On pixiv these are on white background due to conversion, not by artist intent. Otherwise the same logic would apply to any other site, especially for [[thumbnail_surprise]], dark/light theme changing images, and such. Transparent background should be fine.
> The only needed field for playback is `frames` array. Otherwise, it can contain any other data.
So far I'm aware of three different formats for `animation.json`:
- gallery-dl's, which is like `[{ "file": "000001.jpg", "delay": 125 }]`.
- PixivUtil2's, which is like `{ "frames": [{ "file": "000001.jpg", "delay": 125 }], "zipSize": 123456 }` (I think, I couldn't get PixivUtil2 to work).
- PixivToolkit's, which is like:
    {
      "ugokuIllustData": {
        "src": "https://i.pximg.net/img-zip-ugoira/img/2021/09/13/08/02/41/92714531_ugoira600x600.zip",
        "originalSrc": "https://i.pximg.net/img-zip-ugoira/img/2021/09/13/08/02/41/92714531_ugoira1920x1080.zip",
        "mime_type": "image/jpeg",
        "frames": [
          {
            "file": "000000.jpg",
            "delay": 100
          }
        ]
      }
    }
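A reader that wants to accept all three layouts could normalize them roughly like this. It is a sketch based only on the shapes shown above, not on how BandiView or Hydrus actually parse the file, and `extract_frames` is a made-up name.

```python
import json
from typing import Any, Dict, List


def extract_frames(doc: Any) -> List[Dict[str, Any]]:
    """Return the list of {"file", "delay"} entries from any known layout."""
    if isinstance(doc, list):
        # gallery-dl: the frames array sits at the top level.
        return doc
    if isinstance(doc, dict):
        if "ugokuIllustData" in doc:
            # PixivToolkit: the response is wrapped in one extra object.
            doc = doc["ugokuIllustData"]
        frames = doc.get("frames") if isinstance(doc, dict) else None
        if isinstance(frames, list):
            # PixivUtil2 (and the unwrapped PixivToolkit case): nested "frames".
            return frames
    raise ValueError("unrecognised animation.json layout")


if __name__ == "__main__":
    samples = [
        '[{"file": "000001.jpg", "delay": 125}]',
        '{"frames": [{"file": "000001.jpg", "delay": 125}], "zipSize": 123456}',
        '{"ugokuIllustData": {"mime_type": "image/jpeg", '
        '"frames": [{"file": "000000.jpg", "delay": 100}]}}',
    ]
    for text in samples:
        print(extract_frames(json.loads(text)))
```

Unknown keys such as `zipSize` are simply ignored here, which is the behaviour an extended format would need from other tools.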
For players, I know that BandiView (formerly HoneyView) and Hydrus support .ugoira files. Hydrus supports the gallery-dl and PixivUtil2 formats and BandiView supports the gallery-dl and PixivToolkit formats (I guess it probably supports PixivUtil2 too, but I couldn't test it).
For us, I think we can use the PixivUtil2 format and add some extra fields. Hydrus looks like it should ignore any extra fields, and BandiView hopefully should too, but I haven't tested it. That's the question here: which formats exist, which one we should use, and whether other tools can tolerate it if we extend the format.
> Are we really concerned about it?
I don't know, for archival purposes someone might want to be able to use the original player.
> Not right. On pixiv these are on white background due to conversion, not by artist intent.

Artists don't always make the background transparent on purpose; often they work on a white background and don't think about it being viewed otherwise. Sometimes this leads to mistakes where they forget to hide layers or leave holes in the background.
Those three formats all kinda correspond to pixiv's `/ugoira_meta` api response:
- `gallery-dl` picks only the `frames` array and puts it at the top level
- `PixivUtil2` grabs the response as is (so it also includes the `src`, `originalSrc`, `mime_type` fields), and also adds `zipSize` (size of the archive excluding `animation.json`), which is a non-standard custom field that is used for PixivUtil2's own duplicate checks, I believe (?)
- `PixivToolkit` grabs the response as is and wraps it in `ugokuIllustData`
- this PR picks `frames` and writes a `mime_type` that corresponds to the frame filename extension (pixiv does not allow uploading frames of different formats)
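The last bullet could look roughly like the following. This is a sketch, not the PR's code: the input shape is taken from the `/ugoira_meta` fields quoted in this thread, while the extension table, the key order of the output, and the name `build_animation_json` are assumptions.

```python
import json
import os
from typing import Any, Dict

# Assumption: a small explicit table instead of a system MIME database.
MIME_BY_EXT = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
}


def build_animation_json(meta: Dict[str, Any]) -> str:
    """Keep only frames, and derive mime_type from the frame file extension."""
    frames = [{"file": f["file"], "delay": f["delay"]} for f in meta["frames"]]
    extensions = {os.path.splitext(f["file"])[1].lower() for f in frames}
    if len(extensions) != 1:
        # Mixed or missing extensions are not expected from pixiv.
        raise ValueError("frames must share exactly one file extension")
    (ext,) = extensions
    document = {"mime_type": MIME_BY_EXT[ext], "frames": frames}
    return json.dumps(document, separators=(",", ":"))


if __name__ == "__main__":
    response = {
        "src": "https://i.pximg.net/img-zip-ugoira/img/2021/09/13/08/02/41/92714531_ugoira600x600.zip",
        "mime_type": "image/jpeg",
        "frames": [{"file": "000000.png", "delay": 100}],
    }
    print(build_animation_json(response))
```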
Given that neither BandiView nor Hydrus choke on PixivUtil2's zipSize field, I assume it is safe to add extra info in json.