mastodon-archive
Compress JSON
I have just tested this script for the first time, and the JSON file with the statuses is astonishingly huge given that I only have a handful of toots. There is a lot of redundancy in there; even simple ZIP compression reduces the size by about 90%.
There could be an option --compress that transparently compresses and decompresses these files on save and load, using one of Python's built-in compression modules.
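Something along these lines, as a minimal sketch using the built-in gzip module (the function names save_statuses/load_statuses are made up for illustration, not taken from the actual code):

```python
import gzip
import json

def save_statuses(path, statuses, compress=False):
    """Write statuses as JSON, gzipped when compress is True."""
    if compress:
        with gzip.open(path + ".gz", "wt", encoding="utf-8") as f:
            json.dump(statuses, f)
    else:
        with open(path, "w", encoding="utf-8") as f:
            json.dump(statuses, f, indent=2)

def load_statuses(path, compress=False):
    """Read statuses back from a plain or gzipped JSON file."""
    if compress:
        with gzip.open(path + ".gz", "rt", encoding="utf-8") as f:
            return json.load(f)
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```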
I agree.
It doesn't matter much given the small size of the resulting text file, but may I advocate for a better compression algorithm than ZIP? I'm thinking of Zstandard: it's widely supported now (though perhaps less so outside of Linux?), with very fast compression/decompression and a very good compression ratio. But anything else is fine :)
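For comparison, a rough sketch of the same idea with Zstandard, assuming the third-party zstandard package (pip install zstandard) would be acceptable as a dependency; the function names are again hypothetical:

```python
import json
import zstandard

def save_statuses_zstd(path, statuses):
    """Write statuses as zstd-compressed JSON."""
    data = json.dumps(statuses).encode("utf-8")
    cctx = zstandard.ZstdCompressor(level=10)
    with open(path + ".zst", "wb") as f:
        f.write(cctx.compress(data))

def load_statuses_zstd(path):
    """Read statuses back from a zstd-compressed JSON file."""
    dctx = zstandard.ZstdDecompressor()
    with open(path + ".zst", "rb") as f:
        return json.loads(dctx.decompress(f.read()))
```

Sticking with gzip would keep the script dependency-free, though, which may matter more here than the extra ratio.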
Whoever implements it gets to decide. 😄
Oh ok, for some reason I thought you were going to do it :sweat_smile:
I actually don't care much; I store it compressed anyway (filesystem compression using Btrfs).
For the record, a ~1 GB archive compresses to around 100 MB (using zstd), which makes quite a big difference :slightly_smiling_face:
It sure does! I'm basically just storing the results of the Mastodon client calls, so every response contains all the account info of the author, if I remember correctly. And it's all pretty-printed. So compression definitely helps!
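Just to illustrate the pretty-printing part: even before compression, dumping with compact separators instead of an indent shrinks the output, e.g.:

```python
import json

# A toy status with nested account info, loosely shaped like an API response.
status = {"id": "1", "account": {"username": "alice", "display_name": "Alice"}}

pretty = json.dumps(status, indent=2)                 # what gets stored today
compact = json.dumps(status, separators=(",", ":"))   # no whitespace at all
print(len(pretty), len(compact))  # the compact form is noticeably shorter
```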
As for myself, I’m just not courageous enough to run a non-standard file system. Ext4 forever, I guess. 😂
That shouldn't be the point here anyway; it would be great if the JSON were stored compressed regardless. I will see if I have time to implement this… Don't be too hopeful :sweat_smile:
I wonder whether this should be optional (or automatic: detect if a .gz variant already exists, and if it does, use that; see the sketch below). I don't have a compressed filesystem, but if I did, I'm assuming I wouldn't want to have the data recompressed?
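The detection could be as simple as checking for the .gz variant first; a sketch, with the function name and path handling purely hypothetical:

```python
import gzip
import json
import os

def open_archive(path):
    """Load the archive, preferring an existing .gz variant."""
    gz_path = path + ".gz"
    if os.path.exists(gz_path):
        with gzip.open(gz_path, "rt", encoding="utf-8") as f:
            return json.load(f)
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```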
I'm not sure it's a big deal. And most of the time the filesystem detects that the data is already compressed (in fact, that it can't be compressed further) and skips it.
Also, it's quite a rare use case.