rethinkdb-python
Compressed output format (jsongz) for rethinkdb export/import
Reason for the change https://github.com/rethinkdb/rethinkdb-python/issues/249
Description: Implemented a new export format, jsongz (gzipped JSON), in the rethinkdb-export and rethinkdb-import scripts. On the export side, there is a new jsongz writer (based on the json writer implementation) that passes the output through a zlib compressor. On the import side, JsonGzSourceFile extends a slightly modified JsonSourceFile and can read gzipped JSON data files directly. The SourceFile constructor was also extended to read the uncompressed size from the gzip trailer.
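For context on the last point: the gzip trailer stores the uncompressed length (modulo 2**32) as a little-endian 32-bit integer in the file's last four bytes, so it can be read without decompressing anything. A minimal sketch (the helper name is hypothetical, not the actual code in this PR):

```python
import gzip
import os
import struct
import tempfile

def gzip_uncompressed_size(path):
    """Read ISIZE from the gzip trailer: the uncompressed length
    modulo 2**32, stored little-endian in the last 4 bytes."""
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)
        (isize,) = struct.unpack("<I", f.read(4))
    return isize

# Demo: write 1000 bytes through gzip and read the size back.
with tempfile.NamedTemporaryFile(suffix=".jsongz", delete=False) as tmp:
    path = tmp.name
with gzip.open(path, "wb") as f:
    f.write(b"x" * 1000)
print(gzip_uncompressed_size(path))  # 1000
os.remove(path)
```

Note the modulo: for files whose uncompressed size exceeds 4 GiB, ISIZE wraps around, so it is only a hint, not an exact size.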
Checklist
- [x] I have read and agreed to the RethinkDB Contributor License Agreement
References
Usage:
rethinkdb-export -e test -d export --format jsongz
rethinkdb-export -e test -d export --format jsongz --compression-level 5
rethinkdb-import -i test -d export
Tested with Python 2.7.16 and Python 3.8.5.
Hello @iantocristian 👋 First of all, thank you for your contribution here! 🎉
Could you please add some unit/integration tests for this functionality?
👋 @gabor-boros
I would, but it looks like no unit/integration tests exist for the import/export scripts in general 😅. Seems like a big job.
@lsabi could you please double check this?
... there are missing tests, which could become a problem. We could write them at a later stage.
I am on the same page here. We need some tests for the backup and restore functionality. But it feels like a different story / PR is in order.
One other comment I got was related to the `jsongz` extension I used for the data files: why not `json.gz`? I used `jsongz` because `splitext` can't handle `json.gz`, and the latter would have required more code changes elsewhere in the scripts. The downside of `jsongz` is that unpacking is more cumbersome; in most cases the extension has to be changed before unpacking (e.g. `gzip -d` won't accept the `jsongz` extension). Any thoughts about this?
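The `splitext` limitation is easy to reproduce: it splits only on the last dot, so a double extension like `.json.gz` loses the `.json` part (the file paths below are just examples):

```python
import os.path

# splitext strips only the final suffix, so ".json.gz" is not
# recognized as a single extension:
print(os.path.splitext("export/test.users.json.gz"))
# ('export/test.users.json', '.gz')

# A single combined extension round-trips cleanly:
print(os.path.splitext("export/test.users.jsongz"))
# ('export/test.users', '.jsongz')
```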
I haven't worked on the import or export scripts, but one option is to keep a list of supported extensions and match the last part of the filename against it.
Another alternative could be a hash table in which supported extensions point to other extensions, like:

```python
SUPPORTED_EXT = {
    "json": True,
    "gz": {"json": True},
}
```

This way, files ending with `json` can be decoded immediately, while those ending with `gz` must have `json` before the `gz`, which is checked next. Although I don't know how much work performing the switch would be.
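One way to sketch that lookup (the function name is hypothetical, not part of the PR): walk the filename's suffixes from right to left through the nested table, accepting once a `True` leaf is reached.

```python
SUPPORTED_EXT = {
    "json": True,
    "gz": {"json": True},
}

def match_extension(filename, table=SUPPORTED_EXT):
    """Walk suffixes right-to-left through the nested table.
    Returns the matched compound extension, or None."""
    parts = filename.split(".")[1:]  # drop the base name
    matched = []
    node = table
    for ext in reversed(parts):
        entry = node.get(ext) if isinstance(node, dict) else None
        if entry is None:
            return None
        matched.append(ext)
        if entry is True:
            return ".".join(reversed(matched))
        node = entry
    return None

print(match_extension("test.users.json"))     # json
print(match_extension("test.users.json.gz"))  # json.gz
print(match_extension("test.users.gz"))       # None (bare .gz unsupported)
```

The nice property is that a bare `.gz` without a preceding `.json` is rejected, exactly as described above.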
To me it looks good.
The only thing is the newline after each document, which I don't know whether it's worth removing or not.
We can write tests at a later stage and check how much it influences the size of the generated file.
@gabor-boros what do you think? For me, this can pass.
> The only thing is the newline after each document, which I don't know whether it's worth removing or not.
It's one newline per document, right? Plus another one at the start and one at the end.
For a table with 1,000 documents with an average size of 1 kB, you save 1,002 bytes uncompressed, less than a 0.1% gain. For a table with 10,000 documents with an average size of 200 bytes, you save 10,002 bytes uncompressed, roughly a 0.5% gain. For documents larger than 1 kB, the gain is negligible.
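The back-of-the-envelope arithmetic above can be checked directly (uncompressed figures only; the helper name is hypothetical, and in practice gzip compresses the repeated newlines almost for free anyway):

```python
def newline_overhead(num_docs, avg_doc_size):
    """Bytes saved by dropping the per-document newlines plus the
    leading and trailing ones, and that saving as a fraction of
    the raw (uncompressed) table size."""
    saved = num_docs + 2
    total = num_docs * avg_doc_size
    return saved, saved / total

print(newline_overhead(1000, 1024))  # (1002, ~0.00098) -> under 0.1%
print(newline_overhead(10000, 200))  # (10002, ~0.005)  -> about 0.5%
```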
Not worth it imo.
Another point is that jsongz = gzipped JSON, so it should produce the same output as json, just compressed.
Percentages vary based on document size, but if you have a table with millions of records, it implies saving MBs of space. Sure, it won't be much compared to the total size, but it may make the data easier to fit into memory. I'm not sure there'll be such a big table, though.
Nevertheless, as I said, this can be done at a later stage.
What's your point about jsongz? I don't understand it.
> Nevertheless, as I said, this can be done at a later stage.
👍
> What's your point about jsongz? I don't understand it.
That it wasn't my intention to change the content that's being dumped, just to compress it.
It could have been an option on the `json_writer`, but I thought a separate writer was less risky.
Don't worry, we can keep them separate and merge one day.
@gabor-boros do you have anything to add/to complain about this PR?
@lsabi / @gabor-boros just wondering if there is any update on getting this merged? It would be super useful for us. Thanks!