
Compressed output format (jsongz) for rethinkdb export/import

Open iantocristian opened this issue 3 years ago • 12 comments

Reason for the change: https://github.com/rethinkdb/rethinkdb-python/issues/249

Description: Implements a new export format, jsongz (gzipped JSON), in the rethinkdb-export and rethinkdb-import scripts. On the export side there is a new jsongz writer (based on the json writer implementation) that passes its output through a zlib compressor. On the import side, JsonGzSourceFile extends JsonSourceFile (slightly modified) and can read the gzipped JSON data files directly. The SourceFile constructor was also extended to read the uncompressed size from the gzip trailer.
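For reference, the gzip trailer trick mentioned above can be sketched as follows. This is a minimal illustration, not the PR's code; the function name is mine. Note the trailer's ISIZE field stores the uncompressed size modulo 2**32 and is only meaningful for single-member gzip files:

```python
import os
import struct

def gzip_uncompressed_size(path):
    """Read ISIZE from the gzip trailer: the last 4 bytes of a
    gzip file hold the uncompressed size modulo 2**32,
    little-endian (RFC 1952)."""
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)
        return struct.unpack("<I", f.read(4))[0]
```

This avoids decompressing the whole file just to learn its size, which is useful for progress reporting during import.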

Checklist

References

Usage:

rethinkdb-export -e test -d export --format jsongz
rethinkdb-export -e test -d export --format jsongz --compression-level 5
rethinkdb-import -i test -d export

Tested with Python 2.7.16 and Python 3.8.5.

iantocristian avatar Feb 19 '21 18:02 iantocristian

Hello @iantocristian 👋 First of all, thank you for your contribution here! 🎉

Could you please add some unit/integration tests for this functionality?

gabor-boros avatar Mar 05 '21 21:03 gabor-boros

👋 @gabor-boros

I would, but it looks like no unit/integration tests exist for the import/export scripts in general 😅. Seems like a big job.

iantocristian avatar Mar 06 '21 11:03 iantocristian

@lsabi could you please double check this?

gabor-boros avatar Mar 23 '21 12:03 gabor-boros

... there are missing tests, which could become a problem. We could write them at a later stage.

I am on the same page here. We need some tests for the backup and restore functionality, but it feels like a separate story / PR is in order.

iantocristian avatar Mar 24 '21 10:03 iantocristian

One other comment I got was about the jsongz extension I used for the data files: why not json.gz? I used jsongz because splitext can't handle json.gz, and supporting it would have required more code changes elsewhere in the scripts.

The downside of using jsongz is that unpacking is more cumbersome: in most cases the extension has to be renamed before unpacking (e.g. the gzip -d command won't accept the jsongz extension). Any thoughts on this?
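To illustrate the splitext limitation behind this choice (the filenames here are just examples):

```python
import os.path

# splitext strips only the final extension, so a compound
# suffix like ".json.gz" is not treated as one unit:
print(os.path.splitext("users.jsongz"))   # ('users', '.jsongz')
print(os.path.splitext("users.json.gz"))  # ('users.json', '.gz')
```

With json.gz, every splitext call site would see ".gz" and lose the "json" part, hence the single-token jsongz extension.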

iantocristian avatar Mar 24 '21 12:03 iantocristian

I haven't worked on the import or export scripts, but one option is to keep a list of supported extensions and match the last part of the filename against it.

Another alternative could be a hash of supported extensions pointing to further expected extensions, like

SUPPORTED_EXT = {
    "json": True,
    "gz": {"json": True}
} 

This way, files ending in json can be decoded immediately, while those ending in gz must have json before the gz, which is then checked. Although I don't know how much work the switch would take.
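The nested-table idea above could look something like this. A hedged sketch only: the is_supported helper and its exact matching rules are mine, not code from the scripts:

```python
SUPPORTED_EXT = {
    "json": True,
    "gz": {"json": True},
}

def is_supported(filename):
    # Walk the filename's extensions right-to-left through the
    # nested table; True means "accepted", a dict means "keep
    # checking the next extension to the left".
    parts = filename.lower().split(".")[1:]
    table = SUPPORTED_EXT
    for ext in reversed(parts):
        entry = table.get(ext)
        if entry is True:
            return True
        if not isinstance(entry, dict):
            return False
        table = entry
    return False
```

So "users.json" and "users.json.gz" match, while a bare "users.gz" does not.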

lsabi avatar Mar 24 '21 20:03 lsabi

To me it looks good.

Only the new line feed, which I don't know if it's worth removing or not.

We can, in a second moment, write tests and check how much it influences the size of the generated file.

@gabor-boros what do you think? For me it can pass.

lsabi avatar Mar 26 '21 20:03 lsabi

Only the new line feed, which I don't know if it's worth removing it or not.

It's one newline per document, right? Plus another at the start and another at the end.

For a table with 1,000 documents averaging 1 kB each, you save 1,002 bytes uncompressed, less than a 0.1% gain. For a table with 10,000 documents averaging 200 bytes, you save 10,002 bytes uncompressed, roughly a 0.5% gain. For anything larger than about 1 kB per document the gain is negligible.
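The back-of-envelope numbers above can be checked quickly (a throwaway calculation, not PR code; assumes 1 kB = 1024 bytes):

```python
# One newline per document, plus one at the start and one at the end.
docs, avg_size = 1000, 1024
newline_bytes = docs + 2
fraction = newline_bytes / (docs * avg_size + newline_bytes)
print(round(fraction * 100, 3))  # 0.098, i.e. under 0.1% of the output
```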

Not worth it imo.

Another point is that jsongz = gzipped json, so it should produce the same output as json, just compressed.

iantocristian avatar Mar 27 '21 08:03 iantocristian

Percentages vary with document size. But if you have a table with millions of records, it means saving MBs of space. Sure, it won't be much compared to the total size, but it may make the data easier to fit into memory. I'm not sure there'll be such a big table, though.

Nevertheless, as I said, this can be done at a later stage.

What's your point about jsongz? I don't understand it.

lsabi avatar Mar 28 '21 20:03 lsabi

Nevertheless, as I said, this can be done at a later stage.

👍

What's your point about jsongz? I don't understand it.

That it wasn't my intention to change the content being dumped, just to compress it. It could have been an option on the json writer, but I thought a separate writer was less risky.

iantocristian avatar Mar 30 '21 12:03 iantocristian

Don't worry, we can keep them separate and merge one day.

@gabor-boros do you have anything to add or any complaints about this PR?

lsabi avatar Mar 30 '21 20:03 lsabi

@lsabi / @gabor-boros just wondering if there was an update on getting this merged in? It would be super useful for us. Thanks

AlexC avatar Mar 22 '24 14:03 AlexC