
[ENH] Much faster compressed uploads using new REST API features

Open · lmeyerov opened this issue 4 years ago · 1 comment

Especially in distributed settings, a bit of compression can go a long way for faster uploads:

Easy wins

The current REST API supports compression at several layers:

  • [ ] Maybe: Instead of Arrow, send a single Parquet file with Snappy compression
  • [ ] Generic: Send as gz/gzip
  • [x] Cache: Use File IDs for nodes/edges via a global weakmap of df -> FileID (https://github.com/graphistry/pygraphistry/pull/195)
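The generic gz/gzip path above can be sketched with stdlib tooling alone. The payload below is a stand-in for a serialized Arrow IPC or Parquet buffer, and the commented-out upload call is an assumption about how the request would look, not the actual REST client code:

```python
import gzip

# Stand-in for a serialized Arrow IPC (or Parquet) payload; in practice this
# would come from pyarrow serialization of the nodes/edges table.
payload = b'{"src": 1, "dst": 2},' * 10_000

# Compress before upload; repetitive tabular data compresses well.
compressed = gzip.compress(payload, compresslevel=6)
print(len(payload), "->", len(compressed))

# The upload itself would then set the standard header so the server can
# transparently decode, e.g. (hypothetical endpoint and client):
#   requests.post(upload_url, data=compressed,
#                 headers={"Content-Encoding": "gzip"})
```

The server-side cost is one transparent decode step, which most web stacks (including nginx) already support.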

Trickier wins

  • [ ] Do a quick per-col Categoricals check to see if we can dictionary-encode any cols
  • [ ] Multi-part uploads (multiple parquet, ..)
  • [x] Hash-checked files (https://github.com/graphistry/pygraphistry/pull/195)
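The per-col categoricals check could be as simple as a cardinality-ratio heuristic before dictionary-encoding. This is a pure-Python sketch; the 0.5 cutoff is an illustrative assumption, not a measured threshold:

```python
def worth_dictionary_encoding(values, max_unique_ratio=0.5):
    """Heuristic: a column is a dictionary-encoding candidate when its
    distinct-value count is a small fraction of its length."""
    if not values:
        return False
    return len(set(values)) / len(values) <= max_unique_ratio

# A low-cardinality string column benefits; a unique ID column does not.
print(worth_dictionary_encoding(["usa", "uk", "usa", "fr", "usa", "uk"]))  # True
print(worth_dictionary_encoding(list(range(100))))  # False
```

In practice this would run against pandas/cudf columns (e.g. via `nunique()`), with the winners converted to categoricals before serialization.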

Interface

Unclear what the defaults + user overrides should be:

Default:

  • No compression when nginx / localhost / 127.0.0.1
  • No compression when table is < X KB
  • Otherwise, compress?
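The default policy above can be sketched as one predicate. The host list and the 64 KB size floor are illustrative assumptions, not decided values:

```python
def should_compress(server: str, payload_bytes: int,
                    min_bytes: int = 64 * 1024) -> bool:
    """Default policy sketch: skip compression for local servers and
    tiny tables; compress everything else."""
    local_hosts = {"nginx", "localhost", "127.0.0.1"}
    if server in local_hosts:
        return False
    if payload_bytes < min_bytes:
        return False
    return True

print(should_compress("localhost", 10_000_000))         # False: local
print(should_compress("hub.graphistry.com", 1_000))     # False: tiny table
print(should_compress("hub.graphistry.com", 10_000_000))  # True
```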

Override:

  • In the register < settings < plot() cascade, be able to decide what happens at each level
  • When providing arrow/parquet, that may be meaningful too

Ex:

graphistry.register(server='nginx')
g.plot() # no compression
g.edges(small_df).plot() # no compression
g.edges(big_arr).plot() # auto-compress
graphistry.register(transfer_encoding='gzip', gzip_opts={...})
g = g.settings(transfer_type='parquet')
g.edges(small_arr).plot(parquet_opts={...})
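
The register < settings < plot() cascade amounts to a most-specific-wins lookup. A minimal sketch, where all option names are illustrative:

```python
def resolve(setting, register_opts, settings_opts, plot_opts):
    """Most-specific-wins cascade: plot() overrides .settings(),
    which overrides register()."""
    for opts in (plot_opts, settings_opts, register_opts):
        if opts.get(setting) is not None:
            return opts[setting]
    return None

register_opts = {"transfer_encoding": "gzip"}
settings_opts = {"transfer_type": "parquet"}
plot_opts = {"transfer_encoding": "identity"}

print(resolve("transfer_encoding", register_opts, settings_opts, plot_opts))  # identity
print(resolve("transfer_type", register_opts, settings_opts, plot_opts))      # parquet
```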

Another thought is:

g.plot(compression='auto' | True | False | None)

  • When given pandas/cudf/arrow/etc., we do auto policies
  • When given parquet:
    • by default, we do nothing: the user can control many optimizations at that level and we just pass along
    • compression=True will let us start doing things again
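
The proposed compression= semantics could look like the following sketch; the kind names and return values are illustrative, not settled API:

```python
def compression_plan(data_kind, compression="auto"):
    """Sketch of proposed .plot(compression=...) semantics:
    dataframe/arrow inputs get auto policies; user-supplied parquet
    passes through untouched unless compression=True re-enables work."""
    if compression is False or compression is None:
        return "none"
    if data_kind == "parquet":
        return "recompress" if compression is True else "passthrough"
    # pandas / cudf / arrow inputs: library decides
    return "auto"

print(compression_plan("pandas"))                      # auto
print(compression_plan("parquet"))                     # passthrough
print(compression_plan("parquet", compression=True))   # recompress
```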

Or somewhere in between.

Prioritization

  • The new File API and point-and-click features encourage more & bigger uploads
  • User reports of upload issues when on slow networks
  • Usage will ensure steady early exercise of the new APIs

References

  • Multiple potential encodings - gzip, brotli, ... - and not hard to add server support if any preferred
  • REST API: https://hub.graphistry.com/docs/api/2/rest/upload/data/#uploaddata2
  • PyArrow
    • Dictionary encoding for categoricals: https://arrow.apache.org/docs/python/generated/pyarrow.compress.html
    • new gzip-level support, but unclear if useful at that level: https://arrow.apache.org/docs/python/generated/pyarrow.compress.html
  • Parquet:
    • cudf defaults to snappy, I think: https://docs.rapids.ai/api/cudf/nightly/api.html?highlight=parquet#cudf.io.parquet.to_parquet
    • pyarrow parquet writer has fancier per-col modes: https://docs.rapids.ai/api/cudf/nightly/api.html?highlight=parquet#cudf.io.parquet.to_parquet

lmeyerov · Nov 05 '20 19:11

Partially addressed via https://github.com/graphistry/pygraphistry/pull/195 : Avoid reuploads with api=3 + .plot(as_files=True)

lmeyerov · Jan 12 '21 01:01