[ENH] Much faster compressed uploads using new REST API features
Especially in distributed settings, a bit of compression can go a long way for faster uploads:
Easy wins
The current REST API supports compression at several layers:
- [ ] Maybe: Instead of Arrow, send a single Parquet file with Snappy compression (see the sketch after this list)
- [ ] Generic: Send as gz/gzip
- [x] Cache: Use File IDs for nodes/edges via a global weakmap of df -> FileID (https://github.com/graphistry/pygraphistry/pull/195)
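A minimal sketch of the first two options, assuming pyarrow and pandas; the codecs and the toy table are illustrative, not a settled design:

```python
import gzip
import io

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'src': [0, 1, 2], 'dst': [1, 2, 0]})
table = pa.Table.from_pandas(df)

# Option 1: a single Parquet payload with Snappy compression
parquet_buf = io.BytesIO()
pq.write_table(table, parquet_buf, compression='snappy')

# Option 2: gzip the Arrow IPC stream we already produce today
arrow_buf = io.BytesIO()
with pa.ipc.new_stream(arrow_buf, table.schema) as writer:
    writer.write_table(table)
gz_payload = gzip.compress(arrow_buf.getvalue())

print(len(parquet_buf.getvalue()), len(gz_payload))  # compare on real tables, not toys
```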
Trickier wins
- [ ] Do a quick per-column Categoricals check to see if we can dictionary-encode any columns (see the sketch after this list)
- [ ] Multi-part uploads (multiple parquet, ..)
- [x] Hash-checked files (https://github.com/graphistry/pygraphistry/pull/195)
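Rough sketches of the first and last items, assuming pandas input; the cardinality threshold and the `upload` callable are assumptions, not existing pygraphistry API:

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object

def dictionary_encode_candidates(df: pd.DataFrame, max_ratio: float = 0.5) -> pd.DataFrame:
    # Cast low-cardinality object columns to Categorical; pyarrow then
    # dictionary-encodes them on conversion. The 0.5 ratio is a guess.
    out = df.copy()
    for col in out.select_dtypes(include='object').columns:
        if out[col].nunique(dropna=False) <= max_ratio * len(out):
            out[col] = out[col].astype('category')
    return out

_file_ids: dict = {}  # content digest -> server File ID

def cached_file_id(df: pd.DataFrame, upload) -> str:
    # Hash-checked reuse: only reupload when content changes.
    # `upload` stands in for whatever POSTs the table and returns a File ID.
    digest = hashlib.sha256(hash_pandas_object(df, index=True).values.tobytes()).hexdigest()
    if digest not in _file_ids:
        _file_ids[digest] = upload(df)
    return _file_ids[digest]
```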
Interface
Unclear what the defaults + user overrides should be --
Default:
- No compression when the server is nginx/localhost/127.0.0.1
- No compression when the table is < X KB
- Otherwise, compress? (sketched below)
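A minimal sketch of that default policy; the host list, the threshold, and all names are placeholders:

```python
from urllib.parse import urlparse

LOCAL_HOSTS = {'nginx', 'localhost', '127.0.0.1'}
MIN_COMPRESS_BYTES = 100 * 1024  # stand-in for the "< X KB" cutoff above

def should_compress(server: str, payload_bytes: int) -> bool:
    host = urlparse(server).hostname or server  # handle bare hostnames too
    if host in LOCAL_HOSTS:
        return False  # same-box hop: compression only burns CPU
    if payload_bytes < MIN_COMPRESS_BYTES:
        return False  # tiny tables: overhead outweighs the savings
    return True
```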
Override:
- In the `register` < `settings` < `plot()` cascade, be able to decide what happens
- When providing Arrow/Parquet, that may be meaningful too
Ex:
```python
graphistry.register(server='nginx')
g.plot()  # no compression
g.edges(small_df).plot()  # no compression
g.edges(big_arr).plot()  # auto-compress
```
```python
graphistry.register(transfer_encoding='gzip', gzip_opts={...})
g = g.settings(transfer_type='parquet')
g.edges(small_arr).plot(parquet_opts={...})
```
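At the HTTP layer, a `transfer_encoding='gzip'` override presumably reduces to a compressed body plus the standard `Content-Encoding` header; a hedged sketch with a placeholder endpoint (see the REST docs in References):

```python
import gzip
import io

import pandas as pd
import pyarrow as pa
import requests

UPLOAD_URL = 'https://<server>/<upload-endpoint>'  # placeholder, not the real path

# Serialize a small table to the Arrow IPC stream format, as today
table = pa.Table.from_pandas(pd.DataFrame({'src': [0, 1], 'dst': [1, 0]}))
buf = io.BytesIO()
with pa.ipc.new_stream(buf, table.schema) as writer:
    writer.write_table(table)

resp = requests.post(
    UPLOAD_URL,
    data=gzip.compress(buf.getvalue()),  # compressed body
    headers={
        'Content-Encoding': 'gzip',  # brotli would be 'br' if the server adds support
        'Content-Type': 'application/vnd.apache.arrow.stream',
    },
)
resp.raise_for_status()
```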
Another thought is:
`g.plot(compression='auto' | True | False | None)`
- When given pandas/cudf/arrow/etc., we do auto policies
- When given parquet:
  - By default, we do nothing: the user can control many optimizations at that level, and we just pass it along
  - `compression=True` will let us start doing things again
Or somewhere in between; a possible dispatch is sketched below.
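One possible shape for that dispatch, with all names hypothetical:

```python
def resolve_compression(data, compression='auto'):
    # Hypothetical resolution of a compression= kwarg into a codec choice
    if compression in (False, None):
        return None
    is_prebuilt_parquet = isinstance(data, (bytes, bytearray))  # assume raw Parquet arrives as bytes
    if compression == 'auto':
        # Pass user-built Parquet through untouched: they already chose its codecs
        return None if is_prebuilt_parquet else 'gzip'
    return 'gzip'  # compression=True: opt back in, even for Parquet payloads
```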
Prioritization
- The new File API and point-and-click features encourage more & bigger uploads
- User reports of upload issues when on slow networks
- Usage will ensure steady early exercise of the new APIs
References
- Multiple potential encodings (gzip, brotli, ...); adding server support is not hard if any is preferred
- REST API: https://hub.graphistry.com/docs/api/2/rest/upload/data/#uploaddata2
- PyArrow
  - Dictionary encoding for categoricals: https://arrow.apache.org/docs/python/generated/pyarrow.compress.html
  - New gzip-level support, but unclear if useful at that level: https://arrow.apache.org/docs/python/generated/pyarrow.compress.html
- Parquet:
  - cudf defaults to snappy, I think: https://docs.rapids.ai/api/cudf/nightly/api.html?highlight=parquet#cudf.io.parquet.to_parquet
  - pyarrow's Parquet writer has fancier per-column modes: https://docs.rapids.ai/api/cudf/nightly/api.html?highlight=parquet#cudf.io.parquet.to_parquet
Partially addressed (avoid reuploads with `api=3` + `.plot(as_files=True)`) via https://github.com/graphistry/pygraphistry/pull/195