
[ENH] Much faster compressed uploads using new REST API features

Open · lmeyerov opened this issue 4 years ago · 1 comment

Especially in distributed settings, a bit of compression can go a long way for faster uploads:

Easy wins

The current REST API supports compression at several layers:

  • [ ] Maybe: Instead of Arrow, send a single Parquet file with Snappy compression
  • [ ] Generic: Send as gz/gzip
  • [x] Cache: Use File IDs for nodes/edges via a global weakmap of df -> FileID (https://github.com/graphistry/pygraphistry/pull/195)
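The generic gz/gzip path above can be sketched with stdlib tooling alone. The payload below is a stand-in for a serialized Arrow IPC or Parquet buffer, and the commented-out upload call is an assumption about how the request would look, not the actual REST client code:

```python
import gzip

# Stand-in for a serialized Arrow IPC (or Parquet) payload; in practice this
# would come from pyarrow serialization of the nodes/edges table.
payload = b'{"src": 1, "dst": 2},' * 10_000

# Compress before upload; repetitive tabular data compresses well.
compressed = gzip.compress(payload, compresslevel=6)
print(len(payload), "->", len(compressed))

# The upload itself would then set the standard header so the server can
# transparently decode, e.g. (hypothetical endpoint and client):
#   requests.post(upload_url, data=compressed,
#                 headers={"Content-Encoding": "gzip"})
```

The server-side cost is one transparent decode step, which most web stacks (including nginx) already support.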

Trickier wins

  • [ ] Do a quick per-col Categoricals check to see if we can dictionary-encode any cols
  • [ ] Multi-part uploads (multiple parquet, ..)
  • [x] Hash-checked files (https://github.com/graphistry/pygraphistry/pull/195)
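The per-col categoricals check could be as simple as a cardinality-ratio heuristic before dictionary-encoding. This is a pure-Python sketch; the 0.5 cutoff is an illustrative assumption, not a measured threshold:

```python
def worth_dictionary_encoding(values, max_unique_ratio=0.5):
    """Heuristic: a column is a dictionary-encoding candidate when its
    distinct-value count is a small fraction of its length."""
    if not values:
        return False
    return len(set(values)) / len(values) <= max_unique_ratio

# A low-cardinality string column benefits; a unique ID column does not.
print(worth_dictionary_encoding(["usa", "uk", "usa", "fr", "usa", "uk"]))  # True
print(worth_dictionary_encoding(list(range(100))))  # False
```

In practice this would run against pandas/cudf columns (e.g. via `nunique()`), with the winners converted to categoricals before serialization.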

Interface

Unclear what the defaults + user overrides should be:

Default:

  • No compression when nginx / localhost / 127.0.0.1
  • No compression when table is < X KB
  • Otherwise, compress?
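The default policy above can be sketched as one predicate. The host list and the 64 KB size floor are illustrative assumptions, not decided values:

```python
def should_compress(server: str, payload_bytes: int,
                    min_bytes: int = 64 * 1024) -> bool:
    """Default policy sketch: skip compression for local servers and
    tiny tables; compress everything else."""
    local_hosts = {"nginx", "localhost", "127.0.0.1"}
    if server in local_hosts:
        return False
    if payload_bytes < min_bytes:
        return False
    return True

print(should_compress("localhost", 10_000_000))         # False: local
print(should_compress("hub.graphistry.com", 1_000))     # False: tiny table
print(should_compress("hub.graphistry.com", 10_000_000))  # True
```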

Override:

  • In the register < settings < plot() cascade, be able to decide what happens at each level
  • When providing arrow/parquet, that may be meaningful too

Ex:

graphistry.register(server='nginx')
g.plot() # no compression
g.edges(small_df).plot() # no compression
g.edges(big_arr).plot() # auto-compress
graphistry.register(transfer_encoding='gzip', gzip_opts={...})
g = g.settings(transfer_type='parquet')
g.edges(small_arr).plot(parquet_opts={...})
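
The register < settings < plot() cascade amounts to a most-specific-wins lookup. A minimal sketch, where all option names are illustrative:

```python
def resolve(setting, register_opts, settings_opts, plot_opts):
    """Most-specific-wins cascade: plot() overrides .settings(),
    which overrides register()."""
    for opts in (plot_opts, settings_opts, register_opts):
        if opts.get(setting) is not None:
            return opts[setting]
    return None

register_opts = {"transfer_encoding": "gzip"}
settings_opts = {"transfer_type": "parquet"}
plot_opts = {"transfer_encoding": "identity"}

print(resolve("transfer_encoding", register_opts, settings_opts, plot_opts))  # identity
print(resolve("transfer_type", register_opts, settings_opts, plot_opts))      # parquet
```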

Another thought is:

g.plot(compression='auto' | True | False | None)

  • When given pandas/cudf/arrow/etc., we do auto policies
  • When given parquet:
    • by default, we do nothing: the user can control many optimizations at that level and we just pass along
    • compression=True will let us start doing things again
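
The proposed compression= semantics could look like the following sketch; the kind names and return values are illustrative, not settled API:

```python
def compression_plan(data_kind, compression="auto"):
    """Sketch of proposed .plot(compression=...) semantics:
    dataframe/arrow inputs get auto policies; user-supplied parquet
    passes through untouched unless compression=True re-enables work."""
    if compression is False or compression is None:
        return "none"
    if data_kind == "parquet":
        return "recompress" if compression is True else "passthrough"
    # pandas / cudf / arrow inputs: library decides
    return "auto"

print(compression_plan("pandas"))                      # auto
print(compression_plan("parquet"))                     # passthrough
print(compression_plan("parquet", compression=True))   # recompress
```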

Or somewhere in between.

Prioritization

  • The new File API and point-and-click features encourage more & bigger uploads
  • User reports of upload issues when on slow networks
  • Usage will ensure steady early exercise of the new APIs

References

  • Multiple potential encodings - gzip, brotli, ... - and not hard to add server support if any preferred
  • REST API: https://hub.graphistry.com/docs/api/2/rest/upload/data/#uploaddata2
  • PyArrow
    • Dictionary encoding for categoricals: https://arrow.apache.org/docs/python/generated/pyarrow.compress.html
    • new gzip-level support, but unclear if useful at that level: https://arrow.apache.org/docs/python/generated/pyarrow.compress.html
  • Parquet:
    • cudf defaults to snappy, I think: https://docs.rapids.ai/api/cudf/nightly/api.html?highlight=parquet#cudf.io.parquet.to_parquet
    • pyarrow parquet writer has fancier per-col modes: https://docs.rapids.ai/api/cudf/nightly/api.html?highlight=parquet#cudf.io.parquet.to_parquet

lmeyerov · Nov 05 '20 19:11

Partially addressed via https://github.com/graphistry/pygraphistry/pull/195 : Avoid reuploads with api=3 + .plot(as_files=True)

lmeyerov · Jan 12 '21 01:01