modin icon indicating copy to clipboard operation
modin copied to clipboard

PERF: Consider whether to avoid partition.length() in the parquet dispatcher.

Open mvashishtha opened this issue 3 years ago • 5 comments

Per @YarShev here, we should probably not call partition.length() to get partition sizes in the parquet dispatcher:

  • Even if we have already materialized the index objects in build_index, ray.get for the already computed size may be expensive (we should check this)
  • If we haven't materialized the index in build_index, the length() call may be unnecessarily blocking (maybe something else will block anyway, though?)

mvashishtha avatar Aug 01 '22 15:08 mvashishtha

I found another place that block async execution (dtypes computing if there isn't cache for that): https://github.com/modin-project/modin/blob/adb16a17f721048005520388080627975c6852d8/modin/core/io/file_dispatcher.py#L167

anmyachev avatar Aug 02 '22 08:08 anmyachev

could we get the dtypes from the parquet file metadata and avoid the need to call compute_dtypes later?

jbrockmendel avatar Aug 08 '22 19:08 jbrockmendel

could we get the dtypes from the parquet file metadata and avoid the need to call compute_dtypes later?

@jbrockmendel AFAIK parquet files have their own types that may not necessarily correspond to the types pandas assigns in read_parquet. @pyrito, you were thinking about this too, did you ever find a case where two datasets with the same parquet types get different pandas types after read_parquet?

mvashishtha avatar Aug 09 '22 19:08 mvashishtha

@mvashishtha @jbrockmendel I was thinking about this. I haven't confirmed this but I was wondering if datetime objects would give us some trouble here, esp when we get into timezones and stuff. These have previously been a pain when I've worked with dask.

pyrito avatar Aug 09 '22 19:08 pyrito

can you give an example? im guessing you're referring to pandas.DatetimeTZDtype?

jbrockmendel avatar Aug 09 '22 22:08 jbrockmendel

@jbrockmendel I don't have a minimum example I could show off the bat, but I was wondering if pandas.DatetimeTZDtype could cause some trouble here. I've had some problems before, but maybe the type mappings are better now between Arrow and pandas.

pyrito avatar Aug 11 '22 14:08 pyrito