PERF: Consider whether to avoid partition.length() in the parquet dispatcher.
Per @YarShev here, we should probably not call `partition.length()` to get partition sizes in the parquet dispatcher:
- Even if we have already materialized the index objects in `build_index`, the `ray.get` for the already computed size may be expensive (we should check this)
- If we haven't materialized the index in `build_index`, the `length()` call may be unnecessarily blocking (maybe something else will block anyway, though?)
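For reference, a minimal sketch of what a metadata-based alternative could look like, assuming pyarrow and assuming the dispatcher knows which row groups back each partition (the helper name and its arguments are hypothetical):

```python
import pyarrow.parquet as pq

# Hypothetical helper: derive per-partition row counts from the parquet
# footer instead of materializing partitions with partition.length().
def row_counts_from_metadata(path, row_groups_per_partition):
    # Only the file metadata is read here; no row data is materialized
    # and no blocking call into the execution engine is needed.
    metadata = pq.ParquetFile(path).metadata
    return [
        sum(metadata.row_group(i).num_rows for i in group_ids)
        for group_ids in row_groups_per_partition
    ]

# e.g. two partitions backed by row groups [0, 1] and [2]:
# row_counts_from_metadata("data.parquet", [[0, 1], [2]])
```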
I found another place that blocks async execution (computing dtypes when there is no cache for them): https://github.com/modin-project/modin/blob/adb16a17f721048005520388080627975c6852d8/modin/core/io/file_dispatcher.py#L167
could we get the dtypes from the parquet file metadata and avoid the need to call `compute_dtypes` later?
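For what it's worth, a minimal sketch of reading dtypes from the footer alone, assuming pyarrow (whether these always match what pandas assigns after a full read is exactly the question discussed below; the file path is just for illustration):

```python
import pyarrow.parquet as pq

# Read only the parquet footer, not the row data.
schema = pq.read_schema("data.parquet")

# Map the Arrow schema to pandas dtypes via an empty table.
dtypes = schema.empty_table().to_pandas().dtypes
print(dtypes)
```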
> could we get the dtypes from the parquet file metadata and avoid the need to call `compute_dtypes` later?
@jbrockmendel AFAIK parquet files have their own types that may not necessarily correspond to the types pandas assigns in `read_parquet`. @pyrito, you were thinking about this too, did you ever find a case where two datasets with the same parquet types get different pandas types after `read_parquet`?
@mvashishtha @jbrockmendel I was thinking about this. I haven't confirmed it, but I was wondering if datetime objects would give us some trouble here, especially when we get into timezones. These have previously been a pain when I've worked with Dask.
Can you give an example? I'm guessing you're referring to `pandas.DatetimeTZDtype`?
@jbrockmendel I don't have a minimal example I can show off the bat, but I was wondering if `pandas.DatetimeTZDtype` could cause some trouble here. I've had some problems before, but maybe the type mappings between Arrow and pandas are better now.
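One quick way to probe this (a minimal sketch, assuming pyarrow is the parquet engine; the file name is just for illustration) is to compare the dtype inferred from the schema alone with the dtype after a full read:

```python
import pandas as pd
import pyarrow.parquet as pq

# Write a timezone-aware column and inspect how it round-trips.
df = pd.DataFrame(
    {"ts": pd.date_range("2023-01-01", periods=3, tz="US/Eastern")}
)
df.to_parquet("tz_example.parquet")

schema = pq.read_schema("tz_example.parquet")
print(schema.field("ts").type)  # Arrow type, e.g. timestamp[..., tz=US/Eastern]

# Dtype inferred from the schema alone vs. dtype after a full read.
print(schema.empty_table().to_pandas().dtypes)
print(pd.read_parquet("tz_example.parquet").dtypes)
```

If the two dtypes printed at the end ever disagree, that would be a concrete case where schema-derived dtypes cannot replace `compute_dtypes`.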