PERF: Consider whether to avoid partition.length() in the parquet dispatcher.
Per @YarShev here, we should probably not call `partition.length()` to get partition sizes in the parquet dispatcher:
- Even if we have already materialized the index objects in `build_index`, the `ray.get` for the already computed size may be expensive (we should check this)
- If we haven't materialized the index in `build_index`, the `length()` call may be unnecessarily blocking (maybe something else will block anyway, though?)
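For reference, a minimal sketch of what a metadata-based alternative could look like, assuming pyarrow and assuming the dispatcher knows which row groups back each partition (the helper name and its arguments are hypothetical):

```python
import pyarrow.parquet as pq

# Hypothetical helper: derive per-partition row counts from the parquet
# footer instead of materializing partitions with partition.length().
def row_counts_from_metadata(path, row_groups_per_partition):
    # Only the file metadata is read here; no row data is materialized
    # and no blocking call into the execution engine is needed.
    metadata = pq.ParquetFile(path).metadata
    return [
        sum(metadata.row_group(i).num_rows for i in group_ids)
        for group_ids in row_groups_per_partition
    ]

# e.g. two partitions backed by row groups [0, 1] and [2]:
# row_counts_from_metadata("data.parquet", [[0, 1], [2]])
```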
I found another place that blocks async execution (computing dtypes when there is no cache for them): https://github.com/modin-project/modin/blob/adb16a17f721048005520388080627975c6852d8/modin/core/io/file_dispatcher.py#L167
could we get the dtypes from the parquet file metadata and avoid the need to call `compute_dtypes` later?
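For what it's worth, a minimal sketch of reading dtypes from the footer alone, assuming pyarrow (whether these always match what pandas assigns after a full read is exactly the question discussed below; the file path is just for illustration):

```python
import pyarrow.parquet as pq

# Read only the parquet footer, not the row data.
schema = pq.read_schema("data.parquet")

# Map the Arrow schema to pandas dtypes via an empty table.
dtypes = schema.empty_table().to_pandas().dtypes
print(dtypes)
```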
> could we get the dtypes from the parquet file metadata and avoid the need to call `compute_dtypes` later?
@jbrockmendel AFAIK parquet files have their own types that may not necessarily correspond to the types pandas assigns in `read_parquet`. @pyrito, you were thinking about this too, did you ever find a case where two datasets with the same parquet types get different pandas types after `read_parquet`?
@mvashishtha @jbrockmendel I was thinking about this. I haven't confirmed it, but I was wondering if datetime objects would give us some trouble here, especially when we get into timezones. These have previously been a pain when I've worked with Dask.
Can you give an example? I'm guessing you're referring to `pandas.DatetimeTZDtype`?
@jbrockmendel I don't have a minimal example I can show off the bat, but I was wondering if `pandas.DatetimeTZDtype` could cause some trouble here. I've had some problems before, but maybe the type mappings between Arrow and pandas are better now.
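One quick way to probe this (a minimal sketch, assuming pyarrow is the parquet engine; the file name is just for illustration) is to compare the dtype inferred from the schema alone with the dtype after a full read:

```python
import pandas as pd
import pyarrow.parquet as pq

# Write a timezone-aware column and inspect how it round-trips.
df = pd.DataFrame(
    {"ts": pd.date_range("2023-01-01", periods=3, tz="US/Eastern")}
)
df.to_parquet("tz_example.parquet")

schema = pq.read_schema("tz_example.parquet")
print(schema.field("ts").type)  # Arrow type, e.g. timestamp[..., tz=US/Eastern]

# Dtype inferred from the schema alone vs. dtype after a full read.
print(schema.empty_table().to_pandas().dtypes)
print(pd.read_parquet("tz_example.parquet").dtypes)
```

If the two dtypes printed at the end ever disagree, that would be a concrete case where schema-derived dtypes cannot replace `compute_dtypes`.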