PERF: infer partition dimensions in _compute_dtypes
Context: calls to _compute_dtypes are taking 31.2s apiece in the script I'm profiling, which makes zero sense.
Current code
dtypes = (
    self.tree_reduce(0, lambda df: df.dtypes, dtype_builder)
    .to_pandas()
    .iloc[0]
)
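For intuition about that pattern, here is a plain-pandas sketch (it glosses over dtype_builder and the partitioning entirely): the reduction leaves a one-row frame whose single row holds the dtypes, and .iloc[0] pulls that row back out as a Series.
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5], "c": ["x", "y"]})

# Roughly the shape the reduction produces per column partition: a one-row
# frame whose values are that partition's dtypes.
reduced = df.dtypes.to_frame().T
dtypes = reduced.iloc[0]  # back to a Series indexed by column name
print(dtypes)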
The tree_reduce call creates a DataFrame whose partitions have ObjectRef objects for the _length_cache and _width_cache, but IIUC it is pretty easy to infer their dimensions. Something like
pre_dtypes = self.tree_reduce(0, lambda df: df.dtypes, dtype_builder)
parts = pre_dtypes._partitions
for i in range(parts.shape[0]):
    for j in range(parts.shape[1]):
        old_part = self._partitions[i, j]
        new_part = parts[i, j]
        new_part._length_cache = 1
        new_part._width_cache = old_part._width_cache
dtypes = pre_dtypes.to_pandas().iloc[0]
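Or, a bit more compactly over the partition grid with np.ndenumerate (same context and assumptions as the snippet above, i.e. self is the frame object and the partition grids are NumPy object arrays; just a sketch):
import numpy as np

pre_dtypes = self.tree_reduce(0, lambda df: df.dtypes, dtype_builder)
for (i, j), new_part in np.ndenumerate(pre_dtypes._partitions):
    # The axis-0 reduction collapses each column of partitions into a single
    # one-row partition, so the length is 1 and the width is unchanged.
    new_part._length_cache = 1
    new_part._width_cache = self._partitions[i, j]._width_cache
dtypes = pre_dtypes.to_pandas().iloc[0]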
This edit doesn't fix the slow _compute_dtypes calls. Is there a more idiomatic way of doing this setting?
> This edit doesn't fix the slow _compute_dtypes calls. Is there a more idiomatic way of doing this setting?
This suggests to me that calculating widths is not itself the problem. Anyway, the to_pandas() will require every partition in pre_dtypes to finish its computation.
I was recently debugging a case of extremely slow performance in _compute_dtypes for a frame of shape 1 by 4 million. It turned out that computing dtypes 4 million times in concatenate was the bottleneck. Would that fix by any chance work for you:
https://github.com/mvashishtha/modin/commit/46cce7c021eea2e4abd20e2ca4dcd18cecc0fcb3
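For a sense of scale, here is a standalone pandas microbenchmark (not the commit itself, and the frame below is a toy): each .dtypes access builds a fresh Series, so even a few tens of microseconds per call turns into minutes at 4 million calls.
import time

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(3)})

n = 100_000
start = time.perf_counter()
for _ in range(n):
    _ = df.dtypes  # each access constructs a new Series
elapsed = time.perf_counter() - start
per_call = elapsed / n
print(f"{n} .dtypes calls: {elapsed:.2f}s (~{per_call * 1e6:.0f} us each, "
      f"~{per_call * 4_000_000 / 60:.1f} min at 4 million calls)")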
> This suggests to me that calculating widths is not itself the problem.
Agreed. However, I'm finding that 7-8% of my runtime is in _row_lengths (which tentatively appears to come via _filter_empties), and I'm trying to narrow that down wherever possible. (xref #4785, #4738, #4754)
> Would that fix by any chance work for you:
No difference.
> However, I'm finding that 7-8% of my runtime is in _row_lengths (which tentatively appears to come via _filter_empties)
In my experience _filter_empties tends to block non-shape-preserving operations that are going to be expensive anyway. You might try syncing past adb16a17f721048005520388080627975c6852d8, which stops _filter_empties from blocking. But again, to_pandas() here will block on all partitions.
Looks like the relevant cases in _compute_dtypes are where self._partitions.shape[0] > 1. More specifically, I'm seeing cases with partition shapes (1, 13) and (1, 4) compute in a few hundredths of a second, while a case with partition shape (16, 1) takes 4 seconds.
Moreover, the extra time appears to be in the pre-call portion of run_f_on_minimally_updated_metadata. Maybe we can avoid some of that work in this case?
If I disable the run_f_on_minimally_updated_metadata portion of _compute_dtypes (specifically, by not going through tree_reduce), the hot spot moves to Partition.to_numpy.
If I use BenchmarkMode.put(True), the time taken by _compute_dtypes is cut by 2-3x. At best it still constitutes a way-too-big 6% of runtime.
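For context, enabling that mode looks roughly like this (minimal sketch; the toy frame is just for illustration, and the BenchmarkMode config lives in modin.config):
import modin.config as cfg
import modin.pandas as pd

# Make each Modin operation wait for its partitions to finish, so profiler
# time is attributed to the operation that did the work rather than to
# whatever later call happens to block on it.
cfg.BenchmarkMode.put(True)

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
print(df.dtypes)  # with benchmark mode on, the work behind this completes synchronously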
> Moreover, the extra time appears to be in the pre-call portion of run_f_on_minimally_updated_metadata.
This is very surprising. The expensive part of that function call should be _propagate_index_objs, which only does add_to_apply_calls, and that should be cheap because it just adds to the partitions' call queues without even triggering computation: https://github.com/modin-project/modin/blob/aeff7ac70162b53f8c6178f048498c2ee9bd631b/modin/core/dataframe/pandas/dataframe/dataframe.py#L567
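A toy model of why that should be cheap (a simplified standalone sketch, not Modin's actual Partition class): add_to_apply_calls only appends to a per-partition queue, and nothing executes until the partition is materialized.
from typing import Callable, List

import pandas as pd

class LazyPartitionSketch:
    """Toy stand-in for a partition with a deferred call queue."""

    def __init__(self, data: pd.DataFrame):
        self._data = data
        self._call_queue: List[Callable] = []

    def add_to_apply_calls(self, func: Callable) -> "LazyPartitionSketch":
        # Cheap: just remember the function; nothing is computed here.
        self._call_queue.append(func)
        return self

    def materialize(self) -> pd.DataFrame:
        # The expensive part happens only when the data is actually needed.
        result = self._data
        for func in self._call_queue:
            result = func(result)
        self._call_queue = []
        return result

part = LazyPartitionSketch(pd.DataFrame({"a": [1, 2, 3]}))
part.add_to_apply_calls(lambda df: df.rename(columns={"a": "x"}))  # cheap
print(part.materialize().columns.tolist())  # ['x'] -- work happens here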
_propagate_index_objs calls _filter_empties internally, and if the dataframe doesn't have row_lengths and column_widths caches, computations on the partitions are triggered. That might be the case @jbrockmendel is experiencing.
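In toy form (again a standalone sketch, not the real implementation, with partitions modeled as zero-argument callables): filtering out empty partitions needs every row partition's length, so when the length cache is missing each partition must be computed just to learn its shape.
import numpy as np
import pandas as pd

def filter_empties_sketch(partitions, row_lengths=None):
    """Drop all-empty partition rows; block to measure lengths if uncached."""
    if row_lengths is None:
        # No cache: each partition in the first column has to be computed
        # just to learn its length -- this is the blocking described above.
        row_lengths = [part().shape[0] for part in partitions[:, 0]]
    keep = [i for i, length in enumerate(row_lengths) if length > 0]
    return partitions[keep, :]

parts = np.array(
    [[lambda: pd.DataFrame({"a": [1, 2]})], [lambda: pd.DataFrame({"a": []})]],
    dtype=object,
)
print(filter_empties_sketch(parts).shape)          # (1, 1): empty row dropped
print(filter_empties_sketch(parts, [2, 0]).shape)  # cached lengths: no compute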
We do preserve lengths and widths in _compute_dtypes since the tree_reduce pattern is used there, so I am closing this issue. Follow https://github.com/modin-project/modin/issues/2751 for potential optimizations in _compute_dtypes.