
PERF: infer partition dimensions in _compute_dtypes

jbrockmendel opened this issue 3 years ago • 7 comments

Context: calls to _compute_dtypes are taking 31.2s apiece in the script I'm profiling, which makes zero sense.

Current code

dtypes = (
    self.tree_reduce(0, lambda df: df.dtypes, dtype_builder)
    .to_pandas()
    .iloc[0]
)

The tree_reduce call creates a DataFrame whose partitions have ObjectRef objects for the _length_cache and _width_cache, but IIUC it is pretty easy to infer their dimensions. Something like

pre_dtypes = self.tree_reduce(0, lambda df: df.dtypes, dtype_builder)
parts = pre_dtypes._partitions

# After reducing along axis 0 every result partition is one row tall, and
# each column partition keeps the width of the corresponding input
# partition, so both caches can be filled in without touching the ObjectRefs.
for i in range(parts.shape[0]):
    for j in range(parts.shape[1]):
        old_part = self._partitions[i, j]
        new_part = parts[i, j]
        new_part._length_cache = 1
        new_part._width_cache = old_part._width_cache

dtypes = pre_dtypes.to_pandas().iloc[0]

This edit doesn't fix the slow _compute_dtypes calls. Is there a more idiomatic way of doing this setting?

jbrockmendel avatar Aug 05 '22 17:08 jbrockmendel

> This edit doesn't fix the slow _compute_dtypes calls. Is there a more idiomatic way of doing this setting?

This suggests to me that calculating widths is not itself the problem. Anyway, the to_pandas() will require every partition in pre_dtypes to finish its computation.
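
To make that blocking point concrete, here is a minimal sketch of what a to_pandas()-style materialization has to do on the Ray engine; to_pandas_sketch and the object_ref attribute are illustrative names, not Modin's actual API:

import ray
import pandas as pd

def to_pandas_sketch(partitions):
    # ray.get blocks until every partition's ObjectRef resolves, so any
    # still-running reduction must finish here regardless of whether the
    # length/width caches were filled in eagerly.
    blocks = [[ray.get(part.object_ref) for part in row] for row in partitions]
    # Stitch the materialized blocks back into a single pandas DataFrame.
    return pd.concat([pd.concat(row, axis=1) for row in blocks], axis=0)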

I was recently debugging a case of extremely slow performance in _compute_dtypes for a frame with shape 1 by 4 million. It turned out that computing dtypes 4 million times in concatenate was the bottleneck. Would this fix by any chance work for you:

https://github.com/mvashishtha/modin/commit/46cce7c021eea2e4abd20e2ca4dcd18cecc0fcb3
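
For scale, here's a rough pure-pandas microbenchmark of the cost being described; it is not the linked fix itself, and the numbers vary by machine:

import pandas as pd
import timeit

df = pd.DataFrame({"a": range(10)})
# Each .dtypes access constructs a fresh Series; once is trivial, but
# repeated millions of times it adds up to seconds.
per_call = timeit.timeit(lambda: df.dtypes, number=10_000) / 10_000
print(f"one dtypes call: {per_call * 1e6:.1f} us; x4,000,000 ~= {per_call * 4e6:.0f} s")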

mvashishtha avatar Aug 05 '22 17:08 mvashishtha

> This suggests to me that calculating widths is not itself the problem.

Agreed. However I'm finding that 7-8% of my runtime is in _row_lengths (tentatively appears to be via _filter_empties) and I'm trying to narrow that down wherever possible. (xref #4785, #4738, #4754)

> Would this fix by any chance work for you:

No difference.

jbrockmendel avatar Aug 05 '22 17:08 jbrockmendel

> However I'm finding that 7-8% of my runtime is in _row_lengths (tentatively appears to be via _filter_empties)

In my experience _filter_empties tends to block non-shape-preserving operations that are going to be expensive anyway. You might try syncing past adb16a17f721048005520388080627975c6852d8, which stops _filter_empties from blocking. But again, to_pandas() here will block on all partitions.

mvashishtha avatar Aug 05 '22 17:08 mvashishtha

Looks like the relevant cases in compute_dtypes are where self._partitions.shape[0] > 1. More specifically, I'm seeing cases with partition shapes (1, 13) and (1, 4) compute in a few hundredths of a second, while a case with partition shape (16, 1) takes 4 seconds.

Moreover, the extra time appears to be in the pre-call portion of run_f_on_minimally_updated_metadata. Maybe we can avoid some of that work in this case?
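
For intuition on why (16, 1) is the expensive shape here, a sketch of the axis-0 map-reduce pattern in plain pandas; this is illustrative only, not Modin's tree_reduce implementation:

import pandas as pd

def map_reduce_axis0_sketch(partitions, map_fn, reduce_fn):
    # partitions: 2D list of pandas DataFrames laid out as a row-major grid.
    n_rows, n_cols = len(partitions), len(partitions[0])
    out = []
    for j in range(n_cols):
        # Map step: one cheap per-partition call (e.g. df.dtypes).
        mapped = [map_fn(partitions[i][j]) for i in range(n_rows)]
        # Reduce step: a (16, 1) grid folds 16 intermediate results for
        # its single column, while a (1, 13) grid has nothing to fold.
        out.append(reduce_fn(pd.concat(mapped, axis=1)))
    return pd.concat(out)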

jbrockmendel avatar Aug 05 '22 22:08 jbrockmendel

If I disable the run_f_on_minimally_updated_metadata portion of _compute_dtypes (specifically, by not going through tree_reduce), the hot spot moves to Partition.to_numpy.

If I use BenchmarkMode.put(True), the time taken by compute_dtypes is cut by 2-3x. At best it still constitutes a way-too-big 6% of runtime.
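
For reference, that flag lives in modin.config; enabling it makes operations execute synchronously, which changes where a profiler attributes time:

from modin.config import BenchmarkMode

# With benchmark mode on, each Modin operation blocks until its
# partitions finish, so cost shows up in the operation that incurs it
# rather than at a later synchronization point like to_pandas().
BenchmarkMode.put(True)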

jbrockmendel avatar Aug 08 '22 17:08 jbrockmendel

> Moreover, the extra time appears to be in the pre-call portion of run_f_on_minimally_updated_metadata.

This is very surprising. The only potentially expensive part of that function should be _propagate_index_objs, which just does add_to_apply_calls; that should be cheap because it only adds to the partition call queues without even triggering computation: https://github.com/modin-project/modin/blob/aeff7ac70162b53f8c6178f048498c2ee9bd631b/modin/core/dataframe/pandas/dataframe/dataframe.py#L567
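
A minimal sketch of the call-queue idea described here; LazyPartitionSketch is an illustrative stand-in, not Modin's actual partition class:

class LazyPartitionSketch:
    def __init__(self, data):
        self.data = data
        self.call_queue = []

    def add_to_apply_calls(self, func, *args, **kwargs):
        # Cheap: only records the pending work; nothing executes yet.
        self.call_queue.append((func, args, kwargs))
        return self

    def drain_call_queue(self):
        # The deferred work runs only when something forces materialization.
        for func, args, kwargs in self.call_queue:
            self.data = func(self.data, *args, **kwargs)
        self.call_queue = []
        return self.data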

mvashishtha avatar Aug 08 '22 17:08 mvashishtha

_propagate_index_objs calls _filter_empties internally, and if the dataframe doesn't have row_lengths and column_widths caches, computations on partitions are triggered. That might be the case @jbrockmendel is experiencing.
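
A hedged sketch of that failure mode; the attribute and helper names are illustrative, not Modin's exact implementation:

def filter_empties_sketch(frame):
    # With no cache, asking each partition for its length forces it to
    # materialize; this is the hidden synchronization that makes an
    # apparently cheap metadata pass expensive.
    if frame._row_lengths_cache is None:
        frame._row_lengths_cache = [
            part.length() for part in frame._partitions[:, 0]
        ]
    keep = [i for i, n in enumerate(frame._row_lengths_cache) if n > 0]
    frame._partitions = frame._partitions[keep, :]
    frame._row_lengths_cache = [frame._row_lengths_cache[i] for i in keep]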

YarShev avatar Aug 08 '22 18:08 YarShev

We do preserve lengths and widths in _compute_dtypes since the tree_reduce pattern is used there, so I'm closing this issue. Follow https://github.com/modin-project/modin/issues/2751 for potential optimizations in _compute_dtypes.

YarShev avatar Mar 20 '24 09:03 YarShev