jbrockmendel
Looking at this, isn't The Right Way to handle this to copy pandas' _MergeOperation code and adapt it?
> Looking at this, isn't The Right Way to handle this to copy pandas' _MergeOperation code and adapt it?

I think so, I'm just not familiar with this code. will...
Is there a standard pattern for "do X on each partition and collect the new partitions as a new series"?
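The pattern being asked about could be sketched generically like this; `map_partitions` is a hypothetical helper (not Modin's actual internal API), using stdlib futures as a stand-in for whatever execution engine distributes the work:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def map_partitions(partitions, func):
    """Hypothetical helper: apply func to each partition and
    collect the results as the new set of partitions."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(func, partitions))

# two toy "partitions" standing in for a partitioned series
parts = [np.arange(3), np.arange(3, 6)]
new_parts = map_partitions(parts, lambda p: p * 2)
```

The real implementation would presumably route through the query compiler rather than calling partitions directly, but the shape of the operation is the same: one function applied per partition, results gathered into a new object.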
> Is the default_to_pandas change unrelated to the cached_property change? If so, I think the changes should get separate issues and PRs, even though each one is very small.

Sure.
Cool. Are there any other methods that we should get while we're at it? Are there any scenarios in which these don't make copies? If so, then "take" might not...
> So I guess I would count the results as _not_ copies.

I think I worded the question poorly. I meant "copy" as in "copying the underlying data, which can...
> A typical loc or iloc will create a new pandas API object (series or dataframe), query compiler, modin frame, and partitions, which will ultimately get new references to new...
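On the "does this copy the underlying data" question, one quick way to check in plain pandas is `np.shares_memory` on the backing arrays. A small sketch (pure pandas/numpy, not Modin):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5, dtype=np.int64)})

# take materializes a new array: no shared memory with the original
taken = df.take([0, 1, 2])
take_shares = np.shares_memory(df["a"].to_numpy(), taken["a"].to_numpy())

# a plain iloc slice keeps a reference to the original block
# (under Copy-on-Write it still shares memory until first mutation)
sliced = df.iloc[:3]
slice_shares = np.shares_memory(df["a"].to_numpy(), sliced["a"].to_numpy())
```

So even when the API objects, query compiler, and frame wrappers are all new, the underlying buffers may or may not be: `take` copies the data, while the slice only re-references it.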
> This suggests to me that calculating widths is not itself the problem.

Agreed. However, I'm finding that 7-8% of my runtime is in _row_lengths (tentatively appears to be via...
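For pinning down a hot spot like this, the stdlib profiler is enough to get a per-function breakdown. A minimal sketch; `_row_lengths_stand_in` is a toy stand-in, not the actual Modin call:

```python
import cProfile
import io
import pstats

def _row_lengths_stand_in():
    # stand-in for the expensive metadata computation
    return sum(len(range(1000)) for _ in range(100))

def workload():
    for _ in range(50):
        _row_lengths_stand_in()

pr = cProfile.Profile()
pr.enable()
workload()
pr.disable()

buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("cumtime").print_stats(5)
report = buf.getvalue()
```

Sorting by cumulative time makes it easy to see what fraction of the run a single metadata helper accounts for.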
Looks like the relevant cases in compute_dtypes are where `self._partitions.shape[1] > 1`. More specifically, I'm seeing cases with partition shapes (1, 13) and (1, 4) compute in a few hundredths...
If I disable the `run_f_on_minimally_updated_metadata` portion of _compute_dtypes (specifically, by not going through tree_reduce), the hot spot moves to Partition.to_numpy. If I use BenchmarkMode.put(True), the time taken by compute_dtypes is...
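The BenchmarkMode observation is consistent with how lazy execution shifts where time shows up: with async partitions, the timer around the submitting call barely moves, and the cost surfaces at the first blocking access instead. A generic illustration with stdlib futures (not Modin code):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_partition_op():
    time.sleep(0.2)  # stand-in for real per-partition work
    return 42

pool = ThreadPoolExecutor(max_workers=1)

t0 = time.perf_counter()
fut = pool.submit(slow_partition_op)   # returns almost immediately
submit_elapsed = time.perf_counter() - t0

t0 = time.perf_counter()
result = fut.result()                  # the real cost surfaces here
wait_elapsed = time.perf_counter() - t0

pool.shutdown()
```

Forcing synchronous execution (what BenchmarkMode does) collapses these two measurements into one, so profiles attribute time to the operation that actually incurred it rather than to whichever later call happened to block first.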