Extreme performance issue in pandas 1.0.3 when setting a new column with DatetimeIndex
When adding a column to a DataFrame with one level having a DateTime-like dtype, the dtype of the values to be added is explicitly casted to object type in multi.py if the indexes of the values to be setted and the frames index are not identical in pandas version 1.0.3. Those object typed values are beiing transformed to Timestamps later on. This consumes a lot of time for big dataframes.
Comparing Pandas version 0.22.0 and 1.0.3 yields 0.124 seconds vs. 35.274 seconds on my machine on following reproducable setup:
build reproducable setup
iterables = [range(10000), pd.date_range('2020-01-01', periods=200)] idx = pd.MultiIndex.from_product(iterables, names=['id', 'date']) df = pd.DataFrame(data=np.random.randn(10000 * 200), index=idx, columns=["value"]) new_col = df[df.index.get_level_values(1) != pd.to_datetime('2020-01-01')] # drop first record of each id print(df.shape, new_col.shape)
profile performance of set_item
import cProfile pr = cProfile.Profile() pr.enable() df['new_col'] = new_col['value'] pr.disable() pr.print_stats(sort=2)
Hey, thanks for your report.
Seems to persist on master. Took me 27 seconds to run this.
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 27.246 27.246 frame.py:2965(__setitem__)
1 0.000 0.000 27.246 27.246 frame.py:3046(_set_item)
1 0.000 0.000 27.245 27.245 frame.py:3636(_sanitize_column)
1 0.000 0.000 27.245 27.245 frame.py:3656(reindexer)
1 0.000 0.000 27.245 27.245 series.py:4286(reindex)
1 0.002 0.002 27.245 27.245 generic.py:4236(reindex)
1 0.000 0.000 27.243 27.243 generic.py:4481(_reindex_axes)
1 0.000 0.000 27.239 27.239 multi.py:2272(reindex)
1 0.000 0.000 27.073 27.073 multi.py:2429(get_indexer)
1 2.044 2.044 27.073 27.073 {method 'get_indexer' of 'pandas._libs.index.BaseMultiIndexCodesEngine' objects}
1 0.000 0.000 22.136 22.136 base.py:1034(__iter__)
3 0.017 0.006 22.136 7.379 multi.py:640(_values)
3 0.000 0.000 22.120 7.373 multi.py:645(values)
1 0.000 0.000 20.054 20.054 extension.py:269(astype)
1 0.000 0.000 20.054 20.054 datetimes.py:577(astype)
1 0.000 0.000 20.054 20.054 datetimelike.py:613(astype)
1 0.000 0.000 20.054 20.054 datetimelike.py:473(_box_values)
1 2.044 2.044 20.054 20.054 {pandas._libs.lib.map_infer}
2000000 14.756 0.000 18.009 0.000 datetimes.py:474(<lambda>)
2 0.000 0.000 2.884 1.442 base.py:2938(get_indexer)
5 0.024 0.005 2.867 0.573 base.py:5533(ensure_index)
11/5 0.000 0.000 2.844 0.569 base.py:283(__new__)
2000001 1.610 0.000 2.592 0.000 datetimes.py:498(tz)
3 0.000 0.000 2.429 0.810 base.py:5700(_maybe_cast_data_without_dtype)
1 2.047 2.047 2.047 2.047 {pandas._libs.lib.fast_zip}
6 1.323 0.221 1.323 0.221 {pandas._libs.lib.infer_dtype}
2 0.000 0.000 1.272 0.636 datetimes.py:297(_from_sequence)
2 0.000 0.000 1.272 0.636 datetimes.py:1873(sequence_to_dt64ns)
1 0.000 0.000 0.729 0.729 datetimes.py:2003(objects_to_datetime64ns)
1 0.729 0.729 0.729 0.729 {built-in method pandas._libs.tslib.array_to_datetime}
Rest is omitted
there looks to be some unwanted conversions going on here. welcome for folks to have a look.
you can work around this for now by:
In [40]: %time pd.merge(df.reset_index(), new_col.reset_index(), on=['id', 'date'], how='outer').set_index(['id', 'date'])
CPU times: user 2.37 s, sys: 97.9 ms, total: 2.47 s
Wall time: 658 ms
Out[40]:
value_x value_y
id date
0 2020-01-01 0.295513 NaN
2020-01-02 0.274984 0.274984
2020-01-03 -0.392445 -0.392445
2020-01-04 -1.443801 -1.443801
2020-01-05 0.897258 0.897258
... ... ...
9999 2020-07-14 0.614013 0.614013
2020-07-15 0.209266 0.209266
2020-07-16 0.042773 0.042773
2020-07-17 -0.657113 -0.657113
2020-07-18 -1.629856 -1.629856
[2000000 rows x 2 columns]
Also related to https://github.com/pandas-dev/pandas/issues/23735
@jreback
is there any way to get around this for reindex as well?
Thanks in advance!
EDIT: I realized that the same trick could be applied to reindex operation:
pd.merge(df.to_frame(names=["id", "date"]), new_col.reset_index(), on=['id', 'date'], how='left').set_index(['id', 'date'])
@qiuwei it would be helpful to look into actually what is happening and propose a patch
This issue doesn't look to persist on main anymore. Could use an asv benchmark
I'd like to take a stab at making an asv benchmark. Sometime this week.
First, does this issue still need work? It looks pretty old.
I've never contributed to an open source project before (besides fixing typos in documentation), but I do have 1 or 2 years of Python experience and I've read the "Contributing to pandas" page on the website. Is there anything I still need to do before I assign this to myself and start working?
Hi, can i contribute to this issue if it's still open?