pandas icon indicating copy to clipboard operation
pandas copied to clipboard

Extreme performance issue in pandas 1.0.3 when setting a new column with DatetimeIndex

Open derHeinzer opened this issue 5 years ago • 10 comments

When adding a column to a DataFrame with one level having a DateTime-like dtype, the dtype of the values to be added is explicitly casted to object type in multi.py if the indexes of the values to be setted and the frames index are not identical in pandas version 1.0.3. Those object typed values are beiing transformed to Timestamps later on. This consumes a lot of time for big dataframes.

Comparing Pandas version 0.22.0 and 1.0.3 yields 0.124 seconds vs. 35.274 seconds on my machine on following reproducable setup:

build reproducable setup

iterables = [range(10000), pd.date_range('2020-01-01', periods=200)] idx = pd.MultiIndex.from_product(iterables, names=['id', 'date']) df = pd.DataFrame(data=np.random.randn(10000 * 200), index=idx, columns=["value"]) new_col = df[df.index.get_level_values(1) != pd.to_datetime('2020-01-01')] # drop first record of each id print(df.shape, new_col.shape)

profile performance of set_item

import cProfile pr = cProfile.Profile() pr.enable() df['new_col'] = new_col['value'] pr.disable() pr.print_stats(sort=2)

derHeinzer avatar Jun 02 '20 14:06 derHeinzer

Hey, thanks for your report.

Seems to persist on master. Took me 27 seconds to run this.

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   27.246   27.246 frame.py:2965(__setitem__)
        1    0.000    0.000   27.246   27.246 frame.py:3046(_set_item)
        1    0.000    0.000   27.245   27.245 frame.py:3636(_sanitize_column)
        1    0.000    0.000   27.245   27.245 frame.py:3656(reindexer)
        1    0.000    0.000   27.245   27.245 series.py:4286(reindex)
        1    0.002    0.002   27.245   27.245 generic.py:4236(reindex)
        1    0.000    0.000   27.243   27.243 generic.py:4481(_reindex_axes)
        1    0.000    0.000   27.239   27.239 multi.py:2272(reindex)
        1    0.000    0.000   27.073   27.073 multi.py:2429(get_indexer)
        1    2.044    2.044   27.073   27.073 {method 'get_indexer' of 'pandas._libs.index.BaseMultiIndexCodesEngine' objects}
        1    0.000    0.000   22.136   22.136 base.py:1034(__iter__)
        3    0.017    0.006   22.136    7.379 multi.py:640(_values)
        3    0.000    0.000   22.120    7.373 multi.py:645(values)
        1    0.000    0.000   20.054   20.054 extension.py:269(astype)
        1    0.000    0.000   20.054   20.054 datetimes.py:577(astype)
        1    0.000    0.000   20.054   20.054 datetimelike.py:613(astype)
        1    0.000    0.000   20.054   20.054 datetimelike.py:473(_box_values)
        1    2.044    2.044   20.054   20.054 {pandas._libs.lib.map_infer}
  2000000   14.756    0.000   18.009    0.000 datetimes.py:474(<lambda>)
        2    0.000    0.000    2.884    1.442 base.py:2938(get_indexer)
        5    0.024    0.005    2.867    0.573 base.py:5533(ensure_index)
     11/5    0.000    0.000    2.844    0.569 base.py:283(__new__)
  2000001    1.610    0.000    2.592    0.000 datetimes.py:498(tz)
        3    0.000    0.000    2.429    0.810 base.py:5700(_maybe_cast_data_without_dtype)
        1    2.047    2.047    2.047    2.047 {pandas._libs.lib.fast_zip}
        6    1.323    0.221    1.323    0.221 {pandas._libs.lib.infer_dtype}
        2    0.000    0.000    1.272    0.636 datetimes.py:297(_from_sequence)
        2    0.000    0.000    1.272    0.636 datetimes.py:1873(sequence_to_dt64ns)
        1    0.000    0.000    0.729    0.729 datetimes.py:2003(objects_to_datetime64ns)
        1    0.729    0.729    0.729    0.729 {built-in method pandas._libs.tslib.array_to_datetime}

Rest is omitted

phofl avatar Jun 02 '20 16:06 phofl

there looks to be some unwanted conversions going on here. welcome for folks to have a look.

jreback avatar Jun 02 '20 23:06 jreback

you can work around this for now by:

In [40]: %time pd.merge(df.reset_index(), new_col.reset_index(), on=['id', 'date'], how='outer').set_index(['id', 'date'])                                 
CPU times: user 2.37 s, sys: 97.9 ms, total: 2.47 s
Wall time: 658 ms
Out[40]: 
                  value_x   value_y
id   date                          
0    2020-01-01  0.295513       NaN
     2020-01-02  0.274984  0.274984
     2020-01-03 -0.392445 -0.392445
     2020-01-04 -1.443801 -1.443801
     2020-01-05  0.897258  0.897258
...                   ...       ...
9999 2020-07-14  0.614013  0.614013
     2020-07-15  0.209266  0.209266
     2020-07-16  0.042773  0.042773
     2020-07-17 -0.657113 -0.657113
     2020-07-18 -1.629856 -1.629856

[2000000 rows x 2 columns]

jreback avatar Jun 02 '20 23:06 jreback

Also related to https://github.com/pandas-dev/pandas/issues/23735

qiuwei avatar Apr 20 '21 07:04 qiuwei

@jreback is there any way to get around this for reindex as well? Thanks in advance!

EDIT: I realized that the same trick could be applied to reindex operation:

pd.merge(df.to_frame(names=["id", "date"]), new_col.reset_index(), on=['id', 'date'], how='left').set_index(['id', 'date'])

qiuwei avatar Apr 20 '21 07:04 qiuwei

@qiuwei it would be helpful to look into actually what is happening and propose a patch

jreback avatar Apr 20 '21 10:04 jreback

This issue doesn't look to persist on main anymore. Could use an asv benchmark

mroeschke avatar Feb 02 '24 20:02 mroeschke

I'd like to take a stab at making an asv benchmark. Sometime this week.

dshemetov avatar May 06 '24 18:05 dshemetov

First, does this issue still need work? It looks pretty old.

I've never contributed to an open source project before (besides fixing typos in documentation), but I do have 1 or 2 years of Python experience and I've read the "Contributing to pandas" page on the website. Is there anything I still need to do before I assign this to myself and start working?

sumwun1 avatar May 31 '24 22:05 sumwun1

Hi, can i contribute to this issue if it's still open?

shriyase avatar Jun 21 '24 20:06 shriyase