
Update to use pandas 2.x

Open jpn-- opened this issue 1 year ago • 1 comments

Addresses #794.

The update from pandas 1.x to 2.x introduces a number of small but material changes that affect ActivitySim:

  • Numeric Index objects are now a single Index class distinguished by dtype, rather than separate classes (e.g. there is no more Int64Index class).
  • The read_csv function by default now interprets "None" as a missing value (i.e. NaN) instead of keeping it as the string "None".
  • The groupby operation, when applied to categorical data, now sorts the categories in the result unless told not to (resulting in different order of rows in outputs for some operations).
  • A simple df.join() also potentially sorts the resulting rows differently unless an explicit sort argument is given.
  • Index objects no longer have an is_monotonic attribute; checks must use is_monotonic_increasing instead.
  • The handling of dtypes appears to have improved in some instances: some operations that used to promote dtypes no longer do so (e.g. variables that are originally int16 used to become int64 after some operations, and now they stay int16).
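
A few of the changes above can be handled with idioms that should behave the same under pandas 1.x and 2.x. The following is a minimal sketch, not the code from this pull request; the variable names and the toy CSV are made up for illustration.

```python
import io

import pandas as pd

# Index type checks: inspect the dtype instead of testing for Int64Index,
# which no longer exists in pandas 2.x.
idx = pd.Index([1, 2, 3])
is_int_index = pd.api.types.is_integer_dtype(idx.dtype)

# Monotonicity: is_monotonic_increasing is available in both 1.x and 2.x.
is_sorted = idx.is_monotonic_increasing

# read_csv: turn off the default NA strings so the literal text "None"
# survives as a string, and treat only empty fields as missing.
csv_text = "name,mode\na,None\nb,WALK\n"  # hypothetical input
df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])

# groupby on categorical data: pass sort and observed explicitly rather
# than relying on version-specific defaults.
trips = pd.DataFrame(
    {
        "purpose": pd.Categorical(["work", "shop", "work"],
                                  categories=["shop", "work", "school"]),
        "count": [1, 2, 3],
    }
)
totals = trips.groupby("purpose", sort=False, observed=True)["count"].sum()
```

Spelling out keep_default_na, sort, and observed is more verbose, but it removes the dependence on defaults that differ between versions.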

jpn-- avatar Mar 25 '24 03:03 jpn--

While I've made these updates and all the regular CI tests pass (i.e. the results look correct), I have discovered that the change to pandas 2.x incurs a significant runtime penalty when running without sharrow: the tests below take roughly 2.4 to 2.8 times as long.

non-sharrow test timings for pandas 1.x:

58.60s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_mp
53.71s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc
53.66s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_chunkless
53.23s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_recode

non-sharrow test timings for pandas 2.x:

148.50s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc
148.14s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_chunkless
147.83s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_recode
140.09s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_mp

It will require some research to figure out why this is happening, and whether it can be solved relatively easily... or at all. Initial profiling suggests the problem is in pandas.core.internals.managers.BlockManager.get_dtypes, which is getting called from df.eval, but we almost certainly do not want to mess around with pandas internals.
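
For anyone who wants to reproduce this kind of profile outside of ActivitySim, here is a minimal sketch using cProfile on repeated df.eval calls. The frame shape, column names, and expression are arbitrary stand-ins and not taken from the model, so whether get_dtypes dominates the report will depend on the pandas version and the data.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

# Hypothetical stand-in for a wide table, as used in expression evaluation.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 50)),
                  columns=[f"c{i}" for i in range(50)])

profiler = cProfile.Profile()
profiler.enable()
for _ in range(20):
    # Repeated eval calls make per-call overhead visible in the profile.
    result = df.eval("c0 + c1 * c2")
profiler.disable()

# Collect the 15 most expensive entries by cumulative time into a string.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(15)
report = buf.getvalue()
print(report)
```

Running the same script in a pandas 1.x and a pandas 2.x environment and comparing the two reports should show which internal calls account for the difference.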

jpn-- avatar Apr 03 '24 23:04 jpn--