activitysim
Update to use pandas 2.x
Addresses #794.
The update from pandas 1.x to 2.x introduces a number of small but material changes that affect ActivitySim:
- DataFrame `Index` objects are all one class with different datatypes, instead of being different classes (e.g. there is no more `Int64Index` class).
- The `read_csv` function by default now interprets "None" as a missing value (i.e. NaN) instead of reading it in as the literal string "None".
- The `groupby` operation, when applied to categorical data, now sorts the categories in the result unless told not to (resulting in a different order of rows in outputs for some operations).
- A simple `df.join()` also potentially sorts the resulting rows differently unless an explicit `sort` argument is given.
- `Index` objects can no longer be checked with `is_monotonic`; they need `is_monotonic_increasing` instead.
- The handling of dtypes appears to have improved in some instances: some operations that used to promote dtypes no longer do (e.g. variables that are originally int16 used to become int64 after some operations, and now they don't).
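For reference, here is a minimal sketch of how code can be written so that it behaves the same under pandas 1.x and 2.x for several of the changes listed above. The table and column names (`persons`, `tour_mode`, `trips`, `households`, `zones`) are made up for illustration and are not taken from the ActivitySim code base; the pandas calls themselves are standard public API.

```python
import io

import pandas as pd

# Index type checks: inspect the dtype rather than the Index subclass,
# since pandas 2.x no longer has an Int64Index class.
idx = pd.Index([1, 2, 3])
index_is_integer = pd.api.types.is_integer_dtype(idx.dtype)

# Monotonicity: this property exists in both pandas 1.x and 2.x.
index_is_sorted = idx.is_monotonic_increasing

# read_csv: opt out of the default missing-value markers so the literal
# text "None" is kept as a string; only empty fields become NaN here.
csv_text = "person_id,tour_mode\n1,None\n2,WALK\n3,\n"
persons = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,
    na_values=[""],
)

# groupby on categorical data: state `sort` and `observed` explicitly
# instead of relying on version-specific defaults.
trips = pd.DataFrame(
    {
        "mode": pd.Categorical(
            ["DRIVE", "WALK", "DRIVE"],
            categories=["WALK", "DRIVE", "TRANSIT"],
        ),
        "n_trips": [1, 2, 3],
    }
)
by_mode = trips.groupby("mode", observed=True, sort=False)["n_trips"].sum()

# join: pass `sort` explicitly; a left join with sort=False keeps the
# row order of the left-hand frame.
households = pd.DataFrame(
    {"income": [50, 70]}, index=pd.Index([20, 10], name="household_id")
)
zones = pd.DataFrame(
    {"zone": [3, 4]}, index=pd.Index([10, 20], name="household_id")
)
joined = households.join(zones, how="left", sort=False)
```

Spelling out `sort`, `observed`, and the missing-value markers makes the intended behavior part of the call, so the output no longer depends on which pandas defaults happen to be in effect.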
While I've made these updates and all the regular CI tests pass (i.e. the results look correct), I have discovered that the change to pandas 2.x incurs a significant runtime penalty when running without sharrow: in the timings below, the same tests run roughly 2.4 to 2.8 times slower.
non-sharrow test timings for pandas 1.x:
58.60s call activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_mp
53.71s call activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc
53.66s call activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_chunkless
53.23s call activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_recode
non-sharrow test timings for pandas 2.x:
148.50s call activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc
148.14s call activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_chunkless
147.83s call activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_recode
140.09s call activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_mp
It will require some research to figure out why this is happening, and whether it can be solved relatively easily... or at all. Initial profiling suggests the problem is in `pandas.core.internals.managers.BlockManager.get_dtypes`, which is getting called from `df.eval`, but we almost certainly do not want to mess around with pandas internals.
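A minimal sketch of how this kind of profiling can be reproduced in isolation, using only the standard library's `cProfile` and `pstats`. The DataFrame shape, the column names, and the expression are arbitrary assumptions chosen for illustration; they are not the ActivitySim expressions that were actually profiled, and the numbers will differ from a full model run.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

# Hypothetical wide table standing in for a table of model variables.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.random((1_000, 50)),
    columns=[f"c{i}" for i in range(50)],
)

# Profile repeated df.eval calls, as expression evaluation is suspected
# to be where the extra time is spent.
profiler = cProfile.Profile()
profiler.enable()
for _ in range(20):
    result = df.eval("c0 + c1 * c2")
profiler.disable()

# Collect the 15 most expensive entries by cumulative time as text.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(15)
report = buffer.getvalue()
print(report)
```

Running the same script under pandas 1.x and 2.x and comparing the two reports is one way to check whether `get_dtypes` accounts for the difference; passing `"get_dtypes"` to `print_stats` restricts the report to matching entries.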