activitysim
activitysim copied to clipboard
Pandas 2 `assign_in_place()` fix with categoricals
ActivitySim models often call the activitysim.core.util.assign_in_place(df, df2, ...) method to update existing values in df using values from df2, or to add new columns from df2. When performing the former, it calls df.update(df2) to update common columns in place.
In Pandas 2.x, df.update(df2) will raise a TypeError if the common column C in df is of categorical dtype and df2['C'] contains category value(s) that are not present in df['C']'s categories.
This issue was encountered by SANDAG in the trip preprocessor of their airport model, see discussion #946 . None of the existing ActivitySim example models exhibit this issue.
The solution in this PR first makes sure df2['C'] is a categorical type, and then unions the two categoricals before calling df.update(df2).
Has this been tested? Can we write a test for it? It looks like it should work, but it also may have unexpected consequences (e.g. from reordering categories). I see the current tests are passing, but they were also passing without this so it's clear our current tests are not exercising the problem being solved.
@i-am-sijia, I just also saw your comment in https://github.com/ActivitySim/activitysim/discussions/946#discussioncomment-13338181 . Can we check if bringing "sort_categories=True" to the other existing usage[s] of union_categoricals will fail any of our existing tests?
@jpn-- I added sort_categories=True to all union_categoricals(). But now I am having second thoughts. Most categorical variables are created with choice alternatives being the pre-defined categories, using the same sort sequence of how the alternatives are defined. Categoricals may not be in the alphabetically order, but they are sorted and include all possible values. This makes categoricals stable under multiprocessing.
https://github.com/ActivitySim/activitysim/blob/ba4a12e85b3e4e794124bf39d4c2d006063462b1/activitysim/abm/models/mandatory_tour_frequency.py#L137-L141
We need union_categoricals() when new category values are added. You are right that under multiprocessing, the sequence of how the new categories are being appended may be unstable. But if we set sort_categories=True during union, it will also change the original sequence of the old category values.
- Should we sort all categoricals alphabetically at creation, not just when union is happening? Otherwise you are changing the sequence when union. So that they will be consistent through out.
- With
sort_categories=True, whenever a new category is append, it re-sorts everything instead of appending, which also impacts the stability. Would this be a problem for Sharrow, i.e., triggers re-compilation? - Side note - In general, I think we do not want to set
ordered=True, unless we want to write expressions directly compare categories. I reverted the ordered mode categories. [1] (I don't recall why I specifically made it ordered, I left a note in there in case it causes us trouble in the future)
In terms of unit tests. I was thinking we can start with the following:
- Concatenating two categorical columns with different categories
- Overwriting a categorical column with a non-categorical column
- Overwriting a categorical column with a categorical column with different categories
Maybe we are getting too far ahead of ourselves.
I agree that most categoricals are fully enumerated at creation time, in some logical order that may be meaningful to modelers (e.g. driving, then transit, then nonmotorized). And, most categoricals in ActivitySim are logically unordered.
For sharrow, any change in a categorical data type (adding categories, removing unused categories, reordering them, making them "ordered" when they were previously just nominal), literally any change at all will result in recompiling. As long as we get stable sort ordering most of the time, we can survive with corner cases that don't have stable order (e.g. the sample is too small and some unusual tour type doesn't always get included).
So, maybe we just merge this to fix the bug encountered as reported at the top, and leave a more complete solution to efficiently handling categoricals until ActivitySim 2.x?