Duplicate tour and trip ids lead to failure in write_trip_matrices step
Describe the bug
While testing the newest version of ActivitySim (1.4.0) with the MWCOG model, we encountered a crash in our write_trip_matrices step due to duplicate tour ids. The error message is as follows:
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
A possible source of this issue is that the person_id column, i.e., the index of the persons table, is now a 32-bit integer rather than a 64-bit integer. When generating the tour_id (see canonical_ids.py:line414), the person_id is multiplied by the possible_tours_count variable, which, in our case, is around 40. Since the person_id column is a 32-bit integer, the multiplication can overflow, because the result is not implicitly promoted to a wider type. This leads to negative tour ids, which, though unique at this point, may lead to non-unique values later down the line (tour and trip ids) when used in different models or when new tour ids are generated.
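The overflow described above can be reproduced in isolation. This is a minimal sketch, not the code from canonical_ids.py: the person_id values are hypothetical, possible_tours_count is taken as a plain Python integer of 40, and the multiplication is assumed to behave like ordinary NumPy array arithmetic.

```python
import numpy as np

# Assumption: roughly the value reported above, held as a plain Python int.
possible_tours_count = 40

# Hypothetical person ids; the second is large enough that id * 40
# exceeds the int32 maximum of 2_147_483_647.
person_id = np.array([1_000, 60_000_000], dtype=np.int32)

# The product keeps the int32 dtype and wraps around silently.
tour_id_32 = person_id * possible_tours_count

# Casting to int64 before multiplying avoids the wrap-around.
tour_id_64 = person_id.astype(np.int64) * possible_tours_count

print(tour_id_32.dtype, tour_id_32)
print(tour_id_64.dtype, tour_id_64)
```

With these inputs the second int32 product comes out negative, while the int64 product is the expected 2,400,000,000.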
We confirmed that this might be the issue by explicitly casting the person_id column to a 64-bit integer in canonical_ids.py:line414. This led to the desired result, with unique, non-negative tour and trip ids, and ActivitySim did not crash.
Expected behavior
When generating a tour id, a unique, non-negative id should be generated and assigned accordingly, irrespective of how large the person id value may be or what its data type is.
Additional context
- Version 1.4.0 was initially tested without making any configuration changes to our ActivitySim 1.3.4 model setup.
- As a potential fix, we tried setting the person id column explicitly to be a 64-bit integer when loading in the persons table, but that did not result in the desired outcome. Whenever the person_id column was extracted and used for another table, it was cast back to a 32-bit integer.
Hi @Syonv, this issue should be resolved by a fix we made in PR #956. [1]
I haven't tested the specific line you are pointing to, but I'm 99% sure this issue arises because Pandas 2.x got rid of Int64Index, so it no longer converts an int32 index column into int64. The relevant code that leads to, e.g., person_id being int32 is:
https://github.com/ActivitySim/activitysim/blob/022775bec56c9ce8a39cb8c1da65a96ae9d99a22/activitysim/core/input.py#L197-L198
On Windows machines, .astype(int) on Line 197 returns int32. This is because NumPy (before version 2.0) maps int to the platform's C long, which is 32-bit on Windows and 64-bit on Linux and macOS. In Pandas 1.x, set_index() on Line 198 then converts the int32 input to an Int64Index, so the resulting index becomes int64. Pandas 2.x, however, removed the Int64Index class, so set_index() on Line 198 keeps the input dtype for the index, which is still int32.
The solution we implemented in #956 is to explicitly call astype(np.int64) on Line 197. I'm happy to test the line you pointed to for tour_id and confirm (stay tuned).
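To illustrate the idea, here is a minimal sketch of the pattern, not the actual code in input.py. The DataFrame and its column values are hypothetical; only astype(np.int64) followed by set_index() reflects the change described above.

```python
import numpy as np
import pandas as pd

# Hypothetical persons table, with ids read in as strings.
persons = pd.DataFrame({"person_id": ["1", "2", "3"], "age": [30, 41, 12]})

# Platform-dependent variant: astype(int) may give int32 on Windows
# with NumPy 1.x, and Pandas 2.x keeps that dtype for the index.
platform_index = persons["person_id"].astype(int)
print(platform_index.dtype)  # varies by platform and NumPy version

# Explicit variant: request 64 bits regardless of platform.
persons["person_id"] = persons["person_id"].astype(np.int64)
persons = persons.set_index("person_id")

print(persons.index.dtype)
```

Requesting np.int64 explicitly makes the index width independent of the operating system and of the Pandas version in use.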
Update
@Syonv I tested the MTC example and can confirm that, with the code change I mentioned above, person_id is int64 in canonical_ids.py:line414. I just saw that you mentioned you tried setting person_id to int64 but it did not work. Did you try it the same way I did?
Hi @i-am-sijia, thanks for looking into this!
No, I had not tried the method you used. Great to know that a fix has already been implemented!
@Syonv sure, thank you for reporting it! I would suggest keeping this issue open until the PR is merged and ActivitySim gets a new release, just in case others may run into the same crash as you did!