Prototype Data Model with a Single Component under ActivitySim v1.x
Pandas 2 enforces stricter data type handling. For example, appending two categorical columns with slightly different categories now raises an error. In https://github.com/ActivitySim/activitysim/pull/948, we implemented a solution that unions categories when this occurs. This also sparked discussions around category sorting and ordering; see the PR for details.
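The categorical issue can be reproduced in a few lines. This is a minimal sketch (not the actual PR code) showing both the failure mode and the category-union fix via pandas' `union_categoricals`:

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two categorical columns with slightly different category sets,
# as might come from two model components.
a = pd.Categorical(["walk", "bike"], categories=["walk", "bike"])
b = pd.Categorical(["walk", "drive"], categories=["walk", "drive"])

# A naive concat silently falls back to object dtype because the
# category sets differ, losing the categorical dtype entirely.
naive = pd.concat([pd.Series(a), pd.Series(b)], ignore_index=True)
assert naive.dtype == object

# union_categoricals preserves the categorical dtype by taking the
# union of the two category sets (order: first's categories, then
# any new ones from the second).
merged = union_categoricals([a, b])
assert list(merged.categories) == ["walk", "bike", "drive"]
```

Note that the category ordering of the result depends on input order unless `sort_categories=True` is passed, which is the sorting/ordering question discussed in the PR.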
A more stable, long-term solution would be to define all expected values in a data model, such as one proposed in https://github.com/ActivitySim/activitysim/issues/617.
As raised in the 2025-07-24 Engineering Meeting, this issue proposes developing a prototype data model using a single model component to validate architecture, integration flow, and performance characteristics under the current ActivitySim framework (i.e., v1.x). The goal is to test a minimal working example that can serve as a foundation for future full-model expansions.
Feel free to flesh out more details in the issue (I'll be out starting tomorrow). @DavidOry @ActivitySim/engineering
Starting a list of questions/thoughts for the @ActivitySim/product team:

1. Implementing a data model for a subset of ActivitySim could be a relatively low-cost/low-effort way to demonstrate the value of a data model as a documentation and verification tool, as well as to improve how ActivitySim handles categorical variables (see #948). This could also give us a better understanding of the benefits and costs of data models in the context of ActivitySim 2.0.
2. The goals of using a data model in this limited way could be to (i) improve model documentation, (ii) remove strings prior to model execution rather than using them in utility expressions and/or coercing strings to categoricals or enums, (iii) reduce redundant variable definitions, and (iv) reduce problems with categorical variables (see #948) in the software. Anything else?
3. A logical place to start may be with "trips", with the data model for trips incorporated into the trip mode choice model code. Trip mode choice is, for many ActivitySim implementations, the last travel modeling step in the sequence, with subsequent ActivitySim steps devoted to writing trip tables and other administrative tasks. If the data model integration changed a data type, from, say, a categorical to an enum, the downstream changes would be minimized if we started with trip mode choice.
4. In the current ActivitySim implementation, variable names and data types are written to text files by the `write_data_dictionary` method. A data model has the ability to improve on this by giving the model owner a logical place to define variables, data types, and enumerated values that can be converted to searchable documentation (e.g., see the example assembled for ActivitySim previously). Data models can also be easily exported to Word or Excel data dictionaries. If fully implemented, data models could remove the need for the `write_data_dictionary` code.
5. In the current ActivitySim implementation, integer mapping of modes can be done by defining a dictionary in the model-specific YAML file (e.g., see here). This allows for strings (rather than integer indices) to be used in utility and pre-processor calculations (e.g., see here). But the scope of the dictionaries is individual models, which could require the same dictionaries to be created more than once. See, for example, the definition of county integers in the `prototype_mtc` model in the Free Parking and Automobile Ownership steps.
6. Where should the data models live? Let's assume for now that we use Pydantic, so the data models would be Python modules. Should they live in a separate repository? In the `configs` folder with the UECs? They will be model-implementation specific.
7. Another useful feature of data models is verification. A data model for ActivitySim trips could be used to verify ActivitySim output, i.e., are all the variables created by ActivitySim defined in the data model? Do we want to explore this in the initial implementation? This would be analogous to the `input_checker`, though for the output. And given the possibility that ActivitySim 2.0 moves away from pandas, we should probably not use the `pandera` package (the `pydantic` package can verify data directly).
8. Data models can also be used to calculate derived variables when invoked. This could set the stage for replacing "pre-processing" or "post-processing" steps with the data model. Model owners, in this case, could write Python expressions to calculate variables rather than using utility expressions. Do we want to explore this in the initial implementation?
9. If we proceed in this direction, should the testing be done on the MTC prototype or the SANDAG example? What are the downstream implications if data types are changed (e.g., from strings to enums)?
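To make points 2 and 7 concrete, here is a minimal sketch of what a Pydantic trip data model might look like. The enum values and field names are hypothetical, not taken from any ActivitySim implementation:

```python
from enum import IntEnum
from pydantic import BaseModel, ValidationError

# Hypothetical trip mode enumeration; a real implementation would
# mirror the modes used in the trip mode choice utility expressions.
class TripMode(IntEnum):
    WALK = 1
    BIKE = 2
    DRIVE_ALONE = 3

# Hypothetical trip record; field names are illustrative only.
class Trip(BaseModel):
    trip_id: int
    person_id: int
    trip_mode: TripMode

# Integer codes coerce to the enum, so the software can keep
# integer-like storage while users see readable names.
trip = Trip(trip_id=1, person_id=10, trip_mode=3)
assert trip.trip_mode is TripMode.DRIVE_ALONE

# Values outside the enumeration are rejected at validation time,
# which is the verification behavior described in point 7.
try:
    Trip(trip_id=2, person_id=10, trip_mode=99)
except ValidationError:
    rejected = True
```

The same class definitions could then be exported to searchable documentation or a Word/Excel data dictionary, per point 4.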
Starting a list of questions for the community team:
- Are any agencies interested in piloting this?
- How would users feel about the exchange of value, i.e., having to maintain a data model in order for ActivitySim to run, in exchange for improved documentation and less error-prone utility expressions?
- Would users want the option to write variable derivations in Python rather than in CSV files (see point 8 above)?
- How are users currently creating data dictionaries for their ActivitySim output?
Starting a list of questions for the @ActivitySim/engineering team:
- Would this make the code easier or harder to maintain?
- Once a data model for trips is created, how much work would it be to incorporate it into the `trip_mode_choice` code?
@DavidOry Regarding pandera, they do offer support for other DataFrame libraries, including Polars. Pandera is also integrated with Pydantic now, which could offer performance gains compared to other methods of validating DataFrames in Pydantic, such as having to convert the data to a list of dictionaries.
@DavidOry @i-am-sijia Have you done any recent testing with Pydantic, such as running the example data model on a full set of data?
To evaluate the performance of the Pandera/Pydantic integration I mentioned above, I wrote a test (available on Google Colab) that uses a pandas DataFrame with one column and 1 million records with values between 0 and 10. The test performs validation using Pydantic with and without Pandera. On one of our modeling computers, Pydantic alone processes the entire column in about 5 seconds, versus about 0.03 seconds when using Pandera.
Regarding 7 & 8: I feel that, ideally, the data model would be integrated into the pipeline to perform data verification/validation within steps and potentially be used to calculate derived variables to, as you say, set the stage for replacing pre- and post-processors. I know it’s a low probability, but there is always a chance that the data model gets implemented for all models/steps in ActivitySim 1.x instead of waiting for 2.0. With that in mind, I worry that a Pydantic data model will be slow and inefficient from a performance perspective, especially if it includes row-based relationships that are processed and validated through loops, with which Pandera integration cannot help.
At a recent engineering meeting, Jeff demonstrated an ActivitySim 2.0 prototype using JAX that included its own data model relying on neither Pydantic nor Pandera. So there is no guarantee that the data model libraries and design we choose for this work will be used for ActivitySim 2.0.
The Pandera package itself offers some interesting features that seem ready-made for the points Dave makes in 7 & 8 and elsewhere, including 'Preprocessing with Parsers' and 'Decorators for Pipeline Integration'. I really think it's worth considering Pandera for this work.
Thanks!
@stefancoe
Thanks for sharing the notebook. It's interesting to see how pandera is incorporating pydantic elements -- I like it. I don't have a strong preference for this initial prototype, as I think both can accomplish our first set of objectives on this arc. (I remain leery of locking ourselves into dataframes for the longer term, but that's a separate conversation, and this is a pretty dynamic space -- as illustrated by your JAX reference -- so I'm sure we'll pivot a few more times regardless of the choices we make now.) I think our primary goal here is to introduce the idea of a data model and start using it.
Our recent experience with Pydantic 2.x has been with survey data processing (we're presenting it as part of the MoMo Session on survey standardization), in which case the small runtime differences were not relevant.
Perhaps this is a good segue into identifying and prioritizing the requirements? Is verification/validation high on your list? Here's my first pass in prioritized order:
1. Right now users define trip modal options as strings and `activitysim` converts these strings to factors. With pandas 2.0, working with factors/categorical variables is more difficult, requiring awkward workarounds. The first requirement of the solution is therefore to allow users to define trip modal options as enums, which will allow them to use string-like words to refer to the modal options, while allowing the `activitysim` software to use integer-like representations. (It seems `pandera` and `pydantic` can do this equally well.)
2. When a user defines the trip modal options, we would like them to be able to do so in one and only one place. Ideally, they would not define them in a settings file, and then in the utility expressions, and then in their outside-the-model documentation, and then in standalone summary scripts, etc. Once this definition is established, we want to use it in `activitysim`. And use it when we create automated model documentation. And when we are doing ad hoc data analysis of the model output. We want to define once and use over and over. (It seems `pandera` and `pydantic` can do this equally well.)
3. We want the user to see value in creating these definitions and want to do it, i.e., we want this to replace things and make things more efficient, not just add another layer. If this is not seen as a net win for the user, then we are better off retaining the awkward workaround in the code and letting the user keep doing what they are doing. For this reason, I think demonstrating how the data model could be used outside of `activitysim` is an important part of this work. (It seems `pandera` and `pydantic` can do this equally well.)
4. IFF we are successful with (1)-(3), then we can develop a broader strategy for using data models moving forward. Can we build one out for the full set of models? Are we using the right technology? If users like this approach to documentation, can we verify/validate the output to ensure alignment between the data model and the `activitysim` implementation? Can we demonstrate what it would look like if we used the data model to replace select pre- and post-processors? And on and on. This may be where `pandera`'s speed advantage becomes relevant, but I see this as a subsequent phase.
What do you think?
Thanks @DavidOry, this seems like a great approach and I like how you have prioritized tasks/requirements. Looking forward to its implementation!
I was asked to review this since I prioritized it for Phase 11C. I think that @DavidOry has defined it fairly well and implementing 1-3 from his Aug 28 comment for Trip Mode Choice would be a reasonable scope for this initial demonstration.