
Prototyping / Research on using Polars for future versions

Open jpn-- opened this issue 2 months ago • 0 comments

Shifting ActivitySim’s internals from Pandas to Polars could have some very tangible benefits, because Polars is designed for performance and scalability in ways that align nicely with the demands of large-scale activity-based travel models.


Performance & Parallelism

  • Multi-threaded execution: Unlike Pandas (which is mostly single-threaded), Polars automatically parallelizes operations across CPU cores. ActivitySim workloads — especially skimming, accessibility calculations, and large-scale household/person-level processing — can benefit from this parallelism without needing manual multiprocessing.
  • Vectorized Rust engine: Polars is built in Rust with efficient memory layout and SIMD acceleration. For big matrices (e.g., OD matrices, synthetic population attributes, choice utilities), this can lead to large speedups compared to Pandas.

Memory Efficiency

  • Columnar memory model (Apache Arrow under the hood): Polars uses Arrow-style columnar storage, which is cache-efficient and reduces memory footprint. This is important for ActivitySim because:

    • Synthetic populations can reach tens of millions of rows once the household, person, tour, and trip tables are counted together.
    • Columnar formats make it easier to pass around large data slices without copies, minimizing RAM spikes.
  • Lazy evaluation: Polars can defer execution and optimize query plans (like SQL databases). This reduces redundant computation, which is useful in ActivitySim pipelines where derived variables are often recomputed across models.


Scalable Joins and Group-bys

  • Much faster joins: ActivitySim frequently joins household, person, trip, and land-use data. Pandas runs joins and group-bys on a single thread; Polars runs its hash joins and group-bys in parallel across cores, which tends to scale much better when tables are very large.
  • Streaming group-by / aggregations: For logsum calculations and model summaries, Polars’ streaming engine can process data in batches rather than materializing every intermediate table, so it can handle larger datasets than Pandas within the same memory budget.

Interoperability

  • Arrow ecosystem: Since Polars uses Arrow internally, it integrates well with other tools in the data ecosystem — e.g., PyArrow for parquet/feather IO, or direct handoff to machine learning frameworks. ActivitySim already uses parquet in some workflows, so Polars makes this path more natural.
  • Export back to Pandas: For users who still want to work with Pandas for inspection, you can call .to_pandas() at the boundary of the modeling workflow. This gives developers the best of both worlds.

Developer Productivity

  • Expressive query language: Polars has an API that feels similar to Pandas but encourages a more SQL-like, chainable query style. This makes model specification pipelines more declarative and potentially easier to read/optimize.
  • Lazy API for pipeline optimization: You could imagine specifying model steps (skimming, accessibility, utilities, choices) as lazy transformations, then letting Polars optimize execution order — reducing redundant scans through big tables.

Specific ActivitySim Use Cases That Benefit

  1. Large synthetic populations (millions of agents) → parallel group-bys for person/household aggregations.
  2. Destination choice models → repeated joins across households/persons and zonal data, faster joins = faster simulation.
  3. Skimming & accessibility → heavy matrix operations and aggregations scale better with columnar, parallel execution.
  4. Choice model pipelines → lazy evaluation could eliminate unnecessary recalculations of derived variables.
  5. Diagnostics → fast descriptive stats and summaries over huge datasets without downsampling.

Downsides


Ecosystem & Compatibility Risks

  • Smaller ecosystem vs Pandas: Pandas is the de facto standard in Python data science. Many libraries (e.g., statsmodels, scikit-learn, PyMC, matplotlib) are built around Pandas DataFrames or NumPy arrays, and their Polars support is newer or partial.

    • ActivitySim relies on some of these for estimation/calibration and diagnostics, so you may need frequent .to_pandas() conversions — which could eat into performance gains.
  • Less community maturity: Pandas has more than fifteen years of community usage, bug fixes, and user expertise. Polars is newer and evolving quickly, which means less stability and fewer online examples/tutorials.


API & Feature Gaps

  • Not 1:1 Pandas replacement: Polars deliberately doesn’t implement every Pandas feature, and it has no row index at all. Common Pandas idioms in ActivitySim (e.g., hierarchical indexing, inplace mutation, complex apply calls with Python lambdas) may not translate directly.
  • Less flexible for arbitrary Python functions: Pandas allows you to shove in arbitrary Python code via .apply() or .transform(). Polars prefers vectorized or expression-based operations. While better for performance, this may require refactoring many ActivitySim expressions into Polars’ style.
  • Sparse or specialized data handling: If ActivitySim uses sparse matrices for skims or certain choice models, Polars doesn’t have strong built-in sparse support (though Arrow may help indirectly).

Performance Tradeoffs

  • Overhead for small dataframes: Polars really shines with large tables (millions of rows). For smaller model components (e.g., sample-of-alternatives, calibration routines), Polars may be slower than Pandas due to setup overhead.
  • Conversions back to Pandas: If parts of the workflow need Pandas (e.g., for estimation or visualization), constant conversion between Polars ↔ Pandas can neutralize performance benefits.

Developer & User Adoption

  • Learning curve: ActivitySim developers and users are very familiar with Pandas. Polars’ more functional, SQL-like API might be less intuitive, especially for non-programmer modelers who occasionally read or modify the pipeline code.
  • Breaking changes to APIs: If ActivitySim’s public-facing API currently exposes Pandas DataFrames (for inputs/outputs/inspection), moving to Polars may break user scripts or extensions. Supporting both might double maintenance effort.
  • Fewer debugging/inspection tools: Many debugging workflows in Python are Pandas-centric (Jupyter DataFrame display, Pandas-profiling, etc.). Polars integration is improving but not as seamless.

Strategic Risks

  • Lock-in to a newer library: Polars is growing fast, but it is still a relatively young project. A major shift in development direction could pose long-term maintenance risks for ActivitySim.
  • Consortium/user readiness: Because ActivitySim is consortium-driven, agencies and MPOs may be conservative about adopting workflows that diverge from the “standard” Pandas-based Python data science stack.

jpn--, Sep 30 '25 17:09