activitysim
Add option for output tables to be written as parquet files.
Is your feature request related to a problem? Please describe.
Output tables can currently be written as csv or h5. Parquet is a better file-based option than csv in both size and speed. PSRC's trip table is 2,100,000 KB as a csv file compared to 459,700 KB as a parquet file. Loading the csv file into a pandas DataFrame on my laptop takes 20 seconds, compared to 3 seconds for the parquet file. ActivitySim already uses parquet to store pipeline files.
Describe the solution you'd like
Currently, a config setting called 'h5_store' writes outputs to h5 when set to True and to csv when set to False or omitted, so csv is the default. I propose adding a setting called 'file_type' that allows three options: 'csv', 'h5', or 'parquet'. Its default would also be 'csv'. The h5_store setting would remain, and its current behavior would be unchanged. The two settings would interact as follows:
- When h5_store is set to True, outputs are written to h5.
- When h5_store is set to False (default) and file_type is not specified, outputs are written as csv.
- When h5_store is set to False (default) and file_type is specified, outputs are written in the specified format: csv, parquet, or h5.
- file_type is validated against the allowed values (csv, parquet, h5) using pydantic. ActivitySim will fail almost immediately with a useful error message if this setting is included with an invalid value.
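The precedence rules above can be sketched in a few lines of Python. This is only an illustration, not the code from the fork: the class `OutputTableSettings` and the function `resolve_output_format` are hypothetical names, and a plain dataclass check stands in for the pydantic validation so the snippet has no third-party dependencies.

```python
from dataclasses import dataclass

# Allowed values for the proposed file_type setting.
ALLOWED_FILE_TYPES = ("csv", "parquet", "h5")


@dataclass
class OutputTableSettings:
    """Hypothetical stand-in for the relevant part of the settings."""

    h5_store: bool = False   # existing setting, defaults to False
    file_type: str = "csv"   # proposed setting, defaults to 'csv'

    def __post_init__(self) -> None:
        # Fail fast on an unknown value. The proposal delegates this
        # check to pydantic; a manual check is used here instead.
        if self.file_type not in ALLOWED_FILE_TYPES:
            raise ValueError(
                f"file_type must be one of {ALLOWED_FILE_TYPES}, "
                f"got {self.file_type!r}"
            )


def resolve_output_format(settings: OutputTableSettings) -> str:
    """Return the output format implied by the two settings."""
    # h5_store keeps its current behavior and takes precedence.
    if settings.h5_store:
        return "h5"
    # Otherwise file_type decides, falling back to its 'csv' default.
    return settings.file_type
```

Under these assumptions, `resolve_output_format(OutputTableSettings(file_type="parquet"))` selects parquet, while constructing `OutputTableSettings(file_type="feather")` raises a `ValueError` before any model step runs.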
Describe alternatives you've considered
Another option would be to add a boolean setting like use_parquet, but a conflict would arise if both h5_store and use_parquet were set to True in a config file. If this request is accepted and we go with file_type, it may make sense to deprecate the h5_store setting at some point, especially if more file types are supported in the future.
Additional context
I have made these changes on a fork and will open a pull request.