Evaluate data quality when loading CSV data with the pyarrow engine

Open sowdm opened this issue 11 months ago • 0 comments

For all CSV datasets in OPD, evaluate any differences in the resulting data when loading with and without the pyarrow engine (see pandas read_csv link below). If we ultimately use the pyarrow engine in OPD to load CSV files more efficiently, it is imperative that there are no issues with the resulting data (the pandas documentation indicates that not all features were available with the pyarrow engine as of pandas version 1.4.0).

The following will read the OPD source table, which lists all datasets: df = pd.read_csv("https://raw.github.com/openpolicedata/opd-data/main/opd_source_table.csv")

Filter the DataType column of df by CSV to get all CSV files. The URLs of the files are contained in the URL column. Please ignore any rows where the dataset_id column is not empty.
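
A minimal sketch of that filtering step, in case it helps. The column names (DataType, URL, dataset_id) come from the description above; the helper name select_csv_sources, the small in-memory table, and its example.com URLs are made up for illustration. Treating both NaN and blank strings as "empty" is an assumption about how the source table encodes a missing dataset_id.

```python
import pandas as pd


def select_csv_sources(source_table: pd.DataFrame) -> pd.DataFrame:
    """Keep rows whose DataType is CSV and whose dataset_id is empty.

    Hypothetical helper: "empty" is assumed to mean NaN or a blank string.
    """
    is_csv = source_table["DataType"].astype(str).str.strip().str.upper() == "CSV"
    dataset_id = source_table["dataset_id"]
    no_dataset_id = dataset_id.isna() | (dataset_id.astype(str).str.strip() == "")
    return source_table[is_csv & no_dataset_id]


# Illustrative stand-in for the real source table (values are invented).
example = pd.DataFrame(
    {
        "DataType": ["CSV", "ArcGIS", "CSV"],
        "URL": [
            "https://example.com/a.csv",
            "https://example.com/b",
            "https://example.com/c.csv",
        ],
        "dataset_id": [None, None, "abc123"],
    }
)

# Only the first row is a CSV entry without a dataset_id.
csv_urls = select_csv_sources(example)["URL"].tolist()
```

With the real table, the same call would be select_csv_sources(df)["URL"], where df is the source table loaded above.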

This should be evaluated outside of OPD using the pandas read_csv function. https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv
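
One possible way to structure the comparison is sketched below. The engine="pyarrow" argument to read_csv is the documented way to select that engine (it requires pyarrow to be installed). The helper names load_both and summarize_differences, the report layout, and the choice of checks (shape, column order, dtypes, per-column value mismatches) are assumptions for illustration, not an agreed definition of "no issues".

```python
import pandas as pd


def load_both(url: str) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Read the same CSV with the default engine and with the pyarrow engine."""
    default_df = pd.read_csv(url)
    pyarrow_df = pd.read_csv(url, engine="pyarrow")  # needs pyarrow installed
    return default_df, pyarrow_df


def summarize_differences(default_df: pd.DataFrame, pyarrow_df: pd.DataFrame) -> dict:
    """Hypothetical report of differences between the two loads."""
    report = {
        "same_shape": default_df.shape == pyarrow_df.shape,
        "same_columns": list(default_df.columns) == list(pyarrow_df.columns),
        "dtype_mismatches": {},
        "value_mismatches": {},
    }
    # Only columns present in both frames can be compared directly.
    for col in default_df.columns.intersection(pyarrow_df.columns):
        left = default_df[col].reset_index(drop=True)
        right = pyarrow_df[col].reset_index(drop=True)
        if left.dtype != right.dtype:
            report["dtype_mismatches"][col] = (str(left.dtype), str(right.dtype))
        if len(left) != len(right):
            continue  # row counts differ; already reflected in same_shape
        # Values match if they compare equal or are both missing.
        equal = (left == right).fillna(False).astype(bool) | (left.isna() & right.isna())
        n_diff = int((~equal).sum())
        if n_diff:
            report["value_mismatches"][col] = n_diff
    return report
```

A run over the filtered URLs might call load_both(url) for each one, pass the pair to summarize_differences, and collect the reports for any file where something differs. A dtype mismatch on its own (for example int64 versus float64) may be harmless, so it is reported separately from value mismatches.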

May 10 '25 19:05 sowdm