pyarrow: TypeError: __cinit__() got an unexpected keyword argument 'times'
I stumbled over the following bizarre error when writing data to a collection:
in pyarrow._parquet.ParquetWriter.__cinit__()
TypeError: __cinit__() got an unexpected keyword argument 'times'
The error is caused by:
in Collection.write(self, item, data, metadata, npartitions, overwrite, epochdate, reload_items, **kwargs)
dd.to_parquet(data, self._item_path(item, as_string=True), overwrite=overwrite, compression="snappy", engine=self.engine, **kwargs)
in to_parquet(df, path, compression, write_index, append, overwrite, ignore_divisions, partition_on, storage_options, custom_metadata, write_metadata_file, compute, compute_kwargs, schema, name_function, filesystem, engine, **kwargs)
After some research I traced this to https://github.com/ranaroussi/pystore/blob/db73c64cdd486bdf037728d733b0380f7c2d2023/pystore/collection.py#L123, which adds the keyword argument 'times'. The argument is forwarded through all of the functions above, but neither dask nor ParquetWriter accepts it. It is added whenever any timestamp in the row index has a nanosecond component of 1, e.g. from measurements or imported data:
import pandas as pd
import numpy as np
from pystore import store
index = pd.date_range('1/1/2024 00:00:00', '1/1/2024 10:00:00', freq='1s')
index += pd.to_timedelta(np.random.default_rng().integers(low=0, high=10, size=len(index)), unit='ns')  # Add random offsets of 0-9 ns to mimic measurement inaccuracies
columns = ["A", "B", "C"]
data = np.random.rand(len(index), len(columns))
df = pd.DataFrame(data=data, index=index, columns=columns)
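To check whether a given index would hit this code path, something like the following could be used. It is only a sketch: it assumes the linked line tests for the value 1 among the nanosecond components of the index (that is my reading of it), and `has_one_ns_component` is a hypothetical helper, not part of pystore.

```python
import numpy as np
import pandas as pd


def has_one_ns_component(index: pd.DatetimeIndex) -> bool:
    """Return True if any timestamp has a nanosecond component of exactly 1.

    Assumption: this mirrors the condition under which pystore adds 'times'.
    """
    return bool((np.asarray(index.nanosecond) == 1).any())


base = pd.date_range("2024-01-01", periods=3, freq="1s")

# Whole-second timestamps: no nanosecond component at all.
clean = base

# Same timestamps with offsets of 0, 1 and 5 ns added.
dirty = base + pd.to_timedelta([0, 1, 5], unit="ns")

print(has_one_ns_component(clean))
print(has_one_ns_component(dirty))
print(has_one_ns_component(dirty.round("us")))
```

Running the helper on an index before calling `write` should show whether the stray nanoseconds are the trigger.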
Trying to save this data to a collection fails:
Store = store("ExampleStore")
collection = Store.collection("TestCollection")
collection.write("TestItem", df, overwrite=False)  # raises the TypeError shown above
while rounding the index beforehand will succeed:
df.index = df.index.round(freq="0.000001s")
collection.write("TestItemRoundedIndex", df, overwrite=False)
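Until this is fixed, the rounding can be wrapped in a small helper so the original DataFrame is left untouched. This is a workaround sketch, not a fix: `with_microsecond_index` is a hypothetical name, and it assumes that dropping everything below one microsecond is acceptable for the data.

```python
import numpy as np
import pandas as pd


def with_microsecond_index(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose DatetimeIndex is rounded to whole microseconds."""
    out = df.copy()
    out.index = out.index.round("us")
    return out


# Small frame with nanosecond offsets of 0, 1, 2 and 9 ns in the index.
index = pd.date_range("2024-01-01", periods=4, freq="1s")
index += pd.to_timedelta([0, 1, 2, 9], unit="ns")
df = pd.DataFrame({"A": np.arange(4.0)}, index=index)

safe = with_microsecond_index(df)

# The rounded copy has no nanosecond component left; df itself is unchanged.
print(safe.index.nanosecond.tolist())
print(df.index.nanosecond.tolist())

# The rounded copy would then be written as in the example above:
# collection.write("TestItemRoundedIndex", safe, overwrite=False)
```

Note that rounding can produce duplicate index values if two timestamps are less than a microsecond apart, so this only suits data sampled more coarsely than that.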
I can't see why the argument is inserted at this point. Is it a leftover from the version where fastparquet was the engine? Most users probably won't work with nanosecond resolution, but if a timestamp with a 1 ns component occurs by chance (through inaccuracies, measuring devices or similar), the cause is hard to track down.