
pyarrow: TypeError: __cinit__() got an unexpected keyword argument 'times'

Open JAC28 opened this issue on Oct 17, 2024 • 0 comments

I stumbled over the following bizarre error when writing data to a collection:

in pyarrow._parquet.ParquetWriter.__cinit__()
TypeError: __cinit__() got an unexpected keyword argument 'times'

The error is caused by:

in Collection.write(self, item, data, metadata, npartitions, overwrite, epochdate, reload_items, **kwargs)
dd.to_parquet(data, self._item_path(item, as_string=True), overwrite=overwrite, compression="snappy", engine=self.engine, **kwargs)
in to_parquet(df, path, compression, write_index, append, overwrite, ignore_divisions, partition_on, storage_options, custom_metadata, write_metadata_file, compute, compute_kwargs, schema, name_function, filesystem, engine, **kwargs)

After some research I found the responsible line: https://github.com/ranaroussi/pystore/blob/db73c64cdd486bdf037728d733b0380f7c2d2023/pystore/collection.py#L123. It adds the keyword argument 'times', which is forwarded through all the intermediate functions but is not understood by dask's pyarrow engine or by ParquetWriter. The argument is only added when at least one entry of the row index has a nonzero nanosecond component, e.g. from measurements or imported data (see the check sketched after the snippet below):

import pandas as pd
import numpy as np
from pystore import store

index = pd.date_range('1/1/2024 00:00:00', '1/1/2024 10:00:00', freq='1s')
index += pd.to_timedelta(np.random.default_rng().integers(low=0, high=10, size=len(index)), unit='ns')  # add random 0-9 ns offsets, simulating measurement inaccuracies
columns = ["A", "B", "C"]
data = np.random.rand(len(index), len(columns))
df = pd.DataFrame(data=data, index=index, columns=columns)
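
For illustration (this check is mine, not part of pystore), you can confirm that the generated index really carries nonzero nanosecond components, which, as far as I can tell from the linked line, is exactly the condition that makes Collection.write add the extra keyword:

# Quick check (sketch): nonzero nanoseconds in the index are what triggers the
# extra 'times' keyword in Collection.write, as I read the linked line.
print(df.index.nanosecond.max())        # typically > 0 with the random offsets above
print(bool(df.index.nanosecond.any()))  # True -> the problematic keyword gets added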

If you try to save this data to a collection, the write will fail:

Store = store("ExampleStore")
collection = Store.collection("TestCollection")
collection.write("TestItem", df, overwrite=False)

while rounding the index beforehand will succeed:

df.index = df.index.round(freq="0.000001s")
collection.write("TestItemRoundedIndex", df, overwrite=False)
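
An equivalent way to phrase the workaround (my sketch, not from the pystore docs) is to round with the "us" alias, which strips the nanosecond component entirely:

# Equivalent workaround (sketch): round to microseconds so that no index entry
# keeps a nonzero nanosecond component and the 'times' keyword is never added.
df.index = df.index.round("us")
assert not df.index.nanosecond.any()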

I can't see why the argument is inserted at this point. Does it come from the time when fastparquet was the engine? The majority of users probably won't work at nanosecond resolution, but if an entry with a stray 1 ns shows up by chance, e.g. from inaccuracies, measuring devices or similar, tracking down the cause is difficult.
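
For what it's worth, 'times' does look like a fastparquet option: fastparquet's write() accepts a times parameter ('int64' or 'int96'), while pyarrow's ParquetWriter has no such argument. A minimal sketch, assuming fastparquet is installed (the file name is just an example):

import fastparquet

# 'times' controls how fastparquet stores timestamps ('int64' or 'int96');
# pyarrow's ParquetWriter does not accept this keyword, which is why forwarding
# it through dask's pyarrow engine raises the TypeError above.
fastparquet.write("TestItem.parquet", df, times="int96")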
