
append error

Open flamby opened this issue 4 years ago • 5 comments

Hello,

Do you have any recommendations on importing data from arctic? I'm currently using cryptostore with arctic as a backend. Cryptostore is by the very same author as arctic, but loading trades as a dataframe takes too much time with it.

For now, this is what I did:

import pystore
from arctic import Arctic

exchange = "BITFINEX"
datastore = "mydatastore"

arctic_store = Arctic("localhost")
arctic_lib = arctic_store[exchange]
symbols = arctic_lib.list_symbols()

store = pystore.store(datastore)
collection = store.collection(exchange)
for symbol in symbols:
    df_src = arctic_lib.read(symbol)
    if symbol in collection.list_items():
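        # symbol was migrated before: append only the rows not yet in pystore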
        item = collection.item(symbol)
        df_dst = item.to_pandas()
        # https://stackoverflow.com/a/44318806
        df_diff = df_src[~df_src.index.isin(df_dst.index)]
        rows, columns = df_diff.shape
        if df_diff.empty:
            print("No new row to append...")
        else:
            print(f"Appending {rows} rows to {symbol} item")
            collection.append(symbol, df_diff)
    else:
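        # first migration of this symbol: write the full dataframe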
        rows, columns = df_src.shape
        print(f"Importing {symbol} for the first time w/ {rows} rows and {columns} columns")
        collection.write(symbol, df_src, metadata={'source': 'cryptostore'})

But I'm facing errors similar to #16 when the append happens, even after rolling back dask and fastparquet to previous releases.

    raise ValueError("Exactly one of npartitions and chunksize must be specified.")
ValueError: Exactly one of npartitions and chunksize must be specified.

My setup:

dask==2.6.0
fastparquet==0.3.2
numba==0.46.0

Thanks, and keep up the good work!

flamby avatar Nov 05 '19 11:11 flamby

It seems one has to retrieve npartitions from the original dask dataframe and pass it to append. So I fixed it this way:

collection.append(symbol, df_diff, npartitions=item.data.npartitions)

Will it work every time?
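In case it helps, a slightly more defensive variant of the same call (the getattr fallback is my own addition, untested):

n_parts = getattr(item.data, "npartitions", None) or 1
collection.append(symbol, df_diff, npartitions=n_parts)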

flamby avatar Nov 06 '19 11:11 flamby

Thank you @flamby!

This fix works; thanks for sharing and saving time for others.

The pystore demo notebook also works only with this fix; otherwise it throws an error:

ValueError: Exactly one of npartitions and chunksize must be specified.

Great thanks to @ranaroussi for this wonderful library.

viveksethu avatar Dec 07 '19 07:12 viveksethu

Thank you to @ranaroussi for this nice library, and thank you to @flamby, who fixed this nasty bug in the Windows 10 environment! I had exactly the same message ("Exactly one of npartitions and chunksize must be specified") and the append was impossible. Now it works. Thank you again.

XBKZ avatar Dec 15 '19 19:12 XBKZ

Hello, same here (Win10 environment)! Thanks for the fix, @flamby!

yohplala avatar Jan 09 '20 07:01 yohplala

The problem is that dd.from_pandas() checks:

if (npartitions is None) == (chunksize is None):
    raise ValueError("Exactly one of npartitions and chunksize must be specified.")

So when the append function calls dd.from_pandas(df, npartitions=None) it raises the error, but if you call dd.from_pandas(df, npartitions=None, chunksize=100000) it works. Presumably dask uses npartitions=1 as its default, even though the API says npartitions is optional and doesn't list a default.
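To illustrate, here is a minimal standalone reproduction against the dask versions in this thread (the dataframe is just a placeholder):

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({"price": [1.0, 2.0, 3.0]})

try:
    dd.from_pandas(df)  # neither argument given -> raises
except ValueError as e:
    print(e)  # Exactly one of npartitions and chunksize must be specified.

ddf = dd.from_pandas(df, npartitions=1)     # works
ddf = dd.from_pandas(df, chunksize=100000)  # works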

The code below is what needs to be tweaked. The new variable could be set to use npartitions=1 (new = dd.from_pandas(data, npartitions=1)), since this will be superseded by the passed value after the dataframes are combined. I'm willing to bet Ran comes up with a more elegant solution, though.

https://github.com/ranaroussi/pystore/blob/40de1d51236fd6b6b88909c83dc6d7297de4b471/pystore/collection.py#L180-L190
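A sketch of that tweak (paraphrased; the surrounding append() code is approximated from memory, not copied from the link):

# inside pystore's Collection.append(), approximate paraphrase:
# give dd.from_pandas() an explicit partition count so it never
# receives npartitions=None
new = dd.from_pandas(data, npartitions=1)
# the npartitions value passed by the caller still takes effect later,
# when the combined dataframe is written back to the collection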

JugglingNumbers avatar Feb 20 '20 15:02 JugglingNumbers