
Append error: TypeError: Cannot compare tz-naive and tz-aware timestamps

yohplala opened this issue 4 years ago • 6 comments

Hello,

I am passing a tz-aware dataframe to pystore's append(), and I get this error message:

 collection.append(item_ID, df, npartitions=item.data.npartitions)
  File "C:\Users\pierre.juillard\Documents\Programs\Anaconda\lib\site-packages\pystore\collection.py", line 184, in append
    combined = dd.concat([current.data, new]).drop_duplicates(keep="last")
  File "C:\Users\pierre.juillard\Documents\Programs\Anaconda\lib\site-packages\dask\dataframe\multi.py", line 1070, in concat
    for i in range(len(dfs) - 1)
  File "C:\Users\pierre.juillard\Documents\Programs\Anaconda\lib\site-packages\dask\dataframe\multi.py", line 1070, in <genexpr>
    for i in range(len(dfs) - 1)
  File "pandas\_libs\tslibs\c_timestamp.pyx", line 109, in pandas._libs.tslibs.c_timestamp._Timestamp.__richcmp__
  File "pandas\_libs\tslibs\c_timestamp.pyx", line 169, in pandas._libs.tslibs.c_timestamp._Timestamp._assert_tzawareness_compat
TypeError: Cannot compare tz-naive and tz-aware timestamps

[EDIT] Here is code that can simply be copied and pasted to reproduce the error message. Does anyone see what I could possibly be doing wrong?

import pandas as pd
import pystore

ts_list = ['Sun Dec 22 2019 07:40:00 GMT-0100',
           'Sun Dec 22 2019 07:45:00 GMT-0100',
           'Sun Dec 22 2019 07:50:00 GMT-0100',
           'Sun Dec 22 2019 07:55:00 GMT-0100']

op_list = [7134.0, 7134.34, 7135.03, 7131.74]

GC = pd.DataFrame(list(zip(ts_list, op_list)), columns=['date', 'open'])

# Parse the timestamp strings and convert them to UTC (tz-aware)
GC['date'] = pd.to_datetime(GC['date'], utc=True)

# Rename the column
GC.rename(columns={'date': 'Timestamp'}, inplace=True)

# Set the timestamp column as the index
GC.set_index('Timestamp', inplace=True, verify_integrity=True)

# Connect to the datastore (created if it does not exist)
store = pystore.store('OHLCV')
# Access a collection (created if it does not exist)
collection = store.collection('AAPL')
item_ID = 'EOD'

# Write all rows but the last, then append the last one
collection.write(item_ID, GC[:-1], overwrite=True)
item = collection.item(item_ID)
collection.append(item_ID, GC[-1:], npartitions=item.data.npartitions)

Thank you for your help. Have a good day. Best, Pierrot

yohplala avatar Jan 09 '20 07:01 yohplala

Hello, I have updated the code so that anyone can execute it in a terminal and reproduce the error (the previous code did not work on its own; it needed a data file, so I made an extract and embedded it in the code). Thanks in advance for any help and advice. Best, Pierrot

yohplala avatar Jan 10 '20 06:01 yohplala

[ADDITION] OK, I first tested the use of pandas' concat() function (without pystore), and I do not get the error message. Would that mean the trouble comes from dask's dataframe handling?

The following code (direct use of pandas, not pystore/dask/parquet) works:

import pandas as pd

ts_list = ['Sun Dec 22 2019 07:40:00 GMT-0100',
           'Sun Dec 22 2019 07:45:00 GMT-0100',
           'Sun Dec 22 2019 07:50:00 GMT-0100',
           'Sun Dec 22 2019 07:55:00 GMT-0100']

op_list = [7134.0, 7134.34, 7135.03, 7131.74]

GC = pd.DataFrame(list(zip(ts_list, op_list)), columns=['date', 'open'])

# Parse the timestamp strings and convert them to UTC (tz-aware)
GC['date'] = pd.to_datetime(GC['date'], utc=True)

# Rename the column
GC.rename(columns={'date': 'Timestamp'}, inplace=True)

# Set the timestamp column as the index
GC.set_index('Timestamp', inplace=True, verify_integrity=True)

# Same concat/drop_duplicates pattern as pystore's append(), in pure pandas
combined = pd.concat([GC[:-1], GC[-1:]]).drop_duplicates(keep="last")
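For comparison, here is a dask-level sketch along the same lines (hypothetical, assuming dask is installed; this is not what pystore runs internally): concatenating two tz-aware dask frames built directly with dd.from_pandas also works, which suggests the problem lies in how the stored item is read back rather than in dd.concat itself.

import dask.dataframe as dd

# Build two dask frames from the tz-aware pandas frame above;
# from_pandas preserves the tz-aware index dtype and divisions
left = dd.from_pandas(GC[:-1], npartitions=1)
right = dd.from_pandas(GC[-1:], npartitions=1)

# Same concat/drop_duplicates pattern as pystore's append()
combined = dd.concat([left, right]).drop_duplicates(keep="last").compute()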

The problem is not solved.

yohplala avatar Jan 10 '20 06:01 yohplala

Hmm, it seems I cannot reproduce the error in a standalone script without rewriting collection.py in depth. I am stopping the investigation here (I thought it might be an error in my dataframe formatting, which I could then submit to Stack Overflow, the pandas GitHub, or the dask GitHub if it was dask-related), but I have no clue where the bug is without digging further into dask.

As this is not my priority at the moment, I will only use pystore's write() function; when I have to append data, I will do it with pandas' concat() function, then write() with pystore using overwrite=True.

I hope this trouble in a Windows 10 environment can be solved (I suspect this error and the need to pass 'npartitions=item.data.npartitions' to the append() function may actually be linked).
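For anyone who wants to dig further, a hypothetical diagnostic (a sketch reusing the store created by the snippet above) is to inspect what dask sees for the stored item, in particular whether the index dtype and the division boundaries keep their timezone after the parquet round-trip:

import pystore

store = pystore.store('OHLCV')
collection = store.collection('AAPL')
item = collection.item('EOD')

# Index dtype as seen through the dask dataframe backing the item
print(item.data.index.dtype)
# Index dtype after materialising the item to pandas
print(item.to_pandas().index.dtype)
# Division boundaries that dask compares during concat; tz-naive values
# here would explain the "Cannot compare tz-naive and tz-aware" error
print(item.data.divisions)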

Have a good day. Best, Pierrot

yohplala avatar Jan 10 '20 07:01 yohplala

For those in the same situation, here is an ugly workaround whose logic I mention in the comment above.

import pandas as pd
import pystore

ts_list = ['Sun Dec 22 2019 07:40:00 GMT-0100',
           'Sun Dec 22 2019 07:45:00 GMT-0100',
           'Sun Dec 22 2019 07:50:00 GMT-0100',
           'Sun Dec 22 2019 07:55:00 GMT-0100']

op_list = [7134.0, 7134.34, 7135.03, 7131.74]

GC = pd.DataFrame(list(zip(ts_list, op_list)), columns=['date', 'open'])

# Parse the timestamp strings and convert them to UTC (tz-aware)
GC['date'] = pd.to_datetime(GC['date'], utc=True)

# Rename the column
GC.rename(columns={'date': 'Timestamp'}, inplace=True)

# Set the timestamp column as the index
GC.set_index('Timestamp', inplace=True, verify_integrity=True)

# Connect to the datastore (created if it does not exist)
store = pystore.store('OHLCV')
# Access a collection (created if it does not exist)
collection = store.collection('AAPL')
item_ID = 'EOD'
collection.write(item_ID, GC[:-1], overwrite=True)

# WORKAROUND
# Re-create an append function: read the stored item back as pandas,
# concatenate with pandas, then overwrite the item

item = collection.item(item_ID)
current = item.to_pandas()
combined = pd.concat([current, GC[-1:]]).drop_duplicates(keep="last")
collection.write(item_ID, combined, overwrite=True)

Best,

yohplala avatar Jan 10 '20 07:01 yohplala

I think that https://github.com/ranaroussi/pystore/blob/master/pystore/collection.py#L181 should be combined = dd.concat([current.to_pandas(), new]).drop_duplicates(keep="last") instead of the current combined = dd.concat([current.data, new]).drop_duplicates(keep="last")

@ranaroussi, could you confirm?
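For readers following along, here is a sketch of the suggested change (hypothetical until confirmed; only the concat line differs):

# pystore/collection.py, the line linked above, as it currently is:
combined = dd.concat([current.data, new]).drop_duplicates(keep="last")

# Suggested replacement: materialise the stored item to pandas first,
# the idea being that its index comes back tz-aware, matching `new`,
# before dask compares the two during the concat
combined = dd.concat([current.to_pandas(), new]).drop_duplicates(keep="last")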

sdementen avatar Nov 26 '20 05:11 sdementen

Probably related to issue https://github.com/dask/dask/issues/6925.

sdementen avatar Dec 03 '20 14:12 sdementen