PythonCall.jl icon indicating copy to clipboard operation
PythonCall.jl copied to clipboard

`DataFrame(::PyPandasDataFrame)` converts date & datetime to bytes

Open tomdstone opened this issue 2 years ago • 18 comments

When doing some work involving dataframes in python via PythonCall, it seems like DataFrame(PyTable(p)) where p is a pandas data table converts the date and datetime columns into byte vectors. Is this issue related to the issue #265 with milliseconds vs microseconds, or due to a missing part of the DataFrame(::PyPandasDataFrame) implementation?

Here are a few minimal examples, in a conda environment with pandas.

using PythonCall
using Dates
using DataFrames

a = DataFrame(x = [now()])  # julia dataframe
b = pytable(a)              # pandas dataframe
c = PyTable(b)              # PyPandasDataFrame
d = DataFrame(c)            # julia dataframe again

This results in:

julia> c
1×1 PyPandasDataFrame
                        x
0 2023-04-13 14:36:13.939

julia> d
1×1 DataFrame
 Row │ x
     │ PyArray…
─────┼───────────────────────────────────
   1 │ UInt8[0xc0, 0x62, 0x31, 0x8c, 0x…

The same thing happens when initially defining b as a pandas dataframe, so the microsecond issue in #265 seems to not be the problem?

julia> b = pd.DataFrame([[dt.datetime.now()]])
Python DataFrame:
                           0
0 2023-04-13 14:46:57.940077

julia> c = PyTable(b)
1×1 PyPandasDataFrame       
                           0
0 2023-04-13 14:46:57.940077

julia> d = DataFrame(c)
1×1 DataFrame
 Row │ 0
     │ PyArray…
─────┼───────────────────────────────────
   1 │ UInt8[0xc8, 0xf9, 0xa5, 0x7d, 0x…

tomdstone avatar Apr 13 '23 18:04 tomdstone

If you replace the now() with today() in the specific example I gave, it converts dates correctly, but in my use case where I had two date columns and two datetime columns, both the date and datetime columns are converted to byte vectors

tomdstone avatar Apr 13 '23 18:04 tomdstone

Just looking at this now. It all comes down to the fact that numpy arrays containing numpy.datetime64 cannot be wrapped as a Julia array yet. Therefore it's falling back on converting each element, and because numpy.datetime64 actually implements the buffer protocol with itemtype=1byte, that gets converted to a PyArray{UInt8} as you're seeing.

So there's two things that could be fixed here:

  • Support wrapping arrays containing datetime64
  • Support converting datetime64 to something more useful than PyArray{UInt8}.

cjdoris avatar Apr 24 '23 17:04 cjdoris

(PS thanks for the report and the MWE)

cjdoris avatar Apr 24 '23 17:04 cjdoris

Just stumbled across that problem and solved it this way:

function pytimestamp_to_datetime(t::PyArray)
    DateTime(1970) + Second(reinterpret(Int64, t)[1] / 1000000000)
end

function pytimestamp_to_datetime(v::AbstractVector{<:PyArray})
    pytimestamp_to_datetime.(v)
end

EDIT: This only holds for datetime64[ns], otherwise the factor is different. I just filed a tentative PR for a general solution

hhaensel avatar Jun 29 '23 10:06 hhaensel

@cjdoris If I eliminate .values in line 61 of PyPandasDataFrame everything runs smoothly. Would that be an option or is there some performance issue or other issue that I am not aware of?

UPDATE: That doesn't work, columns will be vectors of Py objects

hhaensel avatar Jun 30 '23 10:06 hhaensel

If you replace the now() with today() in the specific example I gave, it converts dates correctly, but in my use case where I had two date columns and two datetime columns, both the date and datetime columns are converted to byte vectors

@tomdstone Could you check if also your use case where you had difficulties with time is working with the PR in place? Otherwise, please share an MWE. I could try to cover your use case as well.

hhaensel avatar Jul 01 '23 17:07 hhaensel

Just added support for timedelta and timedelta64 conversion to Dates.CompoundPeriod.

hhaensel avatar Jul 02 '23 21:07 hhaensel

@hhaensel sorry, I won't have time to implement it due to how work is right now (we are in the middle of a move). I might follow up in a few months, but I definitely wouldn't hold your breath waiting for me. I appreciate the thought though!

tomdstone avatar Jul 03 '23 14:07 tomdstone

Thanks for the reply. No need to hurry. I just filed the PR as I was in need of a solution for my use case.

hhaensel avatar Jul 03 '23 15:07 hhaensel

@cjdoris I added automatic conversion from Periods/CompoundPeriods to numpy:timedelta64 as this supports nanoseconds as well. When I use pytimedeltatype, I can only go down to microseconds. Is that a good decision? It seems that pandas can convert the timedelta64 to a timestamp.timedelta that handles nanoseconds correctly. For conversion from timedelta to Julia I used Dates.CompoundPeriod in order to be type stable. Otherwise one could use Nanosecond, but that seemed a bit weired to me. I know you're working for version 1 to come out soon. I hope my proposal fits in there somehow.

hhaensel avatar Jul 04 '23 06:07 hhaensel

Just want to share what's working now with the PR

julia> x = Py(Second(1))
Python: numpy.timedelta64(1,'s')

julia> pyconvert(Second, x)
1 second

julia> x = Py(Second(1) + Nanosecond(1))
Python: numpy.timedelta64(1000000001,'ns')

julia> y = pyconvert(Any, x)
1 second, 1 nanosecond

julia> typeof(y)
Dates.CompoundPeriod


julia> jdf = DataFrame(x = [now() + Second(rand(1:1000)) for _ in 1:100], y = [Second(n) for n in 1:100]);
julia> pdf = pytable(jdf)
Python:
                         x               y
0  2023-07-04 11:29:27.781 0 days 00:00:01
1  2023-07-04 11:30:02.781 0 days 00:00:02
2  2023-07-04 11:40:17.781 0 days 00:00:03
3  2023-07-04 11:31:11.781 0 days 00:00:04
           ... 5 more lines ...
98 2023-07-04 11:37:21.781 0 days 00:01:39
99 2023-07-04 11:35:53.781 0 days 00:01:40

[100 rows x 2 columns]

julia> jdf2 = DataFrame(PyTable(pdf))
100×2 DataFrame
 Row │ x                        y
     │ DateTime                 Compound…   
─────┼──────────────────────────────────────
   1 │ 2023-07-04T11:29:27.781  1 second
   2 │ 2023-07-04T11:30:02.781  2 seconds
   3 │ 2023-07-04T11:40:17.781  3 seconds
  ⋮  │            ⋮                  ⋮
  99 │ 2023-07-04T11:37:21.781  99 seconds
 100 │ 2023-07-04T11:35:53.781  100 seconds
                             95 rows omitted

hhaensel avatar Jul 04 '23 09:07 hhaensel

This issue has been marked as stale because it has been open for 30 days with no activity. If the issue is still relevant then please leave a comment, or else it will be closed in 7 days.

github-actions[bot] avatar Aug 22 '23 01:08 github-actions[bot]

@cjdoris I think this is still relevant

hhaensel avatar Aug 22 '23 06:08 hhaensel

@cjdoris I think this is still relevant

I agree. Just stumbled on this when using pandas functionality to convert (legacy) xls data to xlsx. The dates became PyArray{UInt8}.

klwlevy avatar Aug 28 '23 15:08 klwlevy

I'd be ready to adapt the my PR to the latest changes of PythonCall. Should I continue to work on it? Is the refactoring complete or should I still wait a bit?

hhaensel avatar Jun 10 '24 21:06 hhaensel

Yep the refactor is done. Can you clarify what the PR will change?

cjdoris avatar Jun 13 '24 07:06 cjdoris

I think I summarised everything quite nicely above (https://github.com/JuliaPy/PythonCall.jl/issues/293#issuecomment-1619897477)

The related PR is #334, where you commented that you are refactoring.

hhaensel avatar Jun 14 '24 06:06 hhaensel

@cjdoris After refactoring I decided that it is easier to submit a new PR #509

hhaensel avatar Jun 17 '24 06:06 hhaensel