DataTables.jl
`==` does not compare columns of `ZonedDateTime`s correctly
Comparing two `ZonedDateTime`s that represent the same "instant" (but in different time zones) with `==` returns `true`, but comparing them with `isequal` returns `false`.
julia> using TimeZones, DataFrames, DataTables
julia> ZonedDateTime(2016, 1, 1, TimeZone("America/Winnipeg")) == ZonedDateTime(2016, 1, 1, 6, TimeZone("UTC"))
true
julia> isequal(ZonedDateTime(2016, 1, 1, TimeZone("America/Winnipeg")), ZonedDateTime(2016, 1, 1, 6, TimeZone("UTC")))
false
DataFrames.jl maintains this convention:
julia> using TimeZones, DataFrames
julia> df_1 = DataFrame(id=[1,2], date=[ZonedDateTime(2016, 1, 1, 0, TimeZone("America/Winnipeg")), ZonedDateTime(2016, 1, 1, 1, TimeZone("America/Winnipeg"))])
2×2 DataFrames.DataFrame
│ Row │ id │ date │
├─────┼────┼───────────────────────────┤
│ 1 │ 1 │ 2016-01-01T00:00:00-06:00 │
│ 2 │ 2 │ 2016-01-01T01:00:00-06:00 │
julia> df_2 = DataFrame(id=[1,2], date=[ZonedDateTime(2016, 1, 1, 6, TimeZone("UTC")), ZonedDateTime(2016, 1, 1, 7, TimeZone("UTC"))])
2×2 DataFrames.DataFrame
│ Row │ id │ date │
├─────┼────┼───────────────────────────┤
│ 1 │ 1 │ 2016-01-01T06:00:00+00:00 │
│ 2 │ 2 │ 2016-01-01T07:00:00+00:00 │
julia> df_1 == df_2
true
julia> isequal(df_1, df_2)
false
...but DataTables.jl doesn't:
julia> using TimeZones, DataTables
julia> dt_1 = DataTable(id=[1,2], date=[ZonedDateTime(2016, 1, 1, 0, TimeZone("America/Winnipeg")), ZonedDateTime(2016, 1, 1, 1, TimeZone("America/Winnipeg"))])
2×2 DataTables.DataTable
│ Row │ id │ date │
├─────┼────┼───────────────────────────┤
│ 1 │ 1 │ 2016-01-01T00:00:00-06:00 │
│ 2 │ 2 │ 2016-01-01T01:00:00-06:00 │
julia> dt_2 = DataTable(id=[1,2], date=[ZonedDateTime(2016, 1, 1, 6, TimeZone("UTC")), ZonedDateTime(2016, 1, 1, 7, TimeZone("UTC"))])
2×2 DataTables.DataTable
│ Row │ id │ date │
├─────┼────┼───────────────────────────┤
│ 1 │ 1 │ 2016-01-01T06:00:00+00:00 │
│ 2 │ 2 │ 2016-01-01T07:00:00+00:00 │
julia> dt_1 == dt_2
false
It's no real mystery why, given the fairly terse definition of `==`:
@compat(Base.:(==))(dt1::AbstractDataTable, dt2::AbstractDataTable) = isequal(dt1, dt2)
I think that supporting `==` comparisons (rather than just calling `isequal` all the way down) would be preferable in this case.
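As a rough sketch of what that could look like, here is a toy version that uses plain tuples of vectors as stand-ins for tables. The names `columns_equal` and `columns_isequal` are hypothetical and not part of DataTables.jl; signed floating-point zeros are used only because they are a Base case where `==` and `isequal` disagree in the same direction as the `ZonedDateTime` example.

```julia
# Hypothetical helpers (not DataTables.jl API): a "table" here is just a
# tuple of column vectors.

# Compare column by column with `==`, so element-level `==` semantics apply.
columns_equal(a, b) =
    length(a) == length(b) && all(x == y for (x, y) in zip(a, b))

# Compare column by column with `isequal`, mirroring the current definition.
columns_isequal(a, b) =
    length(a) == length(b) && all(isequal(x, y) for (x, y) in zip(a, b))

t1 = ([1, 2], [0.0, 1.0])
t2 = ([1, 2], [-0.0, 1.0])

columns_equal(t1, t2)    # true:  0.0 == -0.0
columns_isequal(t1, t2)  # false: isequal(0.0, -0.0) is false
```

This only works because `==` on these element types returns a plain `Bool`.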
Version information:
julia> Pkg.status("DataTables")
- DataTables 0.0.3
julia> versioninfo()
Julia Version 0.6.0-rc3.0
Commit ad290e93e4* (2017-06-07 11:53 UTC)
Yeah definitely. `isequal` and `==` are separate functions in Base for a reason.
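For anyone following along, the split is visible with nothing but Base floating-point values. This is only an illustration of the general convention, not anything specific to DataTables.jl, and the variable names are made up for the example:

```julia
# `==` follows IEEE 754 semantics for floats; `isequal` is the equivalence
# used for hashing, so it treats NaN as equal to itself and distinguishes
# the signed zeros.
nan_eq       = NaN == NaN            # false
nan_isequal  = isequal(NaN, NaN)     # true
zero_eq      = 0.0 == -0.0           # true
zero_isequal = isequal(0.0, -0.0)    # false

# Set (like Dict keys) relies on `isequal` and `hash`, so the two zeros
# are stored as distinct elements.
n_zeros = length(Set([0.0, -0.0]))   # 2
```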
I'm sure I can get a PR in for this in fairly short order.
That would be fantastic. Thanks!
Well, I figured out why DataTables.jl has just been using `isequal`.
This is actually more complex to solve than I initially anticipated, owing to the fact that `==` checks between `NullableArray`s are broken.
julia> using NullableArrays
julia> a = NullableArray(1:3)
3-element NullableArrays.NullableArray{Int64,1}:
1
2
3
julia> b = NullableArray(1:3)
3-element NullableArrays.NullableArray{Int64,1}:
1
2
3
julia> a == b
ERROR: TypeError: non-boolean (Nullable{Bool}) used in boolean context
Stacktrace:
[1] ==(::NullableArrays.NullableArray{Int64,1}, ::NullableArrays.NullableArray{Int64,1}) at ./abstractarray.jl:1527
This, in turn, is because `==` comparisons between `Nullable`s return `Nullable{Bool}`, rather than `Bool`.
In my opinion, the best fix for this would be to provide `==` for `NullableArrays` and work with that. There was a PR to fix this in 2015, but it was never merged: https://github.com/JuliaStats/NullableArrays.jl/pull/84
I think I'm going to go ahead and take a shot at a PR here, but I suspect it isn't going to be pretty.
The problem is with `Nullable`, and fixing it in NullableArrays would require type piracy. `==` with NullableArray is kind of forced to be inconsistent or not to work at all, because `==` throws an error for `Nullable` in Base. The solution to this will be to move either to `Union{T, Null}` (in DataFrames) or to `DataValue{T}`.
I think the definition for `==` in DataValues.jl is ok at this point. I'm also in the process of adding a `DataValueArray` that also fixes this, and then I'm also going to have a `DataValueTable` that is based on that. Essentially that will be exactly the same design as the current `DataTable` approach, except it will use `DataValue` instead of `Nullable` to get around the restrictions that we have due to `Nullable` being in Base and `Nullable` not being special-cased for the data science stack. I'm optimistic that I should be able to release soon, but on the flip side, classes start tomorrow, so who knows :)
Makes sense to me. I've made changes to my code that's working with DataTables to work around this issue for the moment, and I won't spend time trying to make `==` work as expected (at least until `Nullable`s behave themselves a little better). Thanks, folks!