duckdb-wasm
duckdb-wasm copied to clipboard
Wrong Date type check for Arrow tables
What happens?
Dates with millisecond resolution in Arrow tables are misdetected asdate
instead of date64[ms]
.
When an Arrow table is created via
arrow.tableFromJSON([{date: new Date()}])
the infered type is Date
(typeId 8) with a unit
of 1
(source). However, arrowToSQLType()
checks against DateMillisecond
(typeId -14) (source).
(It is unclear whether this is a bug in the Arrow implementation or DuckDB integration.)
To Reproduce
observablehq.com example, via DuckDBClient:
{
const arrow = await import("https://cdn.observableusercontent.com/npm/[email protected]/+esm");
const table = arrow.tableFromJSON([{date: new Date("2023-02-01T12:34:56Z")}]);
const type = table.schema.fields[0].type;
const db = await DuckDBClient.of({t: table});
const result = (await db.queryRow("select date d from t")).d.toISOString();
return {result, type, typeId: type.typeId};
}
Browser/Environment:
Chrome 111
Device:
macOS 13.12.1
DuckDB-Wasm Version:
1.24
DuckDB-Wasm Deployment:
observablehq.com
Full Name:
Fabian Iwand
Affiliation:
Observable
I'm happy to help with getting a fix into arrow if needed.
It would be very welcome, I am not familiar yet with the Arrow conversion code and it's taking way too long to investigate this.
I don't have time to investigate unfortunately but if there is a bug in Arrow, I can help with that.
Thanks a lot! Will update with what I find.
According to the enum inline docs any type values outside 0-17 are only used within TypeScript.
Am I reading that correctly to mean that the current behavior in Arrow works as designed, and that DuckDB would have to check the type's unit
prop against DateUnit.DAY
and DateUnit.MILLISECOND
?
Hi, is there anything I can do to help?
Hi there. Any updates on this? I believe I just ran into the same issue.
With a small repro case here, it look like this is a bug in Arrow JS, where it's incorrectly inferring a date type instead of datetime.
Let's fix it upstream. I am happy to help get a fix merged. Who wants to make a pull request?
I'm not in any position to make the pull request, @domoritz , but I'm pretty sure the problem is at https://github.com/apache/arrow/blob/6c326db6a5686a78bc77be662b61236ddbfc66dc/js/src/factories.ts#L148C27-L148C42; I'm assuming DateMillisecond (which feels like a trap of a type, since being a Date kind of makes millisecond accuracy unimportant?) should really be TimestampMillisecond.
I'm trying to investigate where the issue is.
const date = new Date("2023-02-01T12:34:56Z");
console.log("original", date)
console.log("=> vec")
const vec = arrow.vectorFromArray([date])
console.log(vec.type)
console.log(vec.get(0)?.toISOString())
console.log("=> vec2")
const vec2 = arrow.vectorFromArray([date], new arrow.TimestampMillisecond)
console.log(vec2.type)
console.log(vec2.get(0))
console.log("=> table")
const table = arrow.tableFromJSON([{ date }])
console.log(table.schema.fields[0].type)
console.log(table.getChildAt(0)?.get(0).toISOString())
console.log("=> table2")
const table2 = arrow.tableFromIPC(arrow.tableToIPC(table));
console.log(table2.schema.fields[0].type)
console.log(table2.getChildAt(0)?.get(0).toISOString())
outputs (as of 96f686b81)
original 2023-02-01T12:34:56.000Z
=> vec
DateMillisecond {
typeId: 8,
unit: 1,
toString: [Function: toString],
ArrayType: [class Int32Array],
[Symbol(Symbol.toStringTag)]: "Date",
children: null,
OffsetArrayType: [class Int32Array],
}
2023-02-01T12:34:56.000Z
=> vec2
TimestampMillisecond {
typeId: 10,
unit: 1,
timezone: undefined,
toString: [Function: toString],
ArrayType: [class Int32Array],
[Symbol(Symbol.toStringTag)]: "Timestamp",
children: null,
OffsetArrayType: [class Int32Array],
}
1675254896000
=> table
DateMillisecond {
typeId: 8,
unit: 1,
toString: [Function: toString],
ArrayType: [class Int32Array],
[Symbol(Symbol.toStringTag)]: "Date",
children: null,
OffsetArrayType: [class Int32Array],
}
2023-02-01T12:34:56.000Z
=> table2
Date_ {
typeId: 8,
unit: 1,
toString: [Function: toString],
ArrayType: [class Int32Array],
children: null,
OffsetArrayType: [class Int32Array],
}
2023-02-01T12:34:56.000Z
Looking at the schema docs in https://github.com/apache/arrow/blob/main/format/Schema.fbs it seems like DateMillisecond (date with unit milliseconds) is the right type since js dates are timezone agnostic. However, the Rust docs clarify that date should only encode the days and not the time. This seems like a compatibility type and we should discourage it similar to how rust did it recently: https://github.com/apache/arrow-rs/issues/5288. So yeah, I think we should use TimestampMillisecond with no timezone.
Sent a fix in https://github.com/apache/arrow/pull/40892. Please take a look to see whether it now does what you expect.
Sent a fix in apache/arrow#40892. Please take a look to see whether it now does what you expect.
While I'm not the original reporter, your fix does do what I would expect when using JavaScript types with Arrow and (via Arrow) with DuckDB. The TIMESTAMP type in DuckDB now comes back, which seems like the correct mapping for JS Date semantics. And (as your unit tests confirm) Arrow converts them back to JavaScript Dates just fine.
Thanks for the fix, Dom!
Fixed in https://github.com/apache/arrow/issues/40891.