duckdb-wasm icon indicating copy to clipboard operation
duckdb-wasm copied to clipboard

Wrong Date type check for Arrow tables

Open mootari opened this issue 1 year ago • 10 comments

What happens?

Dates with millisecond resolution in Arrow tables are misdetected asdate instead of date64[ms].

When an Arrow table is created via

arrow.tableFromJSON([{date: new Date()}])

the infered type is Date (typeId 8) with a unit of 1 (source). However, arrowToSQLType() checks against DateMillisecond(typeId -14) (source).

(It is unclear whether this is a bug in the Arrow implementation or DuckDB integration.)

To Reproduce

observablehq.com example, via DuckDBClient:

{
  const arrow = await import("https://cdn.observableusercontent.com/npm/[email protected]/+esm");
  const table = arrow.tableFromJSON([{date: new Date("2023-02-01T12:34:56Z")}]);
  const type = table.schema.fields[0].type;
  const db = await DuckDBClient.of({t: table});
  const result = (await db.queryRow("select date d from t")).d.toISOString();
  return {result, type, typeId: type.typeId};
}

Browser/Environment:

Chrome 111

Device:

macOS 13.12.1

DuckDB-Wasm Version:

1.24

DuckDB-Wasm Deployment:

observablehq.com

Full Name:

Fabian Iwand

Affiliation:

Observable

mootari avatar Apr 04 '23 12:04 mootari

I'm happy to help with getting a fix into arrow if needed.

domoritz avatar Apr 07 '23 16:04 domoritz

It would be very welcome, I am not familiar yet with the Arrow conversion code and it's taking way too long to investigate this.

carlopi avatar Apr 07 '23 18:04 carlopi

I don't have time to investigate unfortunately but if there is a bug in Arrow, I can help with that.

domoritz avatar Apr 07 '23 18:04 domoritz

Thanks a lot! Will update with what I find.

carlopi avatar Apr 07 '23 18:04 carlopi

According to the enum inline docs any type values outside 0-17 are only used within TypeScript.

Am I reading that correctly to mean that the current behavior in Arrow works as designed, and that DuckDB would have to check the type's unit prop against DateUnit.DAY and DateUnit.MILLISECOND?

mootari avatar Apr 07 '23 20:04 mootari

Hi, is there anything I can do to help?

mootari avatar Apr 17 '23 15:04 mootari

Hi there. Any updates on this? I believe I just ran into the same issue.

dunstan avatar Oct 18 '23 02:10 dunstan

With a small repro case here, it look like this is a bug in Arrow JS, where it's incorrectly inferring a date type instead of datetime.

kylebarron avatar Dec 14 '23 18:12 kylebarron

Let's fix it upstream. I am happy to help get a fix merged. Who wants to make a pull request?

domoritz avatar Dec 14 '23 20:12 domoritz

I'm not in any position to make the pull request, @domoritz , but I'm pretty sure the problem is at https://github.com/apache/arrow/blob/6c326db6a5686a78bc77be662b61236ddbfc66dc/js/src/factories.ts#L148C27-L148C42; I'm assuming DateMillisecond (which feels like a trap of a type, since being a Date kind of makes millisecond accuracy unimportant?) should really be TimestampMillisecond.

marvold-mw avatar Dec 19 '23 19:12 marvold-mw

I'm trying to investigate where the issue is.

const date = new Date("2023-02-01T12:34:56Z");
console.log("original", date)

console.log("=> vec")
const vec = arrow.vectorFromArray([date])
console.log(vec.type)
console.log(vec.get(0)?.toISOString())

console.log("=> vec2")
const vec2 = arrow.vectorFromArray([date], new arrow.TimestampMillisecond)
console.log(vec2.type)
console.log(vec2.get(0))

console.log("=> table")
const table = arrow.tableFromJSON([{ date }])
console.log(table.schema.fields[0].type)
console.log(table.getChildAt(0)?.get(0).toISOString())

console.log("=> table2")
const table2 = arrow.tableFromIPC(arrow.tableToIPC(table));
console.log(table2.schema.fields[0].type)
console.log(table2.getChildAt(0)?.get(0).toISOString())

outputs (as of 96f686b81)

original 2023-02-01T12:34:56.000Z
=> vec
DateMillisecond {
  typeId: 8,
  unit: 1,
  toString: [Function: toString],
  ArrayType: [class Int32Array],
  [Symbol(Symbol.toStringTag)]: "Date",
  children: null,
  OffsetArrayType: [class Int32Array],
}
2023-02-01T12:34:56.000Z
=> vec2
TimestampMillisecond {
  typeId: 10,
  unit: 1,
  timezone: undefined,
  toString: [Function: toString],
  ArrayType: [class Int32Array],
  [Symbol(Symbol.toStringTag)]: "Timestamp",
  children: null,
  OffsetArrayType: [class Int32Array],
}
1675254896000
=> table
DateMillisecond {
  typeId: 8,
  unit: 1,
  toString: [Function: toString],
  ArrayType: [class Int32Array],
  [Symbol(Symbol.toStringTag)]: "Date",
  children: null,
  OffsetArrayType: [class Int32Array],
}
2023-02-01T12:34:56.000Z
=> table2
Date_ {
  typeId: 8,
  unit: 1,
  toString: [Function: toString],
  ArrayType: [class Int32Array],
  children: null,
  OffsetArrayType: [class Int32Array],
}
2023-02-01T12:34:56.000Z

Looking at the schema docs in https://github.com/apache/arrow/blob/main/format/Schema.fbs it seems like DateMillisecond (date with unit milliseconds) is the right type since js dates are timezone agnostic. However, the Rust docs clarify that date should only encode the days and not the time. This seems like a compatibility type and we should discourage it similar to how rust did it recently: https://github.com/apache/arrow-rs/issues/5288. So yeah, I think we should use TimestampMillisecond with no timezone.

domoritz avatar Mar 29 '24 14:03 domoritz

Sent a fix in https://github.com/apache/arrow/pull/40892. Please take a look to see whether it now does what you expect.

domoritz avatar Mar 29 '24 15:03 domoritz

Sent a fix in apache/arrow#40892. Please take a look to see whether it now does what you expect.

While I'm not the original reporter, your fix does do what I would expect when using JavaScript types with Arrow and (via Arrow) with DuckDB. The TIMESTAMP type in DuckDB now comes back, which seems like the correct mapping for JS Date semantics. And (as your unit tests confirm) Arrow converts them back to JavaScript Dates just fine.

Thanks for the fix, Dom!

marvold-mw avatar Mar 29 '24 20:03 marvold-mw

Fixed in https://github.com/apache/arrow/issues/40891.

domoritz avatar Apr 03 '24 22:04 domoritz