
Timestamp mapping

Open matasello opened this issue 7 months ago • 0 comments

Not sure I am doing this right, but I am trying to convert a CSV file containing a timestamp column to a Parquet file.

Sample CSV (no header row):

072e4a64-2ffb-437c-9458-4953abaa7a20,1,2023-01-18 23:05:10,104,-1,0
072e4a64-2ffb-437c-9458-4953abaa7a20,2,2023-01-18 23:05:10,104,-1,0
072e4a64-2ffb-437c-9458-4953abaa7a20,4,2023-01-18 23:05:10,104,-1,0
  1. First, I generate the schema with csv2parquet using the --max-read-records 5 -p options. It correctly infers the timestamp field:
    {
      "name": "ts",
      "data_type": {
        "Timestamp": [
          "Second",
          null
        ]
      },
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    },
  1. Then I do the actual conversion:

csv2parquet --header false --schema-file mt_status.json /dev/stdin mt_status.parquet

  1. Then I open the table in DuckDB. I can see all the records, but the timestamp field shows up as int64:
┌──────────────────────────────────────┬───────┬────────────┬──────────┬────────┬───────────┐
│                 guid                 │  st   │     ts     │ tsmillis │ result │ synthetic │
│               varchar                │ int16 │   int64    │  int16   │ int16  │   int16   │
├──────────────────────────────────────┼───────┼────────────┼──────────┼────────┼───────────┤
│ 072e4a64-2ffb-437c-9458-4953abaa7a20 │     1 │ 1674083110 │      104 │     -1 │         0 │
│ 072e4a64-2ffb-437c-9458-4953abaa7a20 │     2 │ 1674083110 │      104 │     -1 │         0 │
│ 072e4a64-2ffb-437c-9458-4953abaa7a20 │     4 │ 1674083110 │      104 │     -1 │         0 │
  1. The Parquet schema also shows the field as INT64:

│ mt_status.parquet │ ts │ INT64 │ │ REQUIRED │ │ │ │ │ │ │
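
As a sanity check, the int64 values above do look like the CSV timestamps encoded as seconds since the Unix epoch, which matches the "Second" unit in the inferred schema. A minimal Python sketch of that check, assuming the CSV values are interpreted as UTC (the schema has no timezone, so this is an assumption):

```python
from datetime import datetime, timezone

# Value DuckDB shows in the ts column of mt_status.parquet
raw_ts = 1674083110

# Assumption: the integer is seconds since the Unix epoch, read as UTC
decoded = datetime.fromtimestamp(raw_ts, tz=timezone.utc)

# Format the same way as the timestamps in the sample CSV
decoded_text = decoded.strftime("%Y-%m-%d %H:%M:%S")
print(decoded_text)
```

If the printed value matches the 2023-01-18 23:05:10 from the sample CSV, the data itself survived the conversion and only the timestamp type annotation is missing from the Parquet file.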

Any hints? Thanks.

matasello, Jul 04 '24 08:07