
Recognize integer timestamps when `dtypes=pl.Datetime` is passed to `read_csv`

DrMaphuse opened this issue 2 years ago · 4 comments

Correctly parsing timestamp columns stored as integers currently requires us to read them as int and cast them to datetime explicitly:

import polars as pl
from io import StringIO

csv = StringIO("""a
1659606994266166
""")
pl.read_csv(csv, dtypes=[pl.Int64]).with_column(pl.col('a').cast(pl.Datetime('us')))

If we try to pass the `pl.Datetime` dtype directly to `read_csv`, the value is read as null:

pl.read_csv(csv, dtypes=[pl.Datetime('us')])

shape: (1, 1)
┌──────────────┐
│ a            │
│ ---          │
│ datetime[μs] │
╞══════════════╡
│ null         │
└──────────────┘

It would be neat if polars could include a check for the integer format in its parser. This would be a big improvement for automating CSV processing.

DrMaphuse · Aug 04 '22 09:08

This is not something we will do. IMO the csv parser should parse the data as is. The subsequent DataFrame operations are for cleaning/casting/preparing.

And as your example shows, we provide sufficient tools to handle your data as needed.

ritchie46 · Aug 04 '22 10:08

IMO the csv parser should parse the data as is.

I don't see this as a contradiction. Timestamps are de facto integers in Arrow:

Arrow timestamps are stored as a 64-bit integer with column metadata to associate a time unit (e.g. milliseconds, microseconds, or nanoseconds), and an optional time zone.

I assume that this is true for polars as well?

The integer format seems closer to "as is" than a strftime formatted string. We're just telling the parser from the get-go that this column should be represented as timestamps.
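
For reference, a minimal sketch checking this (it assumes `Series.to_physical`, which returns the underlying storage values):

import polars as pl

# A Datetime column is physically an Int64 of epoch offsets in the given unit.
s = pl.Series("a", [1659606994266166]).cast(pl.Datetime("us"))
print(s)                # 2022-08-04 09:56:34.266166
print(s.to_physical())  # 1659606994266166 (i64)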

From another perspective: the reason for the feature request was the need to process files containing such timestamps automatically, without knowing in advance whether to expect an integer or a formatted string. Currently this is only possible via complex workarounds; one possible shape is sketched below.
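
As a hedged sketch (the helper name and the microsecond assumption are mine, not a polars API; `str.to_datetime` assumes a reasonably recent polars version), one such workaround dispatches on the inferred dtype after reading:

import polars as pl

def cast_timestamp(df: pl.DataFrame, col: str) -> pl.DataFrame:
    # Hypothetical helper: normalise a column to Datetime("us") whether the
    # file stored it as an integer epoch (assumed microseconds) or a string.
    if df.schema[col] in (pl.Int32, pl.Int64):
        return df.with_columns(pl.col(col).cast(pl.Datetime("us")))
    return df.with_columns(pl.col(col).str.to_datetime(time_unit="us"))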

DrMaphuse · Aug 04 '22 11:08

The integer format seems closer to "as is" than a strftime formatted string. We're just telling the parser from the get-go that this column should be represented as timestamps.

What you are requesting adds a lot of extra complexity to the parsers, and I am concerned about where this stops.

You are basically pushing the compute engine into the parser, as we would still need to parse the values as integers and then cast them to Datetime.

The workaround doesn't have to be complex right?

import polars as pl

csv = """a,b,c
2022-1-1,1000,a
2022-2-1,2000,a
"""

schema = {
    "a": pl.Datetime("us"),
    "b": pl.Datetime("us"),
    "c": pl.Utf8,
}

df = pl.read_csv(csv.encode(), parse_dates=True)

df.with_columns([
    pl.col(name).cast(dt) for name, dt in schema.items()
])
shape: (2, 3)
┌─────────────────────┬─────────────────────────┬─────┐
│ a                   ┆ b                       ┆ c   │
│ ---                 ┆ ---                     ┆ --- │
│ datetime[μs]        ┆ datetime[μs]            ┆ str │
╞═════════════════════╪═════════════════════════╪═════╡
│ 2022-01-01 00:00:00 ┆ 1970-01-01 00:00:00.001 ┆ a   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2022-02-01 00:00:00 ┆ 1970-01-01 00:00:00.002 ┆ a   │
└─────────────────────┴─────────────────────────┴─────┘

ritchie46 · Aug 04 '22 11:08

Thanks for the thorough response! I totally understand that it doesn't make sense if it adds too much complexity.

The suggested workaround is neat, but it relies on schema inference, which for the data I'm working with often fails or requires reading the whole file (I often get string values after tens of millions of numerical / empty rows). The full-read variant is sketched below.
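
For reference, full-file inference can be forced via `read_csv`'s `infer_schema_length` parameter (passing `None` scans the entire file, which is exactly the expensive read mentioned above; the file path is hypothetical):

import polars as pl

# Scan the whole file before fixing dtypes: robust, but costs a full read.
df = pl.read_csv("data.csv", infer_schema_length=None)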

The (rather manual) approach I'm currently taking is this:

import polars as pl

csv = """a,b,c
2022-1-1,1000,a
2022-2-1,2000,a
"""

schema = {
    "a": pl.Date,
    "b": pl.Int64,
    "c": pl.Utf8,
}

timestamp_cols = ["b"]

df = pl.read_csv(csv.encode(), dtypes=schema)

df.with_columns([
    pl.col(name).cast(pl.Datetime("us")) for name in timestamp_cols
])

But since I'm working with hundreds of constantly changing columns, this involves a lot of extra work (one way to reduce it is sketched below).
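
One way to reduce the bookkeeping, as a sketch (the `_ts` naming convention is a hypothetical assumption, not anything polars provides), is to derive `timestamp_cols` from the schema instead of maintaining it by hand:

import polars as pl

df = pl.DataFrame({"created_ts": [1659606994266166], "name": ["x"]})

# Hypothetical convention: Int64 columns ending in "_ts" hold microsecond epochs.
timestamp_cols = [
    name
    for name, dtype in df.schema.items()
    if dtype == pl.Int64 and name.endswith("_ts")
]
df = df.with_columns(pl.col(timestamp_cols).cast(pl.Datetime("us")))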

DrMaphuse · Aug 04 '22 16:08

Related PR: #8134

@ritchie46, I read what you said on Aug 4, but I think it is incomplete. Let me explain:

In csv.rs there is a test, test_with_dtype. There, you state that the dtype for column "b" is Datetime("ns"). The test verifies that the dtype is set correctly; however, this leaves the impression that the behaviour is supported, while it is not: the column's dtype is indeed Datetime, but the values are all null.

What I think is also illogical is that if you read the csv file without dtypes, it gives you an i64 column, which you can then cast to a datetime column without any complications (see the sketch below).

So why can't we do this directly with the dtypes parameter? It seems illogical...
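
For concreteness, a minimal sketch of the two-step path that does work today (inference to i64, then an explicit cast):

import polars as pl
from io import StringIO

csv = StringIO("a\n1659606994266166\n")

# "a" is inferred as i64 and then cast explicitly -- no complications.
df = pl.read_csv(csv).with_columns(pl.col("a").cast(pl.Datetime("us")))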

I was trying something in this branch here: https://github.com/pola-rs/polars/compare/main...svaningelgem:polars:make-csv-understand-i64-directly?expand=1 (not cleaned up yet -- I just wanted to know whether this is the correct way to go, or whether I should abandon the idea altogether).

svaningelgem · Oct 12 '23 06:10