polars
Missing documentation for how various formats are turned into polars dataframes
Description
Polars has great support for lots of different formats, and it seems like it has picked some reasonable ways of turning those formats into dataframes. It would be great to document these choices, and the principles behind them, somewhere. One principle I've heard from others is that polars always losslessly converts various data types into its internal formats. This is a great principle that can answer many questions, but it still leaves some areas of murkiness that would be good to document. I think it would also be good, for the sake of explicitness, to document even trivial conversions, just so the user is clear (e.g., a JSON string being turned into a polars string). But here are some examples of questions that may have less obvious answers:
- Dealing with numbers in JSON is notoriously tricky, with no perfect answer for how to do it. RFC 7159 states that parsers need only convert them to doubles: "since software that implements IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide". However, some popular parsers can optionally represent them as 64-bit unsigned or signed integers. So, what has polars chosen here? The simplicity of making everything a double, or some more sophisticated logic, e.g., if a column contains no decimal points and its numbers are large enough to need a 64-bit unsigned integer, use unsigned integers for that column? If it has chosen doubles everywhere, will it crash if someone passes it a large unsigned 64-bit integer that can't be losslessly represented as a double? (See the sketch after this list for one way to probe this.)
- Polars' JSON guide doesn't mention the expected input structure of the JSON. For example, does it support the form `{"column1": [1, 2], "column2": ["a", "b"]}`? What about the form `[{"column1": 1, "column2": "a"}, {"column1": 2, "column2": "b"}]`?
- How does polars deal with `undefined` in JSON? Does it treat those values the same way as it treats `null`?
- When reading in a Parquet file (spec) with a TIMESTAMP, what happens when the TIMESTAMP is in nanoseconds rather than microseconds? Does it losslessly convert it to microseconds if possible? Or does it always crash? Does it behave differently when `isAdjustedToUTC` is true vs. false?
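
The JSON questions above can be probed empirically. Here is a minimal sketch (my own probe, not polars' documented behavior): it feeds small JSON payloads to `pl.read_json`, which accepts a bytes buffer, and prints either the inferred schema or the error. The exact results may vary by polars version.

```python
import io

import polars as pl

# Probe 1: what dtypes are inferred for JSON numbers, and does a
# u64-sized integer survive? (A raised error here would itself
# answer the "does it crash?" question.)
numbers = b'[{"small": 1, "huge": 18446744073709551615, "frac": 1.5}]'
try:
    print(pl.read_json(io.BytesIO(numbers)).schema)
except Exception as exc:
    print("large integer rejected:", exc)

# Probe 2: which top-level shapes are accepted -- column-oriented,
# row-oriented, or both?
shapes = {
    "column-oriented": b'{"column1": [1, 2], "column2": ["a", "b"]}',
    "row-oriented": b'[{"column1": 1, "column2": "a"}, {"column1": 2, "column2": "b"}]',
}
for label, payload in shapes.items():
    try:
        print(label, pl.read_json(io.BytesIO(payload)).schema)
    except Exception as exc:
        print(f"{label} form rejected:", exc)

# Probe 3: `undefined` is not valid JSON, so a parse error seems more
# likely than coercion to null -- but that's worth confirming.
try:
    print(pl.read_json(io.BytesIO(b'[{"column1": undefined}]')))
except Exception as exc:
    print("undefined rejected:", exc)
```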
What I would love to see, personally, is a table listing each data type of each supported format and how it gets mapped to a polars data type. This is not at all a criticism of how polars converts from input data types to polars data types; I just think it would be great to add more docs explaining it to newcomers like myself.
Link
No response
Interestingly, polars seems to handle timestamps that aren't losslessly convertible to microseconds. Here is a dataframe I wrote to a Parquet file with a timestamp value of 1 nanosecond:

```
>>> pl.read_parquet('a.parquet')
shape: (1, 1)
┌───────────────────────────────┐
│ datetime_ns                   │
│ ---                           │
│ datetime[ns]                  │
╞═══════════════════════════════╡
│ 1970-01-01 00:00:00.000000001 │
└───────────────────────────────┘
```
So, I wonder if there's an inaccuracy in https://docs.pola.rs/user-guide/concepts/data-types/overview/ when it describes Datetime as "internally represented as microseconds".
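
For reference, here is one way such a file can be produced. This is a sketch assuming pyarrow is installed; the column name `datetime_ns` and file name `a.parquet` match the snippet above:

```python
import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq

# Write a Parquet file whose single TIMESTAMP column has nanosecond
# precision and one value: 1 ns after the Unix epoch.
table = pa.table({"datetime_ns": pa.array([1], type=pa.timestamp("ns"))})
pq.write_table(table, "a.parquet")

# Reading it back shows whether polars kept nanosecond precision or
# converted to microseconds; per the output above, it reports a
# nanosecond Datetime rather than crashing or truncating.
print(pl.read_parquet("a.parquet").schema)
```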