
RFC: provide explicit functions for json operations

Open universalmind303 opened this issue 2 years ago • 10 comments

Describe your feature request

There are 3 distinct types of JSON operations we are using within the Python APIs:

  • Serde
  • JSON
  • NDJSON

I propose we explicitly separate the logic for these.

As the inverse of the discussion in #3900 around describe_optimized_plan: the JSON operations are too ambiguous to easily understand what each function is doing, and different parameters can greatly change a function's behavior. I think we should split these out into more explicit function calls to make the code more readable.

The serde JSON operations should be moved under a .serialize() method, NDJSON operations moved to *_json_lines, and the most common row-oriented JSON should remain under *_json().

refactor:

  • LazyFrame.write_json -> LazyFrame.serialize(format='json')
  • LazyFrame.read_json -> LazyFrame.deserialize(format='json')
  • LazyFrame.scan_json -> LazyFrame.scan_json_lines
  • DataFrame.write_json() -> DataFrame.serialize(format='json')
  • DataFrame.write_json(row_oriented=True) -> DataFrame.write_json()
  • DataFrame.write_json(json_lines=True) -> DataFrame.write_json_lines() (alternatively write_ndjson)

A .serialize method could also be used to support many other common serde formats (bincode, msgpack, etc.):

  • .serialize(format='json')
  • .serialize(format='msgpack')
  • .serialize(format='bincode')
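
For illustration, usage under the proposed names might look like this (a hypothetical sketch of the API in this RFC, not an existing polars release):

import polars as pl

df = pl.DataFrame({"foo": [1, 2, 3]})
df.write_json()                       # row-oriented JSON: the common case
df.write_json_lines()                 # NDJSON / newline-delimited JSON
df.serialize(format="json")           # serde of the frame itself
pl.scan_csv("data.csv").serialize(format="json")  # serde of the logical plan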

universalmind303 avatar Aug 13 '22 18:08 universalmind303

I think that makes sense. I do want to add to the docstrings that serialize(format="json") should be preferred if performance is important. We serialize column-oriented, so that should be much, much faster.
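
For illustration (shapes inferred from the plan examples later in this thread, not verbatim output), column-oriented vs. row-oriented JSON for the same data:

column-oriented: {"columns": [{"name": "foo", "datatype": "Int64", "values": [1, 2, 3]}]}
row-oriented:    [{"foo": 1}, {"foo": 2}, {"foo": 3}]

Since polars stores data column by column, the column-oriented form can be written out directly, without pivoting into rows first.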

Shall we rename json_lines to ndjson?

ritchie46 avatar Aug 15 '22 07:08 ritchie46

Shall we rename json_lines to ndjson?

I think both terms are interchangeable; just a matter of preference, I guess.

universalmind303 avatar Aug 15 '22 07:08 universalmind303

Regarding handling JSON and NDJSON in separate functions:

I don't have a real sense of how 'similar' JSON and NDJSON are. I know pandas handles this as read_json(lines=True). We do the same for reading eagerly, but in LazyFrame we have separate functions for JSON/NDJSON. We should at least be internally consistent. Your proposal to handle them separately seems sensible to me. I am working on implementing this in my PR here: #4721 .
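
For reference, the two formats differ mainly in framing: NDJSON (newline-delimited JSON) stores one JSON object per line, while plain JSON wraps all records in a single document:

JSON:   [{"foo": 1, "bar": "a"}, {"foo": 2, "bar": "b"}]
NDJSON: {"foo": 1, "bar": "a"}
        {"foo": 2, "bar": "b"}

The NDJSON framing means records can be read one line at a time, without parsing the whole document first.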

Regarding serialize/deserialize: I don't think I understand memory IO mechanics well enough to have a strong opinion here. In the code base currently, it seems we call this from_x/to_x instead of deserialize/serialize. I think that convention is generally OK. Using your proposal would create an umbrella function for all these, which could be nice. There are pros and cons. I don't feel very strongly about this.

I do think it's interesting that for JSON/NDJSON, you are proposing to 'break up' an umbrella function into separate functions, where for the other part you're proposing to create an umbrella function to incorporate various Serde actions. That seems inconsistent.

stinodego avatar Sep 04 '22 16:09 stinodego

We should separate them. There will be a scan_ndjson, but no scan_json. The NDJSON format is suited to our optimizations, whereas JSON is not: we need to scan the whole JSON document to resolve its structure.

Another reason I think we should break them up is that they are two different formats. I'd like the explicitness of that.
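
A sketch of what scan_ndjson could enable (hypothetical at the time of this comment; the exact signature may differ): because every NDJSON line is a self-contained record, the reader can stream the file and push projections and predicates into the scan.

import polars as pl

lf = pl.scan_ndjson("data.ndjson")   # lazy: nothing is read yet
df = (
    lf.filter(pl.col("foo") > 1)     # predicate pushdown into the scan
      .select("foo")                 # projection pushdown: skip unused fields
      .collect()
)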

I do think it's interesting that for JSON/NDJSON, you are proposing to 'break up' an umbrella function into separate functions, where for the other part you're proposing to create an umbrella function to incorporate various Serde actions. That seems inconsistent.

Yes, I agree we can make these explicit functions, even though it boils down to using the Serde architecture internally. For the python users this should be an implementation detail.

ritchie46 avatar Sep 04 '22 17:09 ritchie46

If I understand correctly, LazyFrame.from_json and LazyFrame.to_json are not really lazy operations. Should these methods even exist? The following API would be consistent with the rest of the code base:

File input:

  • pl.read_json -> Read JSON eagerly, returns DataFrame (user must call .lazy() to proceed in lazy mode)
  • pl.read_ndjson -> Read NDJSON eagerly, returns DataFrame
  • pl.scan_ndjson -> Read NDJSON lazily, returns LazyFrame

File output (user must first call .collect() if in lazy mode):

  • DataFrame.write_json -> Write to JSON file.
  • DataFrame.write_ndjson -> Write to NDJSON file.
    • If I understand correctly, we have no way to write NDJSON lazily without collecting first. If we do, we would need a LazyFrame.write_ndjson too.

Object input (user must call .lazy() to proceed in lazy mode):

  • DataFrame(data=str) -> New DataFrame constructor logic. If input data is a string, we will try to parse it as JSON.

Object output (user must first call .collect() if in lazy mode):

  • DataFrame.to_json -> Convert to JSON string.
    • Currently, this is implemented as DataFrame.write_json(file=None). That is consistent with DataFrame.to_csv. I think that's quite nice. DataFrame.to_json could simply be an alias for this.
  • DataFrame.to_ndjson -> Convert to NDJSON string.
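
Put together, that surface would look roughly like this (a sketch using the names above; not all of them existed at the time):

import polars as pl

df = pl.read_json("data.json")      # eager JSON read -> DataFrame
df = pl.read_ndjson("data.ndjson")  # eager NDJSON read -> DataFrame
lf = pl.scan_ndjson("data.ndjson")  # lazy NDJSON read -> LazyFrame

df.write_json("out.json")           # write a JSON file
df.write_ndjson("out.ndjson")       # write an NDJSON file

s = df.to_json()                    # JSON string, i.e. write_json(file=None)
s = df.to_ndjson()                  # NDJSON string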

Thoughts?

stinodego avatar Sep 05 '22 01:09 stinodego

I think those two operations are definitely valuable. I agree that they are not 'technically' lazy operations. That was my initial reasoning behind the de/serialize naming convention: it made it clear that it was constructing the LazyFrame from a serialized JSON representation.

I'm not super familiar with the Python dunder methods, but if there is something we could leverage there that works with json.dumps / json.loads, maybe we could go that route instead?
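
For what it's worth: json.dumps has no dunder it discovers automatically (unlike pickle's __reduce__/__getstate__), but its default= hook could route through an explicit serialize method. A hypothetical sketch, assuming a serialize() that returns the plan as a JSON string:

import json
import polars as pl

def encode_lazyframe(obj):
    # Fallback invoked by json.dumps for objects it cannot encode natively.
    if isinstance(obj, pl.LazyFrame):
        return json.loads(obj.serialize())  # hypothetical serialize() -> JSON str
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")

payload = json.dumps({"plan": pl.scan_csv("data.csv")}, default=encode_lazyframe)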

universalmind303 avatar Sep 05 '22 18:09 universalmind303

I think those two operations are definitely valuable. I agree that they are not 'technically' lazy operations.

Then wouldn't you agree it makes sense to construct a DataFrame instead, and call .lazy() on it? Or to call .collect() and then write?

The LazyFrame isn't set up to be constructed directly. Currently, LazyFrame.from_json is the ONLY reason to import LazyFrame directly (that I can see). This is very inconsistent with the rest of the code.

stinodego avatar Sep 05 '22 18:09 stinodego

The LazyFrame JSON methods provide functionality that doesn't exist in eager mode: they allow us to instantiate an in-memory LazyFrame from a plan, instead of from an eager dataset.

With the to/from JSON methods, we can pass off the logical plan to any destination. For example, you could use this functionality to execute LazyFrames on a remote server: construct the LazyFrame on your client, then pass the serialized frame to the server for execution. There are many potential use cases that this functionality enables.

The lazy representation of an eager frame (via df.lazy()) contains the entirety of the serialized DataFrame, which obviously does not scale well if you are trying to send it via HTTP or other protocols. The lazy representation of something created via scan_* is much more lightweight, though.

import polars as pl
eager_lf = pl.DataFrame({
  'foo': [1, 2, 3, 4, 5],
  'bar': [5, 4, 3, 2, 1]
}).lazy()

>>> eager_lf.write_json()
{
  "DataFrameScan": {
    "df": {
      "columns": [
        { "name": "foo", "datatype": "Int64", "values": [1, 2, 3, 4, 5] },
        { "name": "bar", "datatype": "Int64", "values": [5, 4, 3, 2, 1] }
      ]
    },
    "schema": { "inner": { "foo": "Int64", "bar": "Int64" } },
    "projection": null,
    "selection": null
  }
}

Serializing a LazyFrame that is not constructed via .lazy() gives us a JSON payload that contains only the instructions, and would be easily shareable across servers/platforms/languages.

>>> pl.scan_csv('./data.csv').write_json()
{
  "CsvScan": {
    "path": "./data.csv",
    "schema": { "inner": { "foo": "Int64", "bar": "Utf8" } },
    "options": {
      "delimiter": 44,
      "comment_char": null,
      "quote_char": 34,
      "eol_char": 10,
      "has_header": true,
      "skip_rows": 0,
      "n_rows": null,
      "with_columns": null,
      "low_memory": false,
      "ignore_errors": false,
      "cache": true,
      "null_values": null,
      "rechunk": true,
      "encoding": "Utf8",
      "row_count": null,
      "parse_dates": false,
      "file_counter": 0
    },
    "predicate": null,
    "aggregate": []
  }
}

The potential applications you could build on top of this functionality are quite powerful, and I don't think it is something that should be removed. I do agree that it is inconsistent with the rest of the code. Perhaps there is an alternative solution that provides the same functionality and is consistent with the codebase?

universalmind303 avatar Sep 05 '22 22:09 universalmind303

I agree with @universalmind303 here about the utility and huge potential of serializing the logical plan. This would allow us to scale up by running polars on Ray or Dask.

But I think we should utilize the Python architecture for serialization here. What I mean is that read_*, scan_*, and write_* are for file formats that hold tabular data.

For serializing the logical plan, I'd like pickle.dump/load and json.dump/load to work as expected, where pickle chooses a fast serialization format like bincode.
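
A minimal sketch of how that could be wired via the __getstate__/__setstate__ protocol, using a toy stand-in class (not the actual polars implementation; a real version would defer to serde on the Rust side and could pick bincode for pickle):

import json
import pickle

class LazyFrameSketch:
    """Toy stand-in: holds a logical plan as a plain dict."""

    def __init__(self, plan: dict):
        self._plan = plan  # the plan, not the data

    def serialize(self) -> str:
        return json.dumps(self._plan)

    @classmethod
    def deserialize(cls, payload: str) -> "LazyFrameSketch":
        return cls(json.loads(payload))

    def __getstate__(self):
        # pickle stores only the serialized plan...
        return self.serialize()

    def __setstate__(self, state):
        # ...and rebuilds the object from it on load.
        self._plan = json.loads(state)

lf = LazyFrameSketch({"CsvScan": {"path": "./data.csv"}})
assert pickle.loads(pickle.dumps(lf))._plan == lf._plan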

ritchie46 avatar Sep 07 '22 07:09 ritchie46

Thanks for the explanations! I am starting to understand this whole thing a lot better now. I misunderstood what the LazyFrame methods actually did; they operate on logical plans, not on actual data.

That means the current docstrings and deprecation message are incorrect/confusing; we should fix those.

I'm thinking about the best way to handle this in the Python API, haven't figured it out yet...

stinodego avatar Sep 09 '22 03:09 stinodego

Functionality for reading/writing NDJSON has been split into separate functions as read_ndjson and DataFrame.write_ndjson. There is no scan_ndjson (yet).

But I think we should utilize the Python architecture for serialization here. What I mean is that read_*, scan_*, and write_* are for file formats that hold tabular data.

What I understand from this is that renaming LazyFrame.read/write_json to LazyFrame.deserialize/serialize as proposed would be appropriate, as the names read_* / scan_* / write_* are reserved for tabular data. A format argument could be added later if we support other formats.
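
Roughly (a hypothetical sketch of the renamed methods; the exact signatures may differ in the eventual PR):

import io
import polars as pl

lf = pl.scan_csv("data.csv")
plan = lf.serialize()                              # logical plan as a string
lf2 = pl.LazyFrame.deserialize(io.StringIO(plan))  # reconstruct the LazyFrame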

I'll open a PR for this change.

For serializing the logical plan, I'd like pickle.dump/load and json.dump/load to work as expected, where pickle chooses a fast serialization format like bincode.

We can pick this up as a separate issue.

stinodego avatar Aug 02 '23 05:08 stinodego