polars icon indicating copy to clipboard operation
polars copied to clipboard

feat(python,rust): Typed JsonPath implementation

Open cjermain opened this issue 2 years ago • 9 comments

This PR resolves #3373. The PR is currently a work-in-progress, based on upstream changes in arrow2 (https://github.com/jorgecarleitao/arrow2/pull/989). The Cargo.toml files have been modified for now to use the latest version from a fork, which will be reverted once the upstream PR merges. This needs some additional work to expose the feature to Python. The Rust pieces are working correctly, although there is certainly room for better testing and documentation. A new example in examples/json_path shows off what can be done at present.

cjermain avatar May 17 '22 00:05 cjermain

Codecov Report

Merging #3413 (e1876e9) into master (2243ecf) will increase coverage by 0.11%. The diff coverage is 33.33%.

@@            Coverage Diff             @@
##           master    #3413      +/-   ##
==========================================
+ Coverage   77.41%   77.52%   +0.11%     
==========================================
  Files         402      408       +6     
  Lines       69508    69960     +452     
==========================================
+ Hits        53808    54237     +429     
- Misses      15700    15723      +23     
Impacted Files Coverage Δ
...polars-core/src/chunked_array/strings/json_path.rs 21.66% <0.00%> (-54.81%) :arrow_down:
py-polars/polars/__init__.py 100.00% <ø> (ø)
py-polars/src/series.rs 75.00% <0.00%> (-1.69%) :arrow_down:
py-polars/src/lib.rs 93.51% <16.66%> (-1.27%) :arrow_down:
py-polars/polars/internals/series.py 95.27% <50.00%> (-0.39%) :arrow_down:
py-polars/src/conversion.rs 78.27% <64.28%> (-2.72%) :arrow_down:
py-polars/polars/datatypes.py 92.00% <72.22%> (-2.65%) :arrow_down:
polars/polars-core/src/datatypes.rs 69.96% <100.00%> (+0.19%) :arrow_up:
polars/polars-lazy/src/logical_plan/mod.rs 97.59% <0.00%> (-0.61%) :arrow_down:
polars/polars-lazy/src/logical_plan/aexpr.rs 80.72% <0.00%> (-0.61%) :arrow_down:
... and 42 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 2243ecf...e1876e9. Read the comment docs.

codecov-commenter avatar May 17 '22 00:05 codecov-commenter

I'm interested in feedback on the function names. Here is a summary (omitting Option/Result as they are handled slightly differently in Py/Rs):

  • json_infer(number_of_rows) -> DataType - infers the Arrow schema for JSON values, 1 schema for all rows, up to number_of_rows
  • json_extract(dtype) -> Series - deserializes JSON values per row using a common schema
  • json_path_select(path) -> Uft8Chunked - selects JSON sub-sections using a JsonPath and returns a full string representation
  • json_path_extract(path, dtype) -> Series - deserializes JSON values after selection using a JsonPath

My thought is to deprecate json_path_match and keep it's behavior consistent to avoid introducing issues. I tried to get the above naming to be self-consistent, and preferred extract over deserialize for brevity.

@ritchie46, what do you think about the function names?

I'm still working on the Lazy/Expr API for this feature, and adding tests. The PR currently works end-to-end for these operations directly on a Series.

cjermain avatar May 23 '22 11:05 cjermain

One alternative to consider for the above function names is using load instead of extract => json_load and json_path_load. This is more consistent with the json.loads function in Python.

cjermain avatar May 24 '22 11:05 cjermain

One alternative to consider for the above function names is using load instead of extract => json_load and json_path_load. This is more consistent with the json.loads function in Python.

as someone coming from a JS background, i've always found the python loads & dumps methods very unintuitive. I think extract or parse are much more self-explanitory

universalmind303 avatar May 25 '22 16:05 universalmind303

I'm wondering, what is the benefit of exposing the json_infer method? could we not do something similar to the read_* implementations. & take a schema or an infer_count perhaps a signature of

json_extract(schema: Optional[DataType], infer_schema_length: Optional[int])

universalmind303 avatar May 25 '22 16:05 universalmind303

as someone coming from a JS background, i've always found the python loads & dumps methods very unintuitive. I think extract or parse are much more self-explanitory

Thanks. As someone who mostly develops in Python, I also find the naming unintuitive. I'll leave it as extract unless there is further feedback.

cjermain avatar May 26 '22 11:05 cjermain

I'm wondering, what is the benefit of exposing the json_infer method? could we not do something similar to the read_* implementations. & take a schema or an infer_count perhaps a signature of

json_extract(schema: Optional[DataType], infer_schema_length: Optional[int])

Exposing thejson_infer method allows for exploration and performance use cases. The json_path_extract function uses all N rows to infer the dtype by default, which is most likely to work and is generally useful. However json_infer allows you to get the dtype for a limited number of rows -- calling that manually first allows you to pass in the optional dtype to json_path_extract or json_extract. I think the current API gives broad convenience, while allowing more specific use cases.

cjermain avatar May 26 '22 11:05 cjermain

I'm getting back to this PR after some time. It needs updates in arrow2 and json_deserialize to fix jorgecarleitao/json-deserializer#13. There are also conflicts from the dispatch changes in #4423 that need to be addressed.

cjermain avatar Sep 25 '22 00:09 cjermain

I've broken out the initial Rust functions into a different PR (#5140) to make this easier to review.

cjermain avatar Oct 07 '22 23:10 cjermain

Superseded by https://github.com/pola-rs/polars/pull/5140 and https://github.com/pola-rs/polars/pull/6885 (both of which have been merged).

alexander-beedie avatar Mar 13 '23 09:03 alexander-beedie