polars
polars copied to clipboard
feat(python,rust): Typed JsonPath implementation
This PR resolves #3373. The PR is currently a work-in-progress, based on upstream changes in arrow2 (https://github.com/jorgecarleitao/arrow2/pull/989). The Cargo.toml files have been modified for now to use the latest version from a fork, which will be reverted once the upstream PR merges. This needs some additional work to expose the feature to Python. The Rust pieces are working correctly, although there is certainly room for better testing and documentation. A new example in examples/json_path
shows off what can be done at present.
Codecov Report
Merging #3413 (e1876e9) into master (2243ecf) will increase coverage by
0.11%
. The diff coverage is33.33%
.
@@ Coverage Diff @@
## master #3413 +/- ##
==========================================
+ Coverage 77.41% 77.52% +0.11%
==========================================
Files 402 408 +6
Lines 69508 69960 +452
==========================================
+ Hits 53808 54237 +429
- Misses 15700 15723 +23
Impacted Files | Coverage Δ | |
---|---|---|
...polars-core/src/chunked_array/strings/json_path.rs | 21.66% <0.00%> (-54.81%) |
:arrow_down: |
py-polars/polars/__init__.py | 100.00% <ø> (ø) |
|
py-polars/src/series.rs | 75.00% <0.00%> (-1.69%) |
:arrow_down: |
py-polars/src/lib.rs | 93.51% <16.66%> (-1.27%) |
:arrow_down: |
py-polars/polars/internals/series.py | 95.27% <50.00%> (-0.39%) |
:arrow_down: |
py-polars/src/conversion.rs | 78.27% <64.28%> (-2.72%) |
:arrow_down: |
py-polars/polars/datatypes.py | 92.00% <72.22%> (-2.65%) |
:arrow_down: |
polars/polars-core/src/datatypes.rs | 69.96% <100.00%> (+0.19%) |
:arrow_up: |
polars/polars-lazy/src/logical_plan/mod.rs | 97.59% <0.00%> (-0.61%) |
:arrow_down: |
polars/polars-lazy/src/logical_plan/aexpr.rs | 80.72% <0.00%> (-0.61%) |
:arrow_down: |
... and 42 more |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 2243ecf...e1876e9. Read the comment docs.
I'm interested in feedback on the function names. Here is a summary (omitting Option/Result as they are handled slightly differently in Py/Rs):
-
json_infer(number_of_rows) -> DataType
- infers the Arrow schema for JSON values, 1 schema for all rows, up to number_of_rows -
json_extract(dtype) -> Series
- deserializes JSON values per row using a common schema -
json_path_select(path) -> Uft8Chunked
- selects JSON sub-sections using a JsonPath and returns a full string representation -
json_path_extract(path, dtype) -> Series
- deserializes JSON values after selection using a JsonPath
My thought is to deprecate json_path_match
and keep it's behavior consistent to avoid introducing issues. I tried to get the above naming to be self-consistent, and preferred extract
over deserialize
for brevity.
@ritchie46, what do you think about the function names?
I'm still working on the Lazy/Expr API for this feature, and adding tests. The PR currently works end-to-end for these operations directly on a Series.
One alternative to consider for the above function names is using load
instead of extract
=> json_load
and json_path_load
. This is more consistent with the json.loads
function in Python.
One alternative to consider for the above function names is using
load
instead ofextract
=>json_load
andjson_path_load
. This is more consistent with thejson.loads
function in Python.
as someone coming from a JS background, i've always found the python loads
& dumps
methods very unintuitive. I think extract
or parse
are much more self-explanitory
I'm wondering, what is the benefit of exposing the json_infer
method? could we not do something similar to the read_* implementations. & take a schema or an infer_count
perhaps a signature of
json_extract(schema: Optional[DataType], infer_schema_length: Optional[int])
as someone coming from a JS background, i've always found the python
loads
&dumps
methods very unintuitive. I thinkextract
orparse
are much more self-explanitory
Thanks. As someone who mostly develops in Python, I also find the naming unintuitive. I'll leave it as extract
unless there is further feedback.
I'm wondering, what is the benefit of exposing the
json_infer
method? could we not do something similar to the read_* implementations. & take a schema or an infer_count perhaps a signature of
json_extract(schema: Optional[DataType], infer_schema_length: Optional[int])
Exposing thejson_infer
method allows for exploration and performance use cases. The json_path_extract
function uses all N rows to infer the dtype by default, which is most likely to work and is generally useful. However json_infer
allows you to get the dtype for a limited number of rows -- calling that manually first allows you to pass in the optional dtype to json_path_extract
or json_extract
. I think the current API gives broad convenience, while allowing more specific use cases.
I'm getting back to this PR after some time. It needs updates in arrow2 and json_deserialize to fix jorgecarleitao/json-deserializer#13. There are also conflicts from the dispatch changes in #4423 that need to be addressed.
I've broken out the initial Rust functions into a different PR (#5140) to make this easier to review.
Superseded by https://github.com/pola-rs/polars/pull/5140 and https://github.com/pola-rs/polars/pull/6885 (both of which have been merged).