[Feature request] `coerce_number_to_str`-like optional flag to handle known datatype inconsistencies while reading data

Open DataEnggNerd opened this issue 1 year ago • 5 comments

When fetching data with find_polars_all, find_pandas_all, or find_arrow_all from pymongoarrow.api, the schema is inferred from the first document. If the same key has a different datatype in a later document, that value is read as null.

MongoDB documents

[
    {
        "name": "test",
        "code": "1"
    },
    {
        "name": "test",
        "code": 1
    }
]

Current implementation

from pymongoarrow.api import find_polars_all

query_result_df = find_polars_all(
    collection=client,
    query=query
)
query_result_df
# Schema([('_id', Binary), ('name', String), ('code', String)]), Shape ==> (2, 3)
# shape: (2, 3)
# ┌─────────────────────────────────┬──────┬──────┐
# │ _id                             ┆ name ┆ code │
# │ ---                             ┆ ---  ┆ ---  │
# │ binary                          ┆ str  ┆ str  │
# ╞═════════════════════════════════╪══════╪══════╡
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1    │
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ null │
# └─────────────────────────────────┴──────┴──────┘
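The result above can be reproduced with a toy model in plain Python. This is only a sketch of the behaviour described in this issue, not pymongoarrow's actual inference code; the helper name `infer_and_read` is made up for illustration.

```python
def infer_and_read(docs):
    """Toy model: fix each field's type from the first document,
    then replace any later value of a different type with None."""
    if not docs:
        return []
    # "Schema" inferred from the first document only.
    schema = {key: type(value) for key, value in docs[0].items()}
    rows = []
    for doc in docs:
        row = {}
        for key, expected in schema.items():
            value = doc.get(key)
            # A value whose type does not match the inferred one becomes null.
            row[key] = value if type(value) is expected else None
        rows.append(row)
    return rows


docs = [
    {"name": "test", "code": "1"},
    {"name": "test", "code": 1},
]
rows = infer_and_read(docs)
print(rows)
```

The second document's integer `code` ends up as `None`, which matches the `null` in the table above.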

For known discrepancies like this, where the first document holds a string (pyarrow.string()) and later documents hold integers (pyarrow.int*()), the field could be read as pyarrow.string() throughout by adding an optional coerce_number_to_str parameter to all find_* APIs.

Expected implementation

from pymongoarrow.api import find_polars_all

query_result_df = find_polars_all(
    collection=client,
    query=query,
    coerce_number_to_str=True  # proposed flag, does not exist yet
)
query_result_df
# Schema([('_id', Binary), ('name', String), ('code', String)]), Shape ==> (2, 3)
# shape: (2, 3)
# ┌─────────────────────────────────┬──────┬──────┐
# │ _id                             ┆ name ┆ code │
# │ ---                             ┆ ---  ┆ ---  │
# │ binary                          ┆ str  ┆ str  │
# ╞═════════════════════════════════╪══════╪══════╡
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1    │
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1    │
# └─────────────────────────────────┴──────┴──────┘

Reference - coerce_numbers_to_str in https://docs.pydantic.dev/latest/api/fields/#pydantic.fields.Field
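To make the requested semantics concrete, here is a minimal pure-Python sketch of what such a flag might do to each document before the Arrow conversion. The function names and the `fields` argument are assumptions for illustration only; nothing like this exists in pymongoarrow today.

```python
from numbers import Number


def coerce_number_to_str(value):
    """Return numbers as strings and leave every other value untouched."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, Number) and not isinstance(value, bool):
        return str(value)
    return value


def coerce_docs(docs, fields):
    """Apply the coercion only to the named top-level fields (hypothetical helper)."""
    return [
        {k: coerce_number_to_str(v) if k in fields else v for k, v in doc.items()}
        for doc in docs
    ]


docs = [
    {"name": "test", "code": "1"},
    {"name": "test", "code": 1},
]
coerced = coerce_docs(docs, fields={"code"})
print(coerced)
```

With both `code` values now strings, first-document inference would see a consistent type for that field.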

DataEnggNerd avatar Oct 01 '24 14:10 DataEnggNerd

Thank you! Tracking in JIRA https://jira.mongodb.org/browse/ARROW-252

aclark4life avatar Oct 01 '24 19:10 aclark4life

@aclark4life I have seen the comment on the attached Jira ticket. Shall we discuss the proposed change here?

DataEnggNerd avatar Oct 22 '24 07:10 DataEnggNerd

> @aclark4life I have seen the comment on the attached Jira ticket. Shall we discuss the proposed change here?

Yes! Are you able to send a PR with the proposed changes?

aclark4life avatar Oct 22 '24 16:10 aclark4life

@aclark4life I would like to discuss the design before getting into the implementation. In Jira I saw a suggestion to add a new data type, which I am fine with. But with such an implementation, a schema is expected to be passed just for such a field. And how would the schema be passed for nested keys?

Any help is appreciated.

DataEnggNerd avatar Oct 24 '24 07:10 DataEnggNerd

No problem! Does this help at all? https://mongo-arrow.readthedocs.io/en/1.3.0/schemas.html#nested-data-with-schema I believe we're in agreement that we could support adding a new field type StrToIntField or IntToStrField as @ShaneHarvey suggested.
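On the nested-keys question, the sketch below assumes, as the linked page suggests, that a schema for embedded documents is written as nested dicts. It turns dotted key paths into that nested shape. The helper `nest_schema` and the `details.*` keys are hypothetical, and `StrToIntField` / `IntToStrField` do not exist yet, so plain `str` stands in for them here.

```python
def nest_schema(flat):
    """Turn {"a.b.c": type} style paths into nested dicts: {"a": {"b": {"c": type}}}."""
    nested = {}
    for path, typ in flat.items():
        *parents, leaf = path.split(".")
        node = nested
        for part in parents:
            # Create the intermediate level on first use, reuse it afterwards.
            node = node.setdefault(part, {})
        node[leaf] = typ
    return nested


schema_dict = nest_schema({
    "name": str,
    "details.code": str,            # the field with mixed str/int values
    "details.origin.country": str,
})
print(schema_dict)
```

Under that assumption, the resulting dict would be wrapped in `pymongoarrow.api.Schema` and passed through the `schema=` argument of the find_* functions.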

aclark4life avatar Oct 25 '24 16:10 aclark4life