mongo-arrow: [Feature request] `coerce_number_to_str`-like optional flag while reading data to handle known datatype inconsistencies

When fetching data with find_polars_all, find_pandas_all, or find_arrow_all from pymongoarrow.api, the schema is inferred from the first document. If the same key holds a different datatype in a later document, that value is read as null.

MongoDB documents

[
    {
        "name": "test",
        "code": "1"
    },
    {
        "name": "test",
        "code": 1
    }
]

Current implementation

from pymongoarrow.api import find_polars_all

query_result_df = find_polars_all(
    collection=client,
    query=query,
)
query_result_df
# Schema([('_id', Binary), ('name', String), ('code', String)]), Shape ==> (2, 3)
# shape: (2, 3)
# ┌─────────────────────────────────┬──────┬──────┐
# │ _id                             ┆ name ┆ code │
# │ ---                             ┆ ---  ┆ ---  │
# │ binary                          ┆ str  ┆ str  │
# ╞═════════════════════════════════╪══════╪══════╡
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1    │
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ null │
# └─────────────────────────────────┴──────┴──────┘
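
The behaviour described above can be modelled without a database. The sketch below is a simplified toy model, not pymongoarrow's actual code: the helper name read_with_first_doc_schema is hypothetical, and it only mimics the reported effect (types taken from the first document, mismatching values turned into null).

```python
def read_with_first_doc_schema(docs):
    """Toy model: take each field's type from the first document and
    replace values of any other type with None (null)."""
    if not docs:
        return []
    # Field name -> Python type observed in the first document.
    schema = {key: type(value) for key, value in docs[0].items()}
    rows = []
    for doc in docs:
        row = {}
        for key, expected in schema.items():
            value = doc.get(key)
            # A value whose type differs from the inferred one becomes null.
            row[key] = value if type(value) is expected else None
        rows.append(row)
    return rows


docs = [
    {"name": "test", "code": "1"},
    {"name": "test", "code": 1},
]
rows = read_with_first_doc_schema(docs)
print(rows)
```

In this model the second document's integer code is dropped to None, which matches the null shown in the output above.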

For such known discrepancies, where the first document holds a pyarrow.string() value and subsequent documents hold pyarrow.int*() values, the field could be read as pyarrow.string() throughout by adding an optional parameter coerce_number_to_str to all find_* APIs.

Expected implementation

from pymongoarrow.api import find_polars_all

query_result_df = find_polars_all(
    collection=client,
    query=query,
    coerce_number_to_str=True,
)
query_result_df
# Schema([('_id', Binary), ('name', String), ('code', String)]), Shape ==> (2, 3)
# shape: (2, 3)
# ┌─────────────────────────────────┬──────┬──────┐
# │ _id                             ┆ name ┆ code │
# │ ---                             ┆ ---  ┆ ---  │
# │ binary                          ┆ str  ┆ str  │
# ╞═════════════════════════════════╪══════╪══════╡
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1    │
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1    │
# └─────────────────────────────────┴──────┴──────┘
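
Until such a flag exists, a possible workaround is to do the conversion on the server with MongoDB's $toString aggregation operator and read the result through one of the aggregate_* functions. The sketch below only builds the pipeline; the helper name build_to_string_pipeline and the example query {"name": "test"} are assumptions for illustration, and the commented-out call assumes aggregate_polars_all is available in the installed pymongoarrow version.

```python
def build_to_string_pipeline(query, fields):
    """Return an aggregation pipeline that filters with `query` and then
    rewrites each listed field as a string using $toString."""
    return [
        {"$match": query},
        {"$addFields": {name: {"$toString": f"${name}"} for name in fields}},
    ]


# Example query is an assumption; the original `query` is not shown.
pipeline = build_to_string_pipeline({"name": "test"}, ["code"])
print(pipeline)

# With a live collection this could be passed to pymongoarrow, e.g.:
# from pymongoarrow.api import aggregate_polars_all
# df = aggregate_polars_all(collection, pipeline)
```

This keeps the client code unchanged apart from switching from a find_* call to an aggregate_* call, at the cost of listing the affected fields explicitly.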

Reference - coerce_numbers_to_str in https://docs.pydantic.dev/latest/api/fields/#pydantic.fields.Field

Oct 01 '24 14:10 DataEnggNerd

Thank you! Tracking in JIRA https://jira.mongodb.org/browse/ARROW-252

Oct 01 '24 19:10 aclark4life

@aclark4life I have seen the comment on the attached Jira ticket. Shall we discuss the proposed change here?

Oct 22 '24 07:10 DataEnggNerd

Yes! Are you able to send a PR with the proposed changes?

Oct 22 '24 16:10 aclark4life

@aclark4life I would like to discuss the design before getting into implementation. In Jira I saw a suggestion for a new data type, which I am fine with. However, with such an implementation, I would expect to pass a schema only for the affected field. And how would a schema be passed for nested keys?

Any help is appreciated.

Oct 24 '24 07:10 DataEnggNerd

No problem! Does this help at all? https://mongo-arrow.readthedocs.io/en/1.3.0/schemas.html#nested-data-with-schema I believe we're in agreement that we could support adding a new field type StrToIntField or IntToStrField as @ShaneHarvey suggested.
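
To make the nested-key question concrete, here is a minimal pure-Python sketch of how a per-field coercion type could be applied through a schema that mirrors the document shape. Everything in it is hypothetical: the IntToStr marker class, the apply_coercions helper, and the address/zip/city field names are illustrations only, not pymongoarrow API.

```python
class IntToStr:
    """Hypothetical marker: coerce int values at this key to str."""


def apply_coercions(doc, schema):
    """Return a copy of `doc` with the IntToStr markers in `schema` applied.
    `schema` mirrors the document shape; nested dicts describe sub-documents."""
    result = dict(doc)
    for key, rule in schema.items():
        if key not in doc:
            continue
        value = doc[key]
        if isinstance(rule, dict) and isinstance(value, dict):
            # Recurse into the embedded document.
            result[key] = apply_coercions(value, rule)
        elif rule is IntToStr and isinstance(value, int) and not isinstance(value, bool):
            # bool is a subclass of int in Python, so it is excluded explicitly.
            result[key] = str(value)
    return result


# Only the fields that need coercion are listed; others pass through untouched.
schema = {"code": IntToStr, "address": {"zip": IntToStr}}
doc = {"name": "test", "code": 1, "address": {"zip": 12345, "city": "X"}}
converted = apply_coercions(doc, schema)
print(converted)
```

The schema here names only the fields to coerce, including one inside an embedded document, which is the partial-schema behaviour asked about above.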

Oct 25 '24 16:10 aclark4life
