mongo-arrow
[Feature request] `coerce_number_to_str`-like optional flag when reading data, to handle known datatype inconsistencies
When fetching data with find_polars_all, find_pandas_all, or find_arrow_all from pymongoarrow.api, the schema is inferred from the first document. If the same key has a different datatype in a later document, that value is read as null.
MongoDB documents
[
{
"name": "test",
"code": "1"
},
{
"name": "test",
"code": 1
}
]
Current implementation
from pymongoarrow.api import find_polars_all
query_result_df = find_polars_all(
    collection=client,
    query=query
)
query_result_df
# Schema([('_id', Binary), ('name', String), ('code', String)]), Shape ==> (2, 3)
# shape: (2, 3)
# ┌─────────────────────────────────┬──────┬──────┐
# │ _id ┆ name ┆ code │
# │ --- ┆ --- ┆ --- │
# │ binary ┆ str ┆ str │
# ╞═════════════════════════════════╪══════╪══════╡
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1 │
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ null │
# └─────────────────────────────────┴──────┴──────┘
For known discrepancies like this, where the first document holds a pyarrow.string() value and subsequent documents hold pyarrow.int*() values, the field could be read as pyarrow.string() throughout by adding an optional parameter coerce_number_to_str to all find_* APIs.
Expected implementation
from pymongoarrow.api import find_polars_all
query_result_df = find_polars_all(
    collection=client,
    query=query,
    coerce_number_to_str=True
)
query_result_df
# Schema([('_id', Binary), ('name', String), ('code', String)]), Shape ==> (2, 3)
# shape: (2, 3)
# ┌─────────────────────────────────┬──────┬──────┐
# │ _id ┆ name ┆ code │
# │ --- ┆ --- ┆ --- │
# │ binary ┆ str ┆ str │
# ╞═════════════════════════════════╪══════╪══════╡
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1 │
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1 │
# └─────────────────────────────────┴──────┴──────┘
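Until such a flag exists, one possible workaround is to cast the field on the server with MongoDB's `$toString` operator inside an aggregation pipeline. The sketch below only builds the pipeline stage; the helper name `to_string_stage` is hypothetical and not part of pymongoarrow.

```python
from typing import Any


def to_string_stage(fields: list[str]) -> dict[str, Any]:
    """Build an $addFields stage that casts each named field with $toString.

    Hypothetical helper for illustration; it is not a pymongoarrow API.
    """
    return {"$addFields": {name: {"$toString": f"${name}"} for name in fields}}


# Cast the mixed-type "code" field from the example documents to a string.
pipeline = [to_string_stage(["code"])]
print(pipeline)
# [{'$addFields': {'code': {'$toString': '$code'}}}]
```

Assuming the installed pymongoarrow version provides an aggregate counterpart to find_polars_all (aggregate_polars_all), this pipeline could be passed to it in place of the query, so that every document reaches schema inference with a string `code`.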
Reference - coerce_numbers_to_str in https://docs.pydantic.dev/latest/api/fields/#pydantic.fields.Field
Thank you! Tracking in JIRA https://jira.mongodb.org/browse/ARROW-252
@aclark4life I have seen the comment on the attached Jira ticket. Shall we discuss the proposed change here?
Yes! Are you able to send a PR with the proposed changes?
@aclark4life I would like to discuss the design before getting into the implementation. In Jira I noticed a suggestion to add a new data type, which I am fine with. With that approach, however, I would expect to pass a schema only for the affected field. And how would a schema be passed for nested keys?
Any help is appreciated.
No problem! Does this help at all? https://mongo-arrow.readthedocs.io/en/1.3.0/schemas.html#nested-data-with-schema I believe we're in agreement that we could support adding a new field type StrToIntField or IntToStrField as @ShaneHarvey suggested.
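To make the nested-key question concrete, here is a minimal pure-Python sketch of the semantics such a field type might have, with nested keys addressed by dotted paths. Everything in it is an assumption for discussion: the function name `coerce_numbers_to_str`, the dotted-path convention, and the nested `meta.zip` sample data are hypothetical and are not pymongoarrow behavior.

```python
from typing import Any


def coerce_numbers_to_str(doc: dict[str, Any], paths: list[str]) -> dict[str, Any]:
    """Return a copy of doc with numeric values at the dotted paths cast to str.

    Hypothetical helper: a sketch of the proposed coercion, not a library API.
    """
    result = dict(doc)
    for path in paths:
        head, _, rest = path.partition(".")
        if head not in result:
            continue  # missing keys are left alone
        value = result[head]
        if rest:
            # Descend into nested documents; non-dict values are left as-is.
            if isinstance(value, dict):
                result[head] = coerce_numbers_to_str(value, [rest])
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            result[head] = str(value)
    return result


# Illustrative data: the issue's documents plus an assumed nested field.
docs = [
    {"name": "test", "code": "1", "meta": {"zip": "560001"}},
    {"name": "test", "code": 1, "meta": {"zip": 560001}},
]
cleaned = [coerce_numbers_to_str(d, ["code", "meta.zip"]) for d in docs]
print(cleaned[1])
# {'name': 'test', 'code': '1', 'meta': {'zip': '560001'}}
```

Only the listed paths are touched, which matches the idea of declaring the coercion for just the affected fields rather than for the whole document.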