mongo-arrow
Trouble reading documents with empty embedded arrays
Goal: Read a MongoDB document whose embedded object contains an empty array into a pyarrow table, then write it out as a Parquet file.
Expected result: Parquet file created
Actual result: pymongoarrow raises an error when creating the pyarrow.Table. Interestingly, reading the same document from MongoDB directly and using pyarrow.json to create the table works fine. Embedded objects with non-empty arrays also work fine with pymongoarrow.
Steps to reproduce:
from pymongo import MongoClient
import pymongoarrow.api as pmaapi
import pyarrow.parquet as papq
import pyarrow.json as pajson
import io
import bson.json_util
client = MongoClient()
collection = client.testdb.data
collection.drop()
client.testdb.data.insert_many([
{ '_id': 1, 'foo': { 'bar': ['1','2'] } },
{ '_id': 2, 'foo': { 'bar': [] } }
])
# get document out of mongo, put it in a file and read it with pyarrow and write it to parquet
doc1 = client.testdb.data.find_one({'_id': 1})
string1 = bson.json_util.dumps(doc1, indent = 2)
file1 = io.BytesIO(bytes(string1, encoding='utf-8'))
papatable1 = pajson.read_json(file1)
print(str(papatable1))
papq.write_table(papatable1, 'pyarrow' + str(1) + '.parquet')
# read document with pymongoarrow and write it to parquet
pmapatable1 = pmaapi.find_arrow_all(client.testdb.data,{'_id': {'$eq': 1}})
print(str(pmapatable1))
papq.write_table(pmapatable1, 'pymongoarrow' + str(1) + '.parquet')
doc2 = client.testdb.data.find_one({'_id': 2})
string2 = bson.json_util.dumps(doc2, indent = 2)
file2 = io.BytesIO(bytes(string2, encoding='utf-8'))
papatable2 = pajson.read_json(file2)
print(str(papatable2))
papq.write_table(papatable2, 'pyarrow' + str(2) + '.parquet')
pmapatable2 = pmaapi.find_arrow_all(client.testdb.data,{'_id': {'$eq': 2}})
papq.write_table(pmapatable2, 'pymongoarrow' + str(2) + '.parquet')
produces
$ python repro.py
pyarrow.Table
_id: int64
foo: struct<bar: list<item: string>>
child 0, bar: list<item: string>
child 0, item: string
----
_id: [[1]]
foo: [
-- is_valid: all not null
-- child 0 type: list<item: string>
[["1","2"]]]
pyarrow.Table
_id: int32
foo: struct<bar: list<item: string>>
child 0, bar: list<item: string>
child 0, item: string
----
_id: [[1]]
foo: [
-- is_valid: all not null
-- child 0 type: list<item: string>
[["1","2"]]]
pyarrow.Table
_id: int64
foo: struct<bar: list<item: null>>
child 0, bar: list<item: null>
child 0, item: null
----
_id: [[2]]
foo: [
-- is_valid: all not null
-- child 0 type: list<item: null>
[0 nulls]]
Traceback (most recent call last):
File "/workspaces/vscode-python/pymongoarrow/repro.py", line 45, in <module>
pmapatable2 = pmaapi.find_arrow_all(client.testdb.data,{'_id': {'$eq': 2}})
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vscode/Envs/pma1/lib/python3.11/site-packages/pymongoarrow/api.py", line 112, in find_arrow_all
process_bson_stream(batch, context)
File "pymongoarrow/lib.pyx", line 159, in pymongoarrow.lib.process_bson_stream
File "pymongoarrow/lib.pyx", line 246, in pymongoarrow.lib.process_raw_bson_stream
File "pymongoarrow/lib.pyx", line 133, in pymongoarrow.lib.extract_document_dtype
File "pymongoarrow/lib.pyx", line 108, in pymongoarrow.lib.extract_field_dtype
File "pyarrow/types.pxi", line 4452, in pyarrow.lib.list_
TypeError: List requires DataType or Field
FWIW, this is what duckdb shows for the three Parquet files that were produced:
D select * from 'pyarrow1.parquet';
┌───────┬───────────────────────┐
│ _id │ foo │
│ int64 │ struct(bar varchar[]) │
├───────┼───────────────────────┤
│ 1 │ {'bar': [1, 2]} │
└───────┴───────────────────────┘
D select * from 'pymongoarrow1.parquet';
┌───────┬───────────────────────┐
│ _id │ foo │
│ int32 │ struct(bar varchar[]) │
├───────┼───────────────────────┤
│ 1 │ {'bar': [1, 2]} │
└───────┴───────────────────────┘
D select * from 'pyarrow2.parquet';
┌───────┬───────────────────────┐
│ _id │ foo │
│ int64 │ struct(bar integer[]) │
├───────┼───────────────────────┤
│ 2 │ {'bar': []} │
└───────┴───────────────────────┘
D
Versions:
Python 3.11.8 (main, Mar 12 2024, 11:41:52) [GCC 12.2.0] on linux
Successfully installed dnspython-2.6.1 numpy-1.26.4 packaging-23.2 pandas-2.2.2 pyarrow-15.0.2 pymongo-4.7.1 pymongoarrow-1.3.0 python-dateutil-2.9.0.post0 pytz-2024.1 six-1.16.0 tzdata-2024.1
Hi @ccrouch, thanks for pointing out the limitation in our parser. I opened https://jira.mongodb.org/browse/ARROW-230 to track the fix.
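For readers following along, the kind of inference involved can be sketched in pure Python. This is only an illustration of the idea, not pymongoarrow's parser: `infer_type` is a hypothetical function, types are represented as plain strings, and only the value kinds used in the repro documents are handled. The point is the empty-list branch, which falls back to a null element type the way pyarrow.json does.

```python
def infer_type(value):
    """Return an Arrow-style type string for a decoded document value."""
    if isinstance(value, dict):
        fields = ", ".join(f"{key}: {infer_type(val)}" for key, val in value.items())
        return f"struct<{fields}>"
    if isinstance(value, list):
        # An empty list has no first element to inspect, so use a null
        # element type rather than leaving the element type undefined.
        item = infer_type(value[0]) if value else "null"
        return f"list<item: {item}>"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    raise TypeError(f"unsupported value: {value!r}")


# The two documents from the repro script.
doc1 = {"_id": 1, "foo": {"bar": ["1", "2"]}}
doc2 = {"_id": 2, "foo": {"bar": []}}

schema1 = {key: infer_type(val) for key, val in doc1.items()}
schema2 = {key: infer_type(val) for key, val in doc2.items()}

print(schema1)
print(schema2)
```

The sketch infers each list's element type from its first element only, which is a simplification; a real parser would also need to reconcile `list<item: null>` with `list<item: string>` when both appear in the same column.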