Trouble reading documents with empty embedded arrays

Open ccrouch opened this issue 1 year ago • 6 comments

Goal: Read a MongoDB document whose embedded object contains an empty array into a pyarrow Table, then write it out as a Parquet file.

Expected result: Parquet file created

Actual result: pymongoarrow raises an error while creating the pyarrow.Table. Interestingly, reading the same document from MongoDB directly and using pyarrow.json to create the table works fine. Embedded objects with non-empty arrays also work fine with pymongoarrow.

Steps to reproduce:

from pymongo import MongoClient

import pymongoarrow.api as pmaapi

import pyarrow.parquet as papq
import pyarrow.json as pajson

import io

import bson.json_util  # import the submodule explicitly; used below as bson.json_util.dumps


client = MongoClient()
collection = client.testdb.data
collection.drop()  # start from an empty collection

client.testdb.data.insert_many([
    { '_id': 1, 'foo':  { 'bar': ['1','2'] } },
    { '_id': 2, 'foo':  { 'bar': [] } }
])

# get document out of mongo, put it in a file and read it with pyarrow and write it to parquet
doc1 = client.testdb.data.find_one({'_id': 1})
string1 = bson.json_util.dumps(doc1, indent = 2) 
file1 = io.BytesIO(bytes(string1, encoding='utf-8'))
papatable1 = pajson.read_json(file1)
print(str(papatable1))
papq.write_table(papatable1, 'pyarrow' + str(1) + '.parquet')

# read document with pymongoarrow and write it to parquet
pmapatable1 = pmaapi.find_arrow_all(client.testdb.data,{'_id': {'$eq': 1}})
print(str(pmapatable1))
papq.write_table(pmapatable1, 'pymongoarrow' + str(1) + '.parquet')



doc2 = client.testdb.data.find_one({'_id': 2})
string2 = bson.json_util.dumps(doc2, indent = 2) 
file2 = io.BytesIO(bytes(string2, encoding='utf-8'))
papatable2 = pajson.read_json(file2)
print(str(papatable2))
papq.write_table(papatable2, 'pyarrow' + str(2) + '.parquet')

pmapatable2 = pmaapi.find_arrow_all(client.testdb.data,{'_id': {'$eq': 2}})
papq.write_table(pmapatable2, 'pymongoarrow' + str(2) + '.parquet')

This produces:

$ python repro.py
pyarrow.Table
_id: int64
foo: struct<bar: list<item: string>>
  child 0, bar: list<item: string>
      child 0, item: string
----
_id: [[1]]
foo: [
  -- is_valid: all not null
  -- child 0 type: list<item: string>
[["1","2"]]]
pyarrow.Table
_id: int32
foo: struct<bar: list<item: string>>
  child 0, bar: list<item: string>
      child 0, item: string
----
_id: [[1]]
foo: [
  -- is_valid: all not null
  -- child 0 type: list<item: string>
[["1","2"]]]
pyarrow.Table
_id: int64
foo: struct<bar: list<item: null>>
  child 0, bar: list<item: null>
      child 0, item: null
----
_id: [[2]]
foo: [
  -- is_valid: all not null
  -- child 0 type: list<item: null>
[0 nulls]]
Traceback (most recent call last):
  File "/workspaces/vscode-python/pymongoarrow/repro.py", line 45, in <module>
    pmapatable2 = pmaapi.find_arrow_all(client.testdb.data,{'_id': {'$eq': 2}})
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/Envs/pma1/lib/python3.11/site-packages/pymongoarrow/api.py", line 112, in find_arrow_all
    process_bson_stream(batch, context)
  File "pymongoarrow/lib.pyx", line 159, in pymongoarrow.lib.process_bson_stream
  File "pymongoarrow/lib.pyx", line 246, in pymongoarrow.lib.process_raw_bson_stream
  File "pymongoarrow/lib.pyx", line 133, in pymongoarrow.lib.extract_document_dtype
  File "pymongoarrow/lib.pyx", line 108, in pymongoarrow.lib.extract_field_dtype
  File "pyarrow/types.pxi", line 4452, in pyarrow.lib.list_
TypeError: List requires DataType or Field
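
The traceback ends in `pyarrow.lib.list_`, reached from `extract_field_dtype`, which suggests that no element type could be determined for the empty `bar` array. The toy below illustrates that failure mode in pure Python. It is only a sketch, not pymongoarrow's code: `infer_type` is a hypothetical helper, and the type strings are plain text rather than real pyarrow types. It infers a list's element type from its first element and, like the `pyarrow.json` output above, falls back to a `null` element type when the list is empty.

```python
def infer_type(value):
    """Toy type inference; hypothetical, not pymongoarrow's implementation."""
    if isinstance(value, dict):
        fields = ", ".join(f"{k}: {infer_type(v)}" for k, v in value.items())
        return f"struct<{fields}>"
    if isinstance(value, list):
        # An empty list has no first element to inspect, so fall back to
        # "null" (the element type pyarrow.json reports in the output above).
        item = infer_type(value[0]) if value else "null"
        return f"list<item: {item}>"
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported value: {value!r}")


print(infer_type({"bar": ["1", "2"]}))
print(infer_type({"bar": []}))
```

Without the fallback branch, the empty list would leave the element type undefined, which is consistent with the `TypeError` raised above.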

FWIW, here is what duckdb shows for the three Parquet files that were produced:

D select * from 'pyarrow1.parquet';
┌───────┬───────────────────────┐
│  _id  │          foo          │
│ int64 │ struct(bar varchar[]) │
├───────┼───────────────────────┤
│     1 │ {'bar': [1, 2]}       │
└───────┴───────────────────────┘
D select * from 'pymongoarrow1.parquet';
┌───────┬───────────────────────┐
│  _id  │          foo          │
│ int32 │ struct(bar varchar[]) │
├───────┼───────────────────────┤
│     1 │ {'bar': [1, 2]}       │
└───────┴───────────────────────┘
D select * from 'pyarrow2.parquet';
┌───────┬───────────────────────┐
│  _id  │          foo          │
│ int64 │ struct(bar integer[]) │
├───────┼───────────────────────┤
│     2 │ {'bar': []}           │
└───────┴───────────────────────┘
D 

Versions:

Python 3.11.8 (main, Mar 12 2024, 11:41:52) [GCC 12.2.0] on linux
Successfully installed dnspython-2.6.1 numpy-1.26.4 packaging-23.2 pandas-2.2.2 pyarrow-15.0.2 pymongo-4.7.1 pymongoarrow-1.3.0 python-dateutil-2.9.0.post0 pytz-2024.1 six-1.16.0 tzdata-2024.1

ccrouch avatar May 05 '24 14:05 ccrouch

Hi @ccrouch, thanks for pointing out the limitation in our parser. I opened https://jira.mongodb.org/browse/ARROW-230 to track the fix.

blink1073 avatar May 07 '24 01:05 blink1073
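
Until the fix lands, it may help to know which documents contain empty arrays before handing a query to `find_arrow_all`. The helper below is a minimal pure-Python sketch for that; `find_empty_arrays` is a hypothetical name, not part of pymongo or pymongoarrow, and it assumes documents are plain dicts such as those returned by `find()`. Lists nested directly inside other lists are not inspected.

```python
def find_empty_arrays(doc, prefix=""):
    """Return the dotted paths of every empty list nested inside doc."""
    paths = []
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into embedded documents.
            paths.extend(find_empty_arrays(value, path))
        elif isinstance(value, list):
            if not value:
                paths.append(path)
            # Recurse into embedded documents stored inside the array.
            for i, item in enumerate(value):
                if isinstance(item, dict):
                    paths.extend(find_empty_arrays(item, f"{path}.{i}"))
    return paths


# The two documents from the reproduction script above.
docs = [
    {"_id": 1, "foo": {"bar": ["1", "2"]}},
    {"_id": 2, "foo": {"bar": []}},
]
for doc in docs:
    print(doc["_id"], find_empty_arrays(doc))
```

With the sample documents, only `_id` 2 reports a path (`foo.bar`), matching the document that triggers the error.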