python-dwca-reader
Using chunksize gives `TypeError: 'TextFileReader' object does not support item assignment`
We've been using python-dwca-reader with no problems loading about 13k occurrences. We now need to scale it up to load about 3.25m occurrences.
Changing the code from:
core_df = dwca.pd_read('occurrence.txt', parse_dates=True)
to:
for chunk in dwca.pd_read('occurrence.txt', parse_dates=True, chunksize=10):
...
causes the error:
...
    for chunk in dwca.pd_read('occurrence.txt', parse_dates=True, chunksize=10):
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/opt/asdf/installs/python/3.11.7/lib/python3.11/site-packages/dwca/read.py", line 209, in pd_read
    df[shorten_term(field['term'])] = field_default_value
    ~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'TextFileReader' object does not support item assignment
Looking at gbif-alert, I see that you're using enumerate(dwca) rather than reading it in chunks, so I'll give that a try.
We're now using enumerate(dwca) so we're in no rush to have this corrected. I'll leave the issue open though in case other people come across it.
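For anyone who lands here and wants chunk-like processing while iterating over rows, here is a minimal sketch of a batching helper. `in_batches` is a hypothetical name, not part of python-dwca-reader, and the sample rows are stand-in dictionaries rather than real archive rows. It accepts any iterable, so it should also work when passed the reader object that `enumerate(dwca)` iterates over, though I have only exercised it on plain lists here.

```python
from itertools import islice


def in_batches(rows, batch_size):
    """Yield lists of up to batch_size items from any iterable (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        # islice pulls at most batch_size items without loading the rest
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in data; with a real archive you would pass the reader object instead.
sample_rows = [{"id": i} for i in range(7)]
batch_sizes = [len(batch) for batch in in_batches(sample_rows, 3)]
```

Only one batch is held in memory at a time, which is the main reason to prefer this over building a single large list for millions of occurrences.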
Note to self: this only happens when chunksize (and probably also the iterator parameter) is combined with a DwC-A that uses default values. In that case pd_read gets a TextFileReader back from pandas rather than a regular data frame, and assigning the default-value column to it fails.
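The mechanism can be reproduced with pandas alone. The sketch below uses made-up tab-separated data and a made-up default column (`country` = `BE`). It shows that `read_csv` with `chunksize` returns a `TextFileReader`, that item assignment on it raises the same `TypeError`, and that assignment works on each individual chunk. `read_chunks_with_default` is a hypothetical helper for illustration, not how python-dwca-reader handles default values.

```python
import io

import pandas as pd

# Made-up sample data standing in for a small occurrence file.
CSV_TEXT = "id\tscientificName\n1\tParus major\n2\tPica pica\n3\tApus apus\n"


def read_chunks_with_default(text, column, default, chunksize):
    """Read tab-separated text in chunks, adding a constant column to each chunk."""
    reader = pd.read_csv(io.StringIO(text), sep="\t", chunksize=chunksize)
    chunks = []
    for chunk in reader:
        # Each chunk is a regular DataFrame, so item assignment is allowed here.
        chunk[column] = default
        chunks.append(chunk)
    return chunks


# With chunksize set, read_csv returns an iterator object, not a DataFrame.
reader = pd.read_csv(io.StringIO(CSV_TEXT), sep="\t", chunksize=2)
reader_type = type(reader).__name__

try:
    reader["country"] = "BE"  # same kind of assignment that fails inside pd_read
    assignment_error = None
except TypeError as exc:
    assignment_error = str(exc)
finally:
    reader.close()

chunks = read_chunks_with_default(CSV_TEXT, "country", "BE", chunksize=2)
```

Applying the default per chunk works in this toy case because the caller controls the loop; inside pd_read the reader is returned before any chunk has been produced.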
After careful inspection I can't see any sane way to deal with this specific combination (pd_read returning TextFileReader objects because of its parameters and the DwC-A using default values).
I therefore decided to document the incompatibility and raise a human-readable exception in that situation. This is also covered by a test.
Would it be worth adding a note to https://python-dwca-reader.readthedocs.io/en/latest/pandas_tutorial.html too? It was this documentation that led me to believe that this combination might be possible.