python-dwca-reader

Using chunksize gives `TypeError: 'TextFileReader' object does not support item assignment`

Open nigelcharman opened this issue 1 year ago • 4 comments

We've been using python-dwca-reader with no problems loading about 13k occurrences. We now need to scale it up to load about 3.25m occurrences.

Changing the code from:

        core_df = dwca.pd_read('occurrence.txt', parse_dates=True)

to:

        for chunk in dwca.pd_read('occurrence.txt', parse_dates=True, chunksize=10):
            ...

causes the error:

    ...
    for chunk in dwca.pd_read('occurrence.txt', parse_dates=True, chunksize=10):
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/opt/asdf/installs/python/3.11.7/lib/python3.11/site-packages/dwca/read.py", line 209, in pd_read
    df[shorten_term(field['term'])] = field_default_value
    ~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'TextFileReader' object does not support item assignment

Looking at gbif-alert, I see that you're using enumerate(dwca) rather than reading it in chunks, so I'll give that a try.

nigelcharman avatar Jul 02 '24 10:07 nigelcharman

We're now using enumerate(dwca) so we're in no rush to have this corrected. I'll leave the issue open though in case other people come across it.

nigelcharman avatar Jul 04 '24 12:07 nigelcharman

Note to self: this only happens when chunksize (and probably also the iterator parameter) is combined with a DwC-A that uses default values, because pd_read then gets a TextFileReader rather than a regular DataFrame and cannot assign the default-value columns to it.
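The underlying pandas behaviour can be reproduced without python-dwca-reader. The following is a minimal sketch, assuming a tiny made-up tab-separated file and an illustrative default-value column named `basisOfRecord`; it only shows why column assignment fails once `chunksize` is passed to `pandas.read_csv`.

```python
import io

import pandas as pd

# Tiny stand-in for an occurrence file (made-up data, tab-separated).
csv_text = (
    "id\tscientificName\n"
    "1\tApis mellifera\n"
    "2\tBombus terrestris\n"
    "3\tVespa crabro\n"
)

# Without chunksize, read_csv returns a DataFrame, so adding a column works.
df = pd.read_csv(io.StringIO(csv_text), sep="\t")
df["basisOfRecord"] = "HumanObservation"

# With chunksize, read_csv returns a TextFileReader (an iterator of DataFrames).
reader = pd.read_csv(io.StringIO(csv_text), sep="\t", chunksize=2)
reader_type = type(reader).__name__

# The reader is not a DataFrame, so assigning a column to it raises TypeError.
try:
    reader["basisOfRecord"] = "HumanObservation"
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Each chunk yielded by the reader is a regular DataFrame.
chunk_sizes = [len(chunk) for chunk in reader]
```

Here `error_message` carries the same "does not support item assignment" text as the traceback in the report, while the individual chunks remain ordinary data frames.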

niconoe avatar Jul 08 '24 11:07 niconoe

After careful inspection I can't see any sane way to deal with this specific combination (pd_read returning TextFileReader objects because of its parameters and the DwC-A using default values).

I therefore decided to document the incompatibility and add a human-readable exception for that situation. This is also tested.
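For illustration, a guard of this kind could look roughly like the sketch below. It is not the library's actual implementation: `read_with_defaults` and `IncompatibleOptionsError` are hypothetical names invented for this example, and the reader is plain `pandas.read_csv` on a tab-separated buffer.

```python
import io

import pandas as pd


class IncompatibleOptionsError(Exception):
    """Hypothetical exception type, named for this sketch only."""


def read_with_defaults(buffer, defaults, **kwargs):
    """Read a tab-separated buffer and add constant default-value columns.

    Hypothetical helper: raises a readable error when pandas returns an
    iterator (e.g. because chunksize was given) and defaults must be applied.
    """
    result = pd.read_csv(buffer, sep="\t", **kwargs)

    if not isinstance(result, pd.DataFrame):
        # chunksize/iterator make read_csv return a TextFileReader instead.
        if defaults:
            raise IncompatibleOptionsError(
                "Default values cannot be applied when the reader returns "
                "chunks (chunksize/iterator); read the whole file instead."
            )
        return result

    # Regular DataFrame: default values become constant columns.
    for column, value in defaults.items():
        result[column] = value
    return result


data = "id\tscientificName\n1\tApis mellifera\n2\tBombus terrestris\n"
defaults = {"basisOfRecord": "HumanObservation"}

# Whole-file read: the default column is added.
full_df = read_with_defaults(io.StringIO(data), defaults)

# Chunked read with defaults: a clear error instead of a bare TypeError.
try:
    read_with_defaults(io.StringIO(data), defaults, chunksize=1)
    guard_message = None
except IncompatibleOptionsError as exc:
    guard_message = str(exc)
```

The check tests for "not a DataFrame" rather than for a specific reader class, which keeps the sketch independent of where pandas defines `TextFileReader`.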

niconoe avatar Jul 08 '24 11:07 niconoe

Would it be worth adding a note to https://python-dwca-reader.readthedocs.io/en/latest/pandas_tutorial.html too? It was this documentation that led me to believe that this combination might be possible.

nigelcharman avatar Jul 08 '24 17:07 nigelcharman