for downloads less than 32MB `pep_api_data_obj_read_*` is triggered more than once and contains wrong size info in `len` attribute
python-irodsclient 2.0.1 irods 4.3.2
I am observing unnecessary pep-invokes (and irrelevant size information in the len attribute) when an object that its size is under the parallel threshold (size is less than 32BM) is downloaded. I think that is because of the extra reading (READ_BUFFER_SIZE ) for the small files here at https://github.com/irods/python-irodsclient/blob/bf46679d449ba382a766720eb3ece0a8600215d6/irods/data_object.py#L20
Called by https://github.com/irods/python-irodsclient/blob/bf46679d449ba382a766720eb3ece0a8600215d6/irods/manager/data_object_manager.py#L219
My observations:
- when a zero byte object is downloaded,
pep_api_data_obj_read_*is triggered only one time but the following message is captured. Please see the value for thelenserialized object.
- when an object that its size is more than zero (2bytes) but less than 32 MB is downloaded, pep_api_data_obj_read_* is triggered more than one time and the value in the
lenvaries in each invoke. - that is, for a 2 bytes file download, pep_api_data_obj_read_* is triggered three times and only the one in the middle invoke has the correct size information.
- for a 20MB file download,
pep_api_data_obj_read_*is triggered 5 times and none of them has the correct size information.
I have a kind of fix that ensures pep_api_data_obj_read_* is called only once, accurately populating the len field in the PEP. My tests haven't revealed any issues, but I'd like to get your input before submitting a pull request to confirm there won't be unintended side effects. Or might be anther nice way to fix this?
in /irods/data_object.py:
def chunks(f, chunksize=io.DEFAULT_BUFFER_SIZE):
pos = f.tell()
size = f.seek(0,os.SEEK_END)
f.seek(pos,os.SEEK_SET)
if size < 32 * ( 1024 ** 2) - 1:
return (size, f.read(size), b'')
return (size, iter(lambda: f.read(chunksize), b''))
L219 in/irods/manager/data_object_manager.py:
else:
no_chunk = chunks(o, self.READ_BUFFER_SIZE)
if no_chunk[0] < 32 * ( 1024 ** 2):
f.write(no_chunk[1])
return
for chunk in chunks(o, self.READ_BUFFER_SIZE)[1]:
f.write(chunk)
do_progress_updates(updatables, len(chunk))