python-irodsclient icon indicating copy to clipboard operation
python-irodsclient copied to clipboard

for downloads less than 32MB `pep_api_data_obj_read_*` is triggered more than once and contains wrong size info in `len` attribute

Open mstfdkmn opened this issue 1 year ago • 0 comments

python-irodsclient 2.0.1 irods 4.3.2

I am observing unnecessary pep-invokes (and irrelevant size information in the len attribute) when an object that its size is under the parallel threshold (size is less than 32BM) is downloaded. I think that is because of the extra reading (READ_BUFFER_SIZE ) for the small files here at https://github.com/irods/python-irodsclient/blob/bf46679d449ba382a766720eb3ece0a8600215d6/irods/data_object.py#L20

Called by https://github.com/irods/python-irodsclient/blob/bf46679d449ba382a766720eb3ece0a8600215d6/irods/manager/data_object_manager.py#L219

My observations:

  • when a zero byte object is downloaded, pep_api_data_obj_read_* is triggered only one time but the following message is captured. Please see the value for the len serialized object.

image

  • when an object that its size is more than zero (2bytes) but less than 32 MB is downloaded, pep_api_data_obj_read_* is triggered more than one time and the value in the len varies in each invoke.
  • that is, for a 2 bytes file download, pep_api_data_obj_read_* is triggered three times and only the one in the middle invoke has the correct size information.

image image image image

  • for a 20MB file download, pep_api_data_obj_read_* is triggered 5 times and none of them has the correct size information. image image image image image image

I have a kind of fix that ensures pep_api_data_obj_read_* is called only once, accurately populating the len field in the PEP. My tests haven't revealed any issues, but I'd like to get your input before submitting a pull request to confirm there won't be unintended side effects. Or might be anther nice way to fix this?

in /irods/data_object.py:

def chunks(f, chunksize=io.DEFAULT_BUFFER_SIZE):
    pos = f.tell()
    size = f.seek(0,os.SEEK_END)
    f.seek(pos,os.SEEK_SET)
    if size < 32 * ( 1024 ** 2)  - 1:
        return (size, f.read(size), b'')
    return (size, iter(lambda: f.read(chunksize), b''))

L219 in/irods/manager/data_object_manager.py:

else:
    no_chunk = chunks(o, self.READ_BUFFER_SIZE)
    if no_chunk[0] < 32 * ( 1024 ** 2):
       f.write(no_chunk[1])
       return
  for chunk in chunks(o, self.READ_BUFFER_SIZE)[1]:
       f.write(chunk)
       do_progress_updates(updatables, len(chunk))

mstfdkmn avatar Aug 05 '24 17:08 mstfdkmn