
createOriginalFileFromFileObj reads entire file into memory for SHA1 hash

Open · titusz opened this issue 3 months ago · 1 comment

Problem

The method _BlitzGateway.createOriginalFileFromFileObj, at line 4075 of src/omero/gateway/__init__.py, reads the entire file into memory to compute the SHA1 hash:

h.update(fo.read())  # Reads entire file into memory

This causes out-of-memory errors for files larger than available RAM.

Suggested Fix

Use chunked reading for SHA1 computation, similar to the upload logic that already uses 10KB chunks (lines 4093-4101):

from hashlib import sha1

h = sha1()
chunk_size = 10000
fo.seek(0)  # hash from the beginning of the file
while True:
    chunk = fo.read(chunk_size)
    if not chunk:
        break
    h.update(chunk)
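
Note that hashing leaves the file object at EOF, so if the same file object is then reused for the upload, it needs another fo.seek(0) before the upload loop reads it.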

Impact

Files larger than RAM cannot be uploaded through this method.

titusz · Sep 22 '25, 18:09

Thanks for opening the issue, and agreed on the limitations of the current implementation. Feel free to work on a Pull Request implementing your proposal alongside a completed CLA.

On a technical note, the code is already reading the source file in chunks as part of the upload via the raw file store:

https://github.com/ome/omero-py/blob/780876c5f6b48c327ef1685b377bac2c3cc46797/src/omero/gateway/__init__.py#L4086-L4104

It might make sense to perform the SHA computation/update as part of the same loop, reducing the number of I/O operations (especially for large files), and then update the OriginalFile with the checksum once the upload completes.
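
As a rough sketch (not the actual gateway code), the combined loop could look something like the following, assuming the fo, rawFileStore, originalFile, and updateService handles already set up by createOriginalFileFromFileObj; the exact calls for persisting the checksum would need to match the real method:

from hashlib import sha1

from omero.rtypes import rstring

h = sha1()
chunk_size = 10000
pos = 0
fo.seek(0)
while True:
    chunk = fo.read(chunk_size)
    if not chunk:
        break
    rawFileStore.write(chunk, pos, len(chunk))  # upload this chunk
    h.update(chunk)  # hash the same bytes, avoiding a second read pass
    pos += len(chunk)
originalFile = rawFileStore.save()  # persist the upload

# Post-upload: record the checksum on the OriginalFile
# (updateService is assumed here; the saving details would need to
# match the rest of the method)
originalFile.setHash(rstring(h.hexdigest()))
originalFile = updateService.saveAndReturnObject(originalFile)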

sbesson · Sep 23 '25, 07:09