CanlabCore icon indicating copy to clipboard operation
CanlabCore copied to clipboard

Rezipping in fmri_data may be unnecessary

Open jcf2 opened this issue 1 year ago • 2 comments

Currently fmri_data will uncompress .nii.gz inputs automatically using gunzip_image_names_if_gz:

[image_names, was_gzipped] = gunzip_image_names_if_gz(image_names);

This allows it to read .nii files. Depending on MATLAB gunzip and whether it deletes the original, this would result in either two files (.nii and .nii.gz) or just the .nii version. As housekeeping, fmri_data apparently assumes the latter as the worst case and rezips:

    % re-zip images if they were originally zipped
            % add .gz back to file names.
            if any(was_gzipped)
                
                image_names = re_zip_images(image_names, was_gzipped);
                
            end

where re_zip_images is a subfunction of fmri_data that calls MATLAB's gzip. This gzip call overwrites a .nii.gz if it exists, and re_zip_images then removes the .nii:

            gzip(deblank(image_names(i, :)));
            delete(deblank(image_names(i, :)))

This approach is unfortunately extremely costly, as gzip may take around 10x as much time as gunzip, and would seem to be totally unnecessary if the .nii.gz was in fact not deleted by gunzip. Also, it would seem that a replacement of .nii.gz files may invalidate cryptographic hashes calculated on the original.

So, one simple improvement would be to make the "housekeeping" step conditional, and if a .nii.gz exists remove the .nii instead.

In fact, the documentation for MATLAB's gzip (R2024a) says

gunzip does not delete the original GNU zip files.

and this seems to be true as well for past versions, so while OK to have that conditional "just in case" it seems rezip should be 100% avoidable here?

jcf2 avatar Sep 19 '24 14:09 jcf2

Note: the original approach was apparently to use the host system's gunzip

% [status, result] = system(['gunzip ' my_image]);

and there deletion of the original .gz would be the default (overridable by --keep). So that explains where the "worst case assumption" above comes from...

jcf2 avatar Sep 19 '24 14:09 jcf2

(oops, didn't mean to close this prematurely above... doh)

The above PR is designed to address efficiency related to zip with a minimal change -- just avoid an expensive and unnecessary call to gzip in fmri_data if the target already exists.

A separate issue that could be fixed by a deeper change is that fmri_data will fail on a readable .nii.gz if the directory is not writable (with a an obscure "internal error has occurred" originating from MATLAB gunzip). A fix that covers that as well could be to change the unzip structure:

  1. create a temporary directory: gunzip_temp = tempname;
  2. unzip there instead of in .gzip directory: gunzip(<target>, gunzip_temp);
  3. when the time comes for housecleaning, remove the temporary directory

Though in a way conceptually simpler, this approach would affect gunzip_image_names_if_gz as well as fmri_data (thus possibly affecting code beyond fmri_data), and would require passing the gunzip_temp location. Presumably in fmri_data it would also call for some renaming/restructuring, as re_zip_images would no longer be an accurate name.

jcf2 avatar Sep 20 '24 12:09 jcf2