
Helper function for saving compact integer datatypes

Open neurolabusc opened this issue 3 years ago • 38 comments

I aided a nibabel user who wrote a function to segment a brain into 4 tissue types (0 = non-brain, 1 = gray matter, 2 = white matter, 3 = CSF). They were surprised at the huge disk space used by the resulting NIfTI images, which were saved with the DT_INT64 datatype. Beyond being large, this is a very unusual datatype for NIfTI (it did not exist in the ancestor Analyze format).

The native NumPy datatype for integers is int64, so nibabel is preserving this datatype. I do wonder if the Nifti1Image command could include an optional parameter that chooses the most compact data type when the input data is of an integer type (e.g. in this example, the voxel intensities range from 0..47, so the data could be saved losslessly as DT_UINT8, requiring 1/8th the disk space).

When I look at the nibabel examples, they all sensibly set the dtype explicitly:

data = np.ones((32, 32, 15, 100), dtype=np.int16)

However, the rationale for these is never described. Perhaps the documentation could note that the choice of datatype impacts file size, and that some tools only support a limited range of these datatypes (the classic Analyze data types are UINT8, INT16 and FLOAT32, and some tools can act in unexpected ways with other datatypes, e.g. AFNI promotes UINT16 to FLOAT32).

import nibabel as nib
import numpy as np

data = np.arange(4*4*3).reshape(4,4,3)
# data.dtype is int64 by default; values are integers 0..47
img = nib.Nifti1Image(data, affine=np.eye(4))
nib.save(img, 'test.nii')

data8 = np.arange(4*4*3, dtype=np.uint8).reshape(4,4,3)
img8 = nib.Nifti1Image(data8, affine=np.eye(4))
nib.save(img8, 'test8.nii')

data32 = np.arange(4*4*3, dtype=np.float32).reshape(4,4,3)
img32 = nib.Nifti1Image(data32, affine=np.eye(4))
nib.save(img32, 'test32.nii')
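
For what it's worth, comparing the resulting file sizes makes the size impact concrete (a quick check with only the standard library, run after the snippet above):

import os

for fname in ('test.nii', 'test8.nii', 'test32.nii'):
    print(fname, os.path.getsize(fname), 'bytes')
# The int64 image stores 8 bytes per voxel versus 1 byte for the uint8 image.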

neurolabusc avatar Aug 19 '21 13:08 neurolabusc

Thinking a little about this, I think we can do something like:

import numpy as np

def get_min_dtype(img):
    # Smallest dtype that can hold the image's values (integers only; floats pass through)
    arr = np.asanyarray(img.dataobj)
    min_dt = arr.dtype
    if np.issubdtype(min_dt, np.integer):
        min_dt = np.promote_types(np.min_scalar_type(arr.min()),
                                  np.min_scalar_type(arr.max()))
    return min_dt

And of course we could set with img.set_data_dtype(). What would your ideal API for invoking this behavior look like?
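
For example, a rough sketch of wiring that together before saving (using the get_min_dtype helper above):

import numpy as np
import nibabel as nib

data = np.arange(4*4*3).reshape(4, 4, 3)       # int64 on most platforms
img = nib.Nifti1Image(data, affine=np.eye(4))
img.set_data_dtype(get_min_dtype(img))         # header now says uint8 for values 0..47
nib.save(img, 'test_small.nii')                # data is cast to the header dtype on disk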

effigies avatar Aug 19 '21 15:08 effigies

Related to able_int_type and usage in test_casting.py. I think someone more familiar with nibabel could suggest the best API.

neurolabusc avatar Aug 19 '21 16:08 neurolabusc

Just coming back to this, as @neurolabusc just prodded me (thanks Chris).

Thinking about it quickly now, we have a few options for optional support (sketches).

Starting with arr = np.arange(4*4*3).reshape(4,4,3), we could shrink down to np.int8 with any of:

  • img = nib.Nifti1Image(as_min_int(arr), np.eye(4))
  • img = nib.Nifti1Image(arr, np.eye(4)); small_img = as_min_int_img(img)
  • img = nib.Nifti1Image(arr, np.eye(4), shrink_int_dtype=True)

Then we could do the shrinking automatically - as in shrink_int_dtype=True as the default - but that seems like too much of a behavior change.

Any thoughts about those options?
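
For concreteness, a sketch of what the as_min_int helper in the first option could look like (neither as_min_int nor shrink_int_dtype exists in nibabel yet; this is just to illustrate the idea):

import numpy as np

def as_min_int(arr):
    # Return an integer array cast to the smallest dtype that holds its range;
    # non-integer arrays are returned unchanged.
    if not np.issubdtype(arr.dtype, np.integer):
        return arr
    min_dt = np.promote_types(np.min_scalar_type(arr.min()),
                              np.min_scalar_type(arr.max()))
    return arr.astype(min_dt)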

matthew-brett avatar Jan 19 '22 18:01 matthew-brett

My sense would be to shrink integers automatically unless the user disables the feature. Float types are used when precision is needed, while integers are used by nibabel users for segmentations (3-6 discrete levels), labels (thousands of regions at most) and masks (binary). In each case, UINT8 or INT16 would be sufficient. Saving data as INT64 means that the resulting images are incompatible with many popular tools: AFNI, FreeSurfer, Slicer 3D, SPM12, etc. INT64 also has implications for disk space and RAM when loaded.

As someone who gets feedback from nibabel users, it seems like there are clear unintended consequences of the current default of saving as INT64. I actually cannot think of a real-world situation where someone would need INT64 data, and such peculiar situations would not be compatible with the standard tools in our field. To me, the default behavior should be compatible with the popular tools in our field.

neurolabusc avatar Jan 19 '22 20:01 neurolabusc

Other options I can think of, with less impact on changing default behavior:

  • Truncating int64 to int32 by default.
  • Adding dtype aliases like dtype='mask' (int8), dtype='classification' (int8) and dtype='label' (int16 unless any value is out of range) - a rough sketch of how these aliases could resolve follows below.
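
A minimal sketch of how such aliases might resolve (the alias names and the resolve_dtype_alias helper are hypothetical):

import numpy as np

def resolve_dtype_alias(alias, arr):
    # Map a friendly alias to a concrete NumPy dtype (hypothetical helper).
    if alias in ('mask', 'classification'):
        return np.dtype(np.int8)
    if alias == 'label':
        info = np.iinfo(np.int16)
        if arr.min() >= info.min and arr.max() <= info.max:
            return np.dtype(np.int16)
        return np.dtype(np.int32)   # fall back if labels exceed the int16 range
    return np.dtype(alias)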

Chris - would uint8 or uint16 cause compatibility problems do you think?

matthew-brett avatar Jan 19 '22 23:01 matthew-brett

This table shows support across AFNI, FSL, FreeSurfer (FrSr), MRIcroGL (MRGL) and SPM. For each cell I indicate whether the tool supports Read (R), Write (W) and Internal use (I) of the datatype. The datatypes marked with an asterisk (*) were supported by the ancestor Analyze format; they have historically been the most popular and are generally the best supported.

Type         AFNI  FSL   FrSr  MRGL  SPM
DT_INT8      R     R     R     R     RW
DT_UINT8*    RWI   RW    RWI   RWI   RW
DT_INT16*    RWI   RW    RWI   RWI   RW
DT_UINT16    R     R     R     R     RW
DT_INT32*    RW    RW    RWI   RW    RW
DT_UINT32    R     R     R     R     RW
DT_INT64     -     R     -     -     -
DT_UINT64    -     R     -     -     -
DT_FLOAT32*  RWI   RWI   RWI   RWI   RW
DT_FLOAT64   R     RWI   R     R     RWI

In general, if a tool does not internally support a datatype, the data is converted internally to a supported FLOAT type. For AFNI and fslmaths (using the default -dt float) this rounds extreme integer values due to flintmax. For completeness, there are exceptions to this rule, e.g. MRIcroGL will adjust the scale/intercept to losslessly store UINT16 internally as INT16.
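
As an illustration of that UINT16-as-INT16 trick (a sketch of the general idea, not MRIcroGL's actual code): shift the values by 32768 so they fit in INT16 and record the shift in the NIfTI scl_inter field.

import numpy as np

u16 = np.array([0, 1000, 40000, 65535], dtype=np.uint16)   # example values

# Store shifted values as int16, recording the shift in the scaling fields
stored = (u16.astype(np.int32) - 32768).astype(np.int16)
scl_slope, scl_inter = 1.0, 32768.0

# Readers reconstruct the original intensities as stored * scl_slope + scl_inter
recovered = stored * scl_slope + scl_inter
assert np.array_equal(recovered, u16)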

This promotion means that memory demands and disk usage can be impacted by the datatype choice.

Historically, MRIs used a 12-bit ADC and therefore the DICOMs stored 12 bits per pixel, which worked nicely with the popular INT16 datatype. Recently, many leading-edge scanners have moved to a 16-bit ADC, with the resulting magnitude images often scaled to use the full UINT16 range (0..65535, even if this precision is beyond the SNR of the raw data). This is a challenge for tools like dcm2niix: should the UINT16 data be converted to the well-supported FLOAT32, saved as the poorly supported UINT16, or saved as INT16 if the maximum value is no greater than 32767 and UINT16 otherwise?

These choices have had rare unintended consequences that are directly relevant to this discussion. For example, consider an fMRI study where each individual completes two series of fMRI runs. It is possible (though unlikely) that the MRI reconstructor will generate magnitude images for one run in the range 0..32767, while the other run will generate some voxels that exceed 32767. In this case, it seems efficient to store one as INT16 (for better legacy support) while the other requires either UINT16 or FLOAT32. A tool like AFNI will promote UINT16 to FLOAT32, while natively supporting the INT16. Unfortunately, AFNI requires that all first-level fMRI data have the same datatype, and will otherwise create bogus statistical maps.

My own suggestion is to stick with the legacy datatypes (listed with * in the table) as these have the most widespread support.

neurolabusc avatar Jan 20 '22 12:01 neurolabusc

@effigies : this issue came up again in a recent discussion. Is there any further thought on how nibabel will output integers? Having int64 be the default int type for outputs, regardless of input type or min/max int value, seems undesirable on many fronts. It increases users' disk/memory usage unnecessarily, creates inconsistency of default I/O types (that is, input and output types for simple operations do not match) and creates software interoperability problems.

For example, it would seem to make sense that if a dataset comes in with a particular type and the operation performed doesn't require changing that type, then the output dataset should keep the input type? *Asterisk to prove the rule: now that int64 dsets have been created, though, it would seem that in most cases propagating these unnecessarily---i.e., when the extremal values would fit in a smaller int type---would not be desirable.

Starting by trying to have the output match the input type, but then adapting/"guessing" a necessary alternative, seems like it might be preferable and not too bad to implement?

mrneont avatar Feb 08 '22 15:02 mrneont

I do get the point that int64 doesn't have obvious use cases, so it does seem sensible to provide a way for users to say "cast ints down", but I feel that changing the default behavior is problematic (see below).

Matching input type is pretty straightforward with nibabel, but it is the programmer's responsibility to track inputs:

img = nb.load(fname)
new_data = ... # derive from img.dataobj or generate it somehow
new_img = nb.Nifti1Image(new_data, img.affine, img.header)
# Make any additional changes to the header, if needed
new_img.to_filename(new_fname)

Nibabel has no concept of "input type" if the header of an input image is not passed to a new image, but if it is, I think the behavior is what you would expect.

The problem is more what happens when someone doesn't provide a header that lets nibabel know how to interpret the data array. They can set_data_dtype() if they like, or have it set automatically by coercing the data array before creating a new image with it. Downcasting by default will be a significant and possibly unwelcome change in behavior to people who are used to nibabel doing what it's told.
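
For example, either of the following (a sketch) makes the intended on-disk type explicit when no header is available:

import numpy as np
import nibabel as nib

data = np.arange(24).reshape(2, 3, 4)               # int64 on most platforms

# Option 1: keep the array as-is, but declare the on-disk dtype in the header
img = nib.Nifti1Image(data, np.eye(4))
img.set_data_dtype(np.int16)                         # cast to int16 when saved

# Option 2: coerce the array before constructing the image
img2 = nib.Nifti1Image(data.astype(np.int16), np.eye(4))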

Consider if I actually did want an int64 array. Right now I would write:

>>> img = nb.Nifti1Image(np.arange(4*4*3, dtype='int64').reshape(4,4,3), np.eye(4))
>>> img.get_data_dtype()
dtype('<i8')
>>> img.set_data_dtype("int64")
>>> img.get_data_dtype()
dtype('<i8')
>>> new_img = nb.Nifti1Image.from_bytes(img.to_bytes())  # round-trip through an in-memory filehandle
>>> new_img.get_data_dtype()
dtype('<i8')

Note that being explicit does not produce a different result in the image from receiving an array of Python integers and converting them to a numpy array. So if we change the default behavior, people who have been going out of their way to be explicit may stop getting what they expect.


All this aside, @neurolabusc's earlier point that the documentation can be improved is quite true. The only example where we pass an existing header (that I was able to find quickly, anyway) is https://nipy.org/nibabel/nifti_images.html#default-sform-and-qform-codes.

effigies avatar Feb 08 '22 16:02 effigies

Hi @effigies, @neurolabusc kindly brought this conversation to my attention. I feel that in some way the nibabel API's default behavior did change, when the output of a Nifti1Image dataobj became int64 rather than the earlier int32. We already had to deal with this in tedana (see here for when I had to add a small patch), and to some degree I'd expect similar problems in other packages. It is becoming more common for users to use tools like nilearn, which uses this API, and then go back into some other package. Having to manually track data types is probably not going to be intuitive for most of them, and I can think of at least one case where I've had to walk somebody through why this is happening (since there is no obvious indication of what happened to an end-user until they receive a load failure).

Perhaps a fair API change would be to add a method to the API that would cast the data object into the same datatype as what's in the NIfTI header, something like into_ndarray? Then this way the user is guaranteed a type-clear way of getting their data into the commonly used ndarray format.

jbteves avatar Feb 08 '22 17:02 jbteves

I don't think the behavior of Nibabel has changed. Maybe you were using 32-bit Windows at one point, where the default NumPy integer is 32-bit? Otherwise the default is 64-bit - and Nibabel - as far as I know - has always just accepted the data type you pass it, unless the header tells it what type you have.

I think I don't get what you're suggesting with the into_ndarray API - or maybe - how it relates to the int64 default NumPy datatype. I think you mean that if the header says that the on-disk storage type is int8 for example, that there should be a way of getting the data as np.int8 datatype - I guess throwing away any scalefactors? If the image does not have scalefactors, np.array(img.dataobj) does what you want.

matthew-brett avatar Feb 08 '22 18:02 matthew-brett

I feel that in some way the nibabel API's default behavior did change, when the output of a Nifti1Image dataobj became int64 rather than int32 before.

The logic for np.asanyarray(img.dataobj) should not have changed, and int64 should only result if the on-disk dtype is int64 or you pass data such that np.array(data).dtype == np.dtype('int64'). (You can get floats out of on-disk ints if there are scale factors.) If you can reproduce a situation where int32 is coerced up to int64, this is a bug, not an intentional API change.

Perhaps a fair API change would be to add a method to the API that would cast the data object into the same datatype as what's in the NIFTI header, something like into_ndarray? Then this way the user is guaranteed a type-clear way of getting their data into the commonly used ndarray format.

If I'm understanding you correctly, this aims to solve common problems with chaining operations on in-memory images, which do not enforce that img.dataobj.dtype == img.get_data_dtype()? I could imagine something equivalent to (but hopefully more efficient than) the following to ensure that an in-memory image appears identical to what it would look like if freshly read from disk:

class Nifti1Image:
    def refresh(self):
        return self.from_bytes(self.to_bytes())

But I feel like we're going a bit afield of the issue at hand. These issues seem connected in that there are expectation violations. I think the basic problem is that most suites read/write to disk at every point of user intervention, while working in Python allows you to defer the serialization (and possible precision loss) until the last possible moment.

effigies avatar Feb 08 '22 18:02 effigies

Pragmatically, the current behavior of nibabel is incompatible with the most popular NIfTI tools in the field. In practice, it is causing grief to a lot of users. If the nibabel team really wants to support this data format, they should develop and validate patches for the leading tools in our field and submit pull requests (SPM, FSL, AFNI, FreeSurfer). The current behavior is not in the spirit of a popular interchange format. I have updated my software simply to stop emails from people who use my tools. However, this is fixing the symptom: my tools are small and simple and validating solutions is much easier than for larger projects.

As nilearn is gaining popularity, its dependency on nibabel is generating a large number of images that are not compatible with other tools. NIfTI is an interchange format, and while legal, the DT_UINT64 and DT_INT64 data types are not used by other tools. In the same way, I can create legal DICOM images whose transfer syntax is unsupported by virtually every tool. While such an image is technically legal, it will frustrate users. I would argue that nibabel should never save INT64 data. A nice solution would be that if a NIfTI image is requested with dtype='int64', nibabel uses the most compact of the common integer datatypes (DT_UINT8, DT_INT16 and DT_INT32). If the data range exceeds DT_INT32, I think nibabel should raise a data dtype "int64" not supported error. In practice, this is not a supported NIfTI format, and it breaks compatibility with many other tools. There is already a precedent for this: DT_BINARY is a legal NIfTI data type, however it is unsupported by virtually every tool and nibabel prevents a user from creating this technically legal but pragmatically incompatible image:

import numpy as np
import nibabel as nib
data2 = np.array([1,0,1,0,1,0,1,1,1], dtype=bool).reshape(3,3,1)
img2 = nib.Nifti1Image(data2, affine=np.eye(4))

nibabel.spatialimages.HeaderDataError: data dtype "bool" not supported

I want to be clear that I cannot think of a single real-world situation in our field where a user would prefer an integer exceeding 2147483647 to a floating point representation. Even in such an extreme edge situation, one would think the FLOAT64 flintmax of 9007199254740992 would be sufficient.

The nibabel developers need to be mindful of the knock-on effects of their implementations. There are ripple effects for developers of other tools when users misdiagnose the source of an error.

neurolabusc avatar Feb 08 '22 18:02 neurolabusc

My apologies, np.asanyarray works as intended and this is a nilearn behavior in our specific implementation, which I realize must be frustrating to read since you're not nilearn (been there). That said...

If I'm understanding you correctly, this aims to solve common problems with chaining operations on in-memory images, which do not enforce that img.dataobj.dtype == img.get_data_dtype()?

I do believe it is pertinent to the issue, since what you're suggesting so far is that users must check for themselves that an int64 will not be saved. My understanding of nibabel.save is that it will select the data type of the array, rather than the image data type. I believe that this is an undue and confusing burden on the user, who may believe that checking the type of the dataobj or the header would be interchangeable in the Nifti1Image API (from my read, the API docs do not indicate that these are not interchangeable). A read of the Nifti1Image constructor also does not indicate whether a header-array mismatch would be reconciled. At the very least, a Warning could indicate to users that their array is not the correct datatype for their supplied header.

I agree with the above assessment by @neurolabusc with the exception that I think it may be reasonable to raise a warning rather than fail to support it outright. However, I do think that because most tools do not support this data type, and because it is very rare that such a datatype is even desirable to write to disk, it is imperative that users be made aware of the potential compatibility issues.

I would be happy to help implement either solution.
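
A minimal sketch of the kind of mismatch warning suggested above (hypothetical; not current nibabel behavior):

import warnings
import numpy as np

def warn_on_dtype_mismatch(img):
    # Compare the in-memory array dtype against the header's on-disk dtype.
    arr_dtype = np.asanyarray(img.dataobj).dtype
    hdr_dtype = img.get_data_dtype()
    if arr_dtype != hdr_dtype:
        warnings.warn(f"Array dtype {arr_dtype} does not match header dtype "
                      f"{hdr_dtype}; data will be cast to {hdr_dtype} on save.")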

jbteves avatar Feb 08 '22 19:02 jbteves

I disagree with completely disabling (u)int64. I don't see it as nibabel's job to prevent people from writing data that others can't read, but I do see the value in defaulting to things they can. I would propose we add a dtype parameter to Nifti1Image.to_filename() (and all related methods) that would have the following semantics:

  • If dtype is None (default):
    • Check img.get_data_dtype(). If (U)INT64, coerce down to the smallest integer that can encode values.
  • Else use np.dtype(dtype), and the user accepts the consequences.

There might be some corner cases that need to be handled related to rescaling floats, but would this suffice? I think in practice it will maintain current behavior except for int64, and so be minimally disruptive. We could also include a "smart" option like dtype="compat" that will force into analyze-compatible or error.
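
A rough sketch of those semantics (the function name is hypothetical, just to make the proposal concrete):

import numpy as np

def resolve_output_dtype(img, dtype=None):
    if dtype is not None:
        return np.dtype(dtype)                   # explicit: the user accepts the consequences
    dt = img.get_data_dtype()
    if dt not in (np.dtype(np.int64), np.dtype(np.uint64)):
        return dt                                # current behavior preserved
    # (U)INT64: coerce down to the smallest integer type that can encode the values
    arr = np.asanyarray(img.dataobj)
    return np.promote_types(np.min_scalar_type(arr.min()),
                            np.min_scalar_type(arr.max()))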

effigies avatar Feb 08 '22 19:02 effigies

From my point of view that's perfect.

jbteves avatar Feb 08 '22 20:02 jbteves

@effigies your proposal sounds reasonable to me.

neurolabusc avatar Feb 08 '22 20:02 neurolabusc

My understanding of nibabel.save is that it will select the data type of the array, rather than the image data type.

Just to clarify - if the dtype is stored in the image header, then Nibabel will save as that dtype, rather than the array dtype:

[ins] In [15]: hdr = nib.Nifti1Header()
[ins] In [16]: hdr.set_data_dtype(np.int8)
[ins] In [17]: arr = np.zeros((2, 3, 4), dtype=np.int64)
[ins] In [19]: nib.save(nib.Nifti1Image(arr, np.eye(4), hdr), 'out.nii')
[ins] In [21]: img = nib.load('out.nii')
[ins] In [22]: img.get_data_dtype()
Out[22]: dtype('int8')
[ins] In [23]: np.array(img.dataobj).dtype
Out[23]: dtype('int8')

So, if Nibabel knows about the dtype, it preserves it. The situation that's causing the trouble is when Nibabel knows nothing about the dtype except the dtype of the array.

matthew-brett avatar Feb 08 '22 21:02 matthew-brett

@effigies - I think the problem we need to think about here is:

arr = np.arange(24).reshape(2, 3, 4)  # Will be int64, almost always
nib.save(nib.Nifti1Image(arr, np.eye(4)), 'test.nii')
arr_back = np.array(nib.load('test.nii'))  # Will be int64, as things stand
my_vals = arr_back * 100   # Oops

As things are, all will be well, and we'd get:

Out[29]: 
array([[[   0,  100,  200,  300],
        [ 400,  500,  600,  700],
        [ 800,  900, 1000, 1100]],

       [[1200, 1300, 1400, 1500],
        [1600, 1700, 1800, 1900],
        [2000, 2100, 2200, 2300]]])

If we quietly truncate to e.g. uint8, we get:

[ins] In [31]: arr_back.astype(np.uint8) * 100
Out[31]: 
array([[[  0, 100, 200,  44],
        [144, 244,  88, 188],
        [ 32, 132, 232,  76]],

       [[176,  20, 120, 220],
        [ 64, 164,   8, 108],
        [208,  52, 152, 252]]], dtype=uint8)

Lots of integer overflow because of the working np.uint8 datatype.

So, this could easily cause some nasty and unexpected breakage for people using img.dataobj. And I suspect that is fairly common.

matthew-brett avatar Feb 08 '22 22:02 matthew-brett

@matthew-brett the nibabel use cases I have seen for integers were masks and segmentation maps. For these, one might expect binary operations, but not the sort of manipulation you describe. Common tools in our field like fslmaths and SPM distinguish the data type used for computation from the data type stored on disk (the -dt and -odt options in fslmaths), using floating point for the computation.

neurolabusc avatar Feb 08 '22 22:02 neurolabusc

It sounds analogous to reading in floats, doing intermediate calcs with doubles to not lose precision in the intermediate steps, and then outputting (as presumably desired) floats.

Sure, there will be data conversion issues, depending on operations involved. I did phrase my suggestion as: For example, it would seem to make sense that if a dataset comes in with a particular type and the operation performed doesn't require changing that type, then the output dataset should keep the input type?

(Sidenote: with my nibabel v3.2.1, I had to do the following to get the above example to work without error:

arr = np.arange(24).reshape(2, 3, 4)  # Will be int64, almost always
nib.save(nib.Nifti1Image(arr, affine=np.eye(4)), 'test.nii')
arr_back = np.asanyarray(nib.load('test.nii').dataobj) # Will be int64, as things stand

print(arr_back * 100)
print(arr_back.astype(np.uint8) * 100)

)

Python does automatic type recasting in some cases---consider either output above if one multiplied by 100. (a float) instead of by 100 (an int). Similarly, if one divides by 100, or even by 1 (in Python 3...). The issue in the above case is that NumPy won't automagically promote within the int family, even though it "should" in order not to lose value here.

It would seem fine to do intermediate calcs using int64, sure, but then to have the output type be a more appropriate int varietal.

mrneont avatar Feb 08 '22 23:02 mrneont

@mrneont - yes - sure - the point I was making was that the output type often becomes the type used for intermediate calculations - and that, with this backwards-incompatible change, this will appear as silent breakage from integer overflow. I do take Chris's point that this will probably be rare - but it's still a very bad thing to do, to silently change calculations.

I think we've got to find some way of preventing that happening, and in practice, I think that means we've got to find some way of warning then breaking here. I would vote strongly for avoiding silent breakage - can argue further if that's controversial.

For example - we could do something like:

# Warn
arr = np.arange(24).reshape(2, 3, 4)  # Will be int64, almost always
img = nib.Nifti1Image(arr, np.eye(4))
# Maybe
# Warning: using data type np.int64 from array, but this is poorly supported
# by other Nifti tools.  This will cause an error in a future version of Nibabel
# Silence this warning with
# Nifti1Image(arr, np.eye(4), dtype=np.int64)
# if you really want the int64 output type, or specify another datatype, e.g.
# Nifti1Image(arr, np.eye(4), dtype=np.int8)
# or
# Nifti1Image(arr, np.eye(4), dtype='smallest-int')
# See docstring for details

matthew-brett avatar Feb 08 '22 23:02 matthew-brett

Sorry - just to clarify - later:

img = nib.Nifti1Image(arr, np.eye(4))

would go straight to an error, with the message above. Therefore, the results from Nibabel code now will stay the same in later versions, or generate a specific error that gives the fix - and we never generate data that can silently result in different results for later versions of Nibabel.

matthew-brett avatar Feb 08 '22 23:02 matthew-brett

Okay, so if I understand, @matthew-brett you would rather add a dtype parameter and have this warning emitted at image creation time, rather than at serialization time. That would be fine with me.

Would it be useful to have both? Then the equivalent of -dt double and -odt float would be:

img = nb.Nifti1Image(data, affine, dtype='float64')
final_img = some_func(img)
final_img.to_filename(fname, dtype='float32')

This way the code does not need to know in detail what some_func does to generate a new image.


Regarding using np.asanyarray(img.dataobj) instead of img.get_fdata(), this has always been a "You know what you're doing" mode, so if someone does fetch uint8s out of a label image and multiplies them out of range, that seems to be on them. I'm okay with only casting int64 down to int32 by default, though.

effigies avatar Feb 10 '22 15:02 effigies

I do think casting int64 to int32 would prevent a lot of compatibility issues. The range of int32 should avoid any reasonable overflow. From my perspective, DT_BINARY and DT_INT64 both exist in the specification, but are virtually unsupported by tools. Since NIfTI is our field's lowest-common-denominator interchange format, we need to be wary of how other tools handle the data. My sense is that if an advanced user really wanted INT64, they would be best served storing it in .npy, not NIfTI.

neurolabusc avatar Feb 10 '22 15:02 neurolabusc

@effigies - yes - I think this is an image instantiation-time thing - where we don't have any ideas for the data type, and we have gone, up until now, with the array data type.

I think I didn't follow about the extra output check. What would that solve? Is the idea that some_func in your example might accidentally make an int64 image? Wouldn't it hit the same warning in that case?

For the np.asarray(img.dataobj) - I agree that the user needs to accept some risk - but I'm really worried about the possibility that they thought they were accepting one risk, and tested against that - but another risk appeared and started breaking their scripts, silently. Here's some discussion of the rule I was thinking of for back-compatibility: https://mail.python.org/archives/list/[email protected]/message/UYARUQM5LBWXIAWBAPNHIQIDRKUUDTEK/

Summary:

Under (virtually) no circumstances should new versions of a scientific package silently give substantially different results for the same function / method call from a previous version of the package.

matthew-brett avatar Feb 10 '22 15:02 matthew-brett

We could also add the ability to do something like:

from nibabel.__future__ import int64_error

which would upgrade the warning to an error immediately.

matthew-brett avatar Feb 10 '22 15:02 matthew-brett

I think I didn't follow about the extra output check. What would that solve?

The issue here is that the on-disk and in-memory formats serve two different purposes. If I'm passing around in-memory images from function to function, I may want to ensure that I have large dtypes to avoid overflow/precision-loss issues, but then still save to some smaller dtype for space/compatibility reasons.

I think the float64 in memory -> float32 at output is the more likely case than the int64 -> int32 one, but I'm trying to think of a consistent interface. In any event, warning that somebody is about to write an unreadable (by everybody but us) file seems reasonable, even if most of the time it can be caught at instantiation.

For the np.asarray(img.dataobj) - I agree that the user needs to accept some risk - but I'm really worried about the possibility that they thought they were accepting one risk, and tested against that - but another risk appeared and started breaking their scripts, silently.

This is a fair point. Here's a process that includes both warnings at instantiation/write:

>>> nb.Nifti1Image(np.array([[[0, 1, 2]]]), np.eye(4))
UserWarning: "Image data has type int64, which may cause incompatibilities with other tools. \
Converting to int32. To silence this warning, pass a dtype parameter to Nifti1Image()."
>>> nb.Nifti1Image(np.array([[[2147483648]]]), np.eye(4))
UserWarning: "Image data has type int64, which may cause incompatibilities with other tools. \
Converting to uint32. To silence this warning, pass a dtype parameter to Nifti1Image()."
>>> nb.Nifti1Image(np.array([[[9223372036854775807]]]), np.eye(4))
UserWarning: "Image data has type int64, which may cause incompatibilities with other tools. \
Data range exceeds 32-bit range. To silence this warning, pass a dtype parameter to Nifti1Image()."
>>> nb.Nifti1Image(np.array([[[0, 1, 2]]]), np.eye(4), dtype='int64').to_filename("/tmp/a.nii")
FutureWarning: "Image data has type int64, which may cause incompatibilities with other tools. \
In the future, this will cause an error unless dtype='int64' is passed. Pass a dtype parameter \
to set the data type and suppress this warning."

We could also add the ability to do something like:

from nibabel.__future__ import int64_error

which would upgrade the warning to an error immediately.

I like this idea.

effigies avatar Feb 10 '22 15:02 effigies

The issue here is that the on-disk and in-memory formats serve two different purposes. If I'm passing around in-memory images from function to function, I may want to ensure that I have large dtypes to avoid overflow/precision-loss issues, but then still save to some smaller dtype for space/compatibility reasons.

I'd expect a function that can accept arbitrary images to choose its working precision - otherwise it's going to run into lots of trouble for int16 etc.

I do see the case for the float32 output though.

But in any case - can we get agreement at least on the warn -> raise for image instantiation + future import combination? Maybe we can make another issue for the output dtype warn -> raise, if that's not completely resolved?

matthew-brett avatar Feb 10 '22 15:02 matthew-brett

But in any case - can we get agreement at least on the warn -> raise for image instantiation + future import combination?

That's fine with me.

Maybe we can make another issue for the output dtype warn -> raise, if that's not completely resolved?

TBH I think I prefer it as an interface to set_data_dtype(), but agree that it can be treated separately.

effigies avatar Feb 10 '22 16:02 effigies

So, then is the idea that nibabel will:

  1. read in a dset and forget the input datatype
  2. work in any precision internally (e.g., int64)
  3. output the dset with some appropriate datatype, independent of the input dset. That is, the datatype from Step 1 is never propagated?

From the discussion above, I only see int32 mentioned---will that be the default type for any integer-valued data? Many of the datasets people have fit the short range---most parcellations, tissue segmentations and ROI maps (if not byte, for masks). Would it be worth checking the to-be-output dset's range for a smaller-footprint datatype (like short or byte)?

mrneont avatar Feb 10 '22 16:02 mrneont