VirtualiZarr
problem with numpy type error (not serializable)
I see this error when writing kerchunk references to parquet:
from virtualizarr import open_virtual_dataset
u = 'https://thredds.nci.org.au/thredds/fileServer/gb6/BRAN/BRAN2023/daily/ocean_salt_2024_06.nc'
ds = open_virtual_dataset(u)
ds.virtualize.to_kerchunk('/tmp/test.parquet', format = "parquet")
# Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
# File "/VirtualiZarr/virtualizarr/accessor.py", line 137, in to_kerchunk
# refs = dataset_to_kerchunk_refs(self.ds)
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# File "/VirtualiZarr/virtualizarr/writers/kerchunk.py", line 72, in dataset_to_kerchunk_refs
# ".zattrs": ujson.dumps(attrs),
# ^^^^^^^^^^^^^^^^^^
# TypeError: np.int32(20) is not JSON serializable
## drop the problem numpy attribute
ds.attrs['NumFilesInSet'] = None
## now it works
ds.virtualize.to_kerchunk('/tmp/test.parquet', format = "parquet")
I wonder if this typing in attributes has a general solution? I appreciate this may be a kerchunk topic.
(It takes a few minutes to virtualize from the URL, I'm afraid; it's a 4.3 GB file.)
This is an example where the correct behavior is simply whatever the Kerchunk spec says to do, or whatever the kerchunk library actually does. Clearly throwing an error is wrong, but beyond that it would be helpful to know what the expected Kerchunk-like behavior is.
Are you able to serialize other numpy dtypes? Presumably we must be able to?
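One quick way to probe that question is to try encoding a scalar of each dtype and record which ones fail. This is only a sketch: it uses the standard-library `json` module as a stand-in, whereas the traceback above comes from `ujson`, which may accept or reject a different set of types. The helper name `is_json_serializable` is made up for illustration.

```python
import json

import numpy as np


def is_json_serializable(value):
    """Return True if the stdlib json encoder accepts `value` as-is."""
    try:
        json.dumps(value)
    except TypeError:
        return False
    return True


# One scalar per dtype family; names are only labels for the report.
samples = {
    "int32": np.int32(20),
    "int64": np.int64(20),
    "float32": np.float32(1.5),
    "float64": np.float64(1.5),  # subclasses Python float
    "bool_": np.bool_(True),
    "str_": np.str_("abc"),  # subclasses Python str
}

report = {name: is_json_serializable(value) for name, value in samples.items()}
for name, ok in report.items():
    print(f"{name}: {'ok' if ok else 'TypeError'}")
```

With the stdlib encoder, only the NumPy types that subclass a built-in Python type pass, so the integer attribute in this issue is unlikely to be a one-off.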
This attribute needs to just be coerced to a plain int.
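A minimal sketch of that coercion, applied generally instead of to a single attribute: NumPy scalars become plain Python values via `.item()` and arrays become lists via `.tolist()`. The helper name `to_native` is hypothetical and not part of VirtualiZarr or kerchunk, and the stdlib `json` module stands in for `ujson` here.

```python
import json

import numpy as np


def to_native(value):
    """Recursively convert NumPy scalars and arrays to plain Python objects."""
    if isinstance(value, np.generic):
        # np.int32(20) -> 20, np.float32(1.5) -> 1.5, np.bool_(True) -> True
        return value.item()
    if isinstance(value, np.ndarray):
        return value.tolist()
    if isinstance(value, dict):
        return {key: to_native(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_native(item) for item in value]
    return value


# Attribute dict shaped like the failing one; the "title" entry is made up.
attrs = {"NumFilesInSet": np.int32(20), "title": "BRAN2023"}
clean = to_native(attrs)
encoded = json.dumps(clean)
print(encoded)
```

As a workaround on the user side, something like `ds.attrs = to_native(ds.attrs)` before calling `to_kerchunk` should keep the value instead of replacing it with `None`. I have not run this against the dataset above, and per-variable attrs may need the same treatment.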