xbitinfo
Flagging quantised input data formats
Description
As discussed with @milankl and @Ishaanj18, the bitwise information content should be calculated on the 'rounded' bit representation if rounding has been applied. In the case of linear quantisation, for example, the bitinformation should be calculated from the integer representation and only afterwards converted to float by applying the offset parameter and scale factor.
A first step to implement this could be to raise a warning if the scale and offset attributes are found in the metadata of the dataset.
In a second step, the dataset should be reopened without applying the scale and offset parameters, which most libraries apply by default.
[Figure from a talk showing how the bitwise real information content changes when different binary encodings are used for the same data]
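A minimal sketch of such a check, assuming the dataset was opened with xarray and that the packing attributes end up in the variable's encoding (the helper warn_if_quantised is hypothetical and not part of xbitinfo's current API):
import warnings
import xarray as xr

def warn_if_quantised(ds):
    """Hypothetical helper: warn if a variable carries linear-packing metadata."""
    for name, var in ds.data_vars.items():
        # xarray moves scale_factor/add_offset from the attributes into .encoding when decoding
        packing_keys = {"scale_factor", "add_offset"} & set(var.encoding)
        if packing_keys:
            warnings.warn(
                f"Variable '{name}' appears to be linearly quantised ({sorted(packing_keys)} found); "
                "the bitwise information should be computed on the integer representation, "
                "e.g. by reopening the file with mask_and_scale=False.",
                UserWarning,
            )

ds = xr.tutorial.load_dataset("air_temperature")
warn_if_quantised(ds)  # warns, because 'air' carries a scale_factor in its encoding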
@Ishaanj18 @ayoubft As a benchmark to check that the bitinformation through xbitinfo works well with various formats, I suggest the following:
- High precision data: create U(0,1) data of length 10,000 and sort
x = np.random.rand(10000)
x.sort()
- Convert to Float32, Float16 and 16-bit linear packing to obtain different encodings of the same data
- Analyse the bitinformation in the respective encodings; it should look like this (a rough Python sketch using xbitinfo follows after the Julia snippet below).
Float64 has 11 exponent bits, 3 more than Float32 (which has 8), which in turn has 3 more than Float16 (which has 5). While the information is essentially the same, it is shifted by 3 bit positions between neighbouring formats.
Because we use U(0,1)-distributed data, the first bit of the linear packing splits the data statistically exactly in half, so it carries one full bit of information; only for higher bits does the information slowly decrease. It is again the same information, but this time shifted towards the first bit.
For reference, the expected result can be produced with BitInformation.jl:
using BitInformation, LinLogQuantization, PyPlot
x = rand(Float64,10_000)
sort!(x)
f16 = Float16.(x)
f32 = Float32.(x)
u16 = LinQuant16Array(x);
fig,(ax1,ax2,ax3,ax4) = subplots(4,figsize=(8,3))
ax1.imshow((bitinformation(x)[1:32],))
ax1.set_xlim(-0.5,31.5)
ax1.set_yticks([0],["Float64"]) # label y axis with the format name
ax1.set_xticks(0:31,vcat("±",[L"e_{%$i}" for i in 1:11], # label sign, exponent and mantissa bits
[L"m_{%$i}" for i in 1:20]),fontsize=7);
ax2.imshow((bitinformation(f32),))
ax2.set_xlim(-0.5,31.5)
ax2.set_yticks([0],["Float32"]) # label y axis with the format name
ax2.set_xticks(0:31,vcat("±",[L"e_%$i" for i in 1:8], # label sign, exponent and mantissa bits
[L"m_{%$i}" for i in 1:23]),fontsize=7);
ax3.imshow((bitinformation(f16),))
ax3.set_xlim(-0.5,32)
ax3.set_yticks([0],["Float16"])
ax3.set_xticks(0:15,vcat("±",[L"e_%$i" for i in 1:5], # label sign, exponent and mantissa bits
[L"m_{%$i}" for i in 1:10]),fontsize=7);
ax4.imshow((bitinformation(u16.A),))
ax4.set_xlim(-0.5,32)
ax4.set_yticks([0],["16bit packing"])
ax4.set_xlabel("bits") # label x axis
ax4.set_xticks(0:15,1:16,fontsize=7);
tight_layout()
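A rough Python counterpart of this benchmark using xbitinfo could look like the sketch below; the 16-bit linear packing is done by hand here, and whether all of these dtypes (Float16, unsigned integers) are supported end-to-end by get_bitinformation is an assumption to verify:
import numpy as np
import xarray as xr
import xbitinfo as xb

# U(0,1) data of length 10,000, sorted as suggested above
x = np.sort(np.random.rand(10_000))

# manual 16-bit linear packing: map [min, max] onto the full uint16 range
offset = x.min()
scale = (x.max() - x.min()) / (2**16 - 1)
packed = np.round((x - offset) / scale).astype(np.uint16)

ds = xr.Dataset(
    {
        "float64": ("x", x),
        "float32": ("x", x.astype(np.float32)),
        "float16": ("x", x.astype(np.float16)),
        "linquant16": ("x", packed),
    }
)

# bitwise real information content per bit position, along the 'x' dimension
bitinfo = xb.get_bitinformation(ds, dim="x")
print(bitinfo)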
Hi @observingClouds, I would like to work on this issue. I am new here, so could you guide me on how to approach it? :)
Hi @AryanBakliwal 👋,
Xbitinfo relies extensively on xarray. The idea of this issue is to warn the user when the input data has been linearly quantised. This quantisation is typically done by taking all available data and calculating a scale and offset factor that map the data onto a range of integers. The offset and scale_factor are often saved as metadata in file formats like netCDF and zarr, which are very common in the geosciences. Xarray reads this metadata automatically and applies these factors/offsets when opening such datasets.
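In other words, the values stored on disk are integers and xarray converts them back to floats on the fly. Schematically (a sketch of the CF-style packing convention, not the exact code xarray runs):
import numpy as np

n = 16                                           # pack into 16-bit integers
data = np.random.uniform(240.0, 310.0, 1000)     # some float field, e.g. temperatures in K

scale_factor = (data.max() - data.min()) / (2**n - 1)
add_offset = data.min() + 2**(n - 1) * scale_factor

packed = np.round((data - add_offset) / scale_factor).astype(np.int16)  # what is stored on disk
decoded = packed * scale_factor + add_offset                            # what xarray returns by default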
>>> import xarray as xr
>>> ds=xr.tutorial.load_dataset("air_temperature")
>>> ds.air.encoding
{'dtype': dtype('int16'),
'source': 'air_temperature.nc',
'original_shape': (2920, 25, 53),
'scale_factor': 0.01}
>>> ds
<xarray.Dataset>
Dimensions: (lat: 25, time: 2920, lon: 53)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7 # note the datatype is float32
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
If the scale_factor and offset are detected in the input data, at minimum a warning should be raised. Furthermore, the data could be reopened keeping the integer values:
>>> ds=xr.tutorial.load_dataset("air_temperature", mask_and_scale=False)
>>> ds.air.encoding
{'dtype': dtype('int16'),
'source': 'air_temperature.nc',
'original_shape': (2920, 25, 53)}
>>> ds.air
...
...,
[29379, 29369, 29509, ..., 29529, 29509, 29469],
[29609, 29689, 29719, ..., 29569, 29569, 29519],
[29769, 29809, 29809, ..., 29649, 29619, 29569]]], dtype=int16). # note the datatype is int16
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
long_name: 4xDaily Air temperature at sigma level 995
units: degK
precision: 2
GRIB_id: 11
GRIB_name: TMP
var_desc: Air temperature
dataset: NMC Reanalysis
level_desc: Surface
statistic: Individual Obs
parent_stat: Other
actual_range: [185.16 322.1 ]
scale_factor: 0.01
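With the integers preserved, the bitwise information content could then be computed directly on the int16 representation, for example as sketched below (whether get_bitinformation handles integer dtypes end-to-end should be verified):
import xarray as xr
import xbitinfo as xb

ds = xr.tutorial.load_dataset("air_temperature", mask_and_scale=False)  # keep the int16 values
bitinfo = xb.get_bitinformation(ds, dim="lon")  # information per bit of the packed integers
# the scale_factor kept in ds.air.attrs can be applied afterwards to return to physical units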
You should probably look into explanations of quantisation and learn more about xbitinfo by reading/viewing the material linked on the README page.