xbitinfo

Flagging quantised input data formats

Open observingClouds opened this issue 1 year ago • 5 comments

Description

As discussed with @milankl and @Ishaanj18, the bitwise information should be calculated on the 'rounded' bit representation if rounding has been applied. In the case of linear quantisation, for example, the bitinformation should be calculated from the integer representation, and only afterwards should the data be converted to float by applying the scale factor and offset.

A first step to implement this could be to raise a warning if the scale and offset attributes are found in the metadata of the dataset.
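A minimal sketch of such a check, assuming the metadata is passed in as plain dictionaries (for an xarray variable these would be its `encoding` and `attrs`). The function name `warn_if_quantised` and the message text are hypothetical and not part of xbitinfo; `add_offset` is assumed here as the CF-convention name of the offset attribute.

```python
import warnings

# CF-convention keys that indicate linear packing (assumed naming).
PACKING_KEYS = ("scale_factor", "add_offset")


def warn_if_quantised(name, encoding, attrs=None):
    """Warn if packing metadata is present; return the list of keys found.

    Hypothetical helper: `encoding` and `attrs` are plain mappings, e.g.
    `da.encoding` and `da.attrs` of an xarray DataArray.
    """
    attrs = attrs or {}
    found = [key for key in PACKING_KEYS if key in encoding or key in attrs]
    if found:
        warnings.warn(
            f"Variable '{name}' appears to be linearly quantised "
            f"({', '.join(found)} found). The bitinformation should be "
            "calculated on the stored integers, not on the decoded floats.",
            UserWarning,
            stacklevel=2,
        )
    return found
```

Both mappings are checked because, in the xarray example later in this thread, `scale_factor` shows up in the encoding when the data is decoded on opening and in the attributes when it is opened with `mask_and_scale=False`.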

In a second step, the dataset should be reopened without applying the scale and offset parameters, which most libraries do by default.

observingClouds, May 19 '23 14:05

Figure from a talk I gave, showing how the bitwise real information content changes when using different binary encodings for the same data: [image: bigdata 001]

milankl, May 19 '23 14:05

@Ishaanj18 @ayoubft As a benchmark to check that the bitinformation through xbitinfo works well with various formats, I suggest the following:

  1. High precision data: create U(0,1) data of length 10,000 and sort
import numpy as np
x = np.random.rand(10000)
x.sort()
  2. Convert to Float32, Float16 and 16-bit linear packing as the various encodings
  3. Analyse the bitinformation in the respective encodings; it should look like this

image

Float64 has 11 exponent bits, 3 more than Float32, which has 8, which is in turn 3 more than Float16, which has 5. While the information is essentially the same, it is shifted by 3 bit positions from one format to the next.
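To see the shift in the exponent width directly, here is a small sketch using only Python's standard `struct` module. It splits one value into sign, exponent and mantissa bits for each format; the helper name `bit_fields` and the `FORMATS` table are made up for illustration.

```python
import struct

# name -> (struct format, number of exponent bits, number of mantissa bits)
FORMATS = {
    "Float64": (">d", 11, 52),
    "Float32": (">f", 8, 23),
    "Float16": (">e", 5, 10),
}


def bit_fields(value, name):
    """Return (sign, exponent, mantissa) bit strings of `value` in the given format."""
    fmt, n_exp, n_man = FORMATS[name]
    raw = int.from_bytes(struct.pack(fmt, value), "big")
    bits = format(raw, f"0{1 + n_exp + n_man}b")
    return bits[0], bits[1:1 + n_exp], bits[1 + n_exp:]


for name in FORMATS:
    sign, exponent, mantissa = bit_fields(0.5, name)
    print(f"{name}: sign={sign} exponent={exponent} mantissa={mantissa}")
```

The first mantissa bit sits at bit position 13 in Float64, 10 in Float32 and 7 in Float16 (counting from 1), which is the 3-bit shift described above.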

Because we use U(0,1)-distributed data, the first bit of the linear packing splits the data statistically exactly in half, so it carries 1 full bit of information; the information only decreases slowly towards the later bit positions. It is again the same information, but this time shifted to the first bit.
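The claim about the first bit can be checked with a rough standard-library sketch. It packs sorted U(0,1) data linearly into 16 bits and estimates, per bit position, the mutual information between adjacent elements. This is only an illustration of the idea: it is not the BitInformation.jl or xbitinfo implementation and it omits any significance filtering. The function names are hypothetical, and the packing assumes the data is not constant.

```python
import math
import random


def linear_pack16(values):
    """Map floats linearly onto the integers 0..65535 (assumes max > min)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (2**16 - 1)
    return [round((v - lo) / scale) for v in values], scale, lo


def bit_mutual_information(ints, bit, nbits=16):
    """Mutual information (in bits) of one bit position between adjacent elements.

    Bit 0 is the most significant bit.
    """
    shift = nbits - 1 - bit
    bits = [(i >> shift) & 1 for i in ints]
    counts = [[0, 0], [0, 0]]
    for a, b in zip(bits[:-1], bits[1:]):
        counts[a][b] += 1
    n = len(bits) - 1
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = counts[a][b] / n
            p_a = (counts[a][0] + counts[a][1]) / n
            p_b = (counts[0][b] + counts[1][b]) / n
            if p_ab > 0:
                mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


random.seed(0)
x = sorted(random.random() for _ in range(10_000))
packed, scale, offset = linear_pack16(x)
info = [bit_mutual_information(packed, b) for b in range(16)]
for position, value in enumerate(info, start=1):
    print(f"bit {position:2d}: {value:.3f}")
```

With sorted data the first bit is 0 for the lower half and 1 for the upper half, so knowing it for one element almost fully determines it for the next, and the estimate comes out close to 1 bit.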

milankl, May 30 '23 18:05

The figure above was produced with BitInformation.jl using

using BitInformation, LinLogQuantization, PyPlot

x = rand(Float64,10_000)
sort!(x)

f16 = Float16.(x)
f32 = Float32.(x)
u16 = LinQuant16Array(x);

fig,(ax1,ax2,ax3,ax4) = subplots(4,figsize=(8,3))

ax1.imshow((bitinformation(x)[1:32],))
ax1.set_xlim(-0.5,31.5)
ax1.set_yticks([0],["Float64"])                                                          # label y axis
ax1.set_xticks(0:31,vcat("±",[L"e_{%$i}" for i in 1:11],                           # label sign, exponent and mantissa bits
        [L"m_{%$i}" for i in 1:20]),fontsize=7);    

ax2.imshow((bitinformation(f32),))
ax2.set_xlim(-0.5,31.5)
ax2.set_yticks([0],["Float32"])                                                          # label y axis
ax2.set_xticks(0:31,vcat("±",[L"e_%$i" for i in 1:8],                           # label sign, exponent and mantissa bits
        [L"m_{%$i}" for i in 1:23]),fontsize=7);

ax3.imshow((bitinformation(f16),))
ax3.set_xlim(-0.5,32)
ax3.set_yticks([0],["Float16"])
ax3.set_xticks(0:15,vcat("±",[L"e_%$i" for i in 1:5],                           # label sign, exponent and mantissa bits
        [L"m_{%$i}" for i in 1:10]),fontsize=7);

ax4.imshow((bitinformation(u16.A),))
ax4.set_xlim(-0.5,32)
ax4.set_yticks([0],["16bit packing"])
ax4.set_xlabel("bits")                                                          # label x axis
ax4.set_xticks(0:15,1:16,fontsize=7);

tight_layout()

milankl, May 30 '23 18:05

Hi @observingClouds, I would like to work on this issue. I am new here, so could you guide me on how to approach this? :)

AryanBakliwal, Feb 24 '24 18:02

Hi @AryanBakliwal 👋, Xbitinfo relies extensively on xarray. The idea in this issue is to warn the user when the input data has been linearly quantised. This quantisation is typically done by taking all available data and calculating a scale factor and an offset that map the data onto a range of integers. The offset and scale_factor are often saved as metadata in file formats like netCDF and zarr, which are very common in the geosciences. Xarray reads this metadata automatically and applies the scale factor and offset when opening such datasets.

>>> import xarray as xr
>>> ds=xr.tutorial.load_dataset("air_temperature")
>>> ds.air.encoding
{'dtype': dtype('int16'),
 'source': 'air_temperature.nc',
 'original_shape': (2920, 25, 53),
 'scale_factor': 0.01}
>>> ds
<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7   # note the datatype being float32
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

If the scale_factor and offset are detected in the input data, a warning should be raised at a minimum. Further, the data could be reopened while keeping the integer values:

>>> ds=xr.tutorial.load_dataset("air_temperature", mask_and_scale=False)
>>> ds.air.encoding
{'dtype': dtype('int16'),
 'source': 'air_temperature.nc',
 'original_shape': (2920, 25, 53)}
>>> ds.air
...
        ...,
        [29379, 29369, 29509, ..., 29529, 29509, 29469],
        [29609, 29689, 29719, ..., 29569, 29569, 29519],
        [29769, 29809, 29809, ..., 29649, 29619, 29569]]], dtype=int16)  # note the datatype is int16
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]
    scale_factor:  0.01
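The relation between the stored integers and the decoded floats in the two outputs above can be sketched as follows. This assumes the usual linear packing rule, decoded = stored * scale_factor + add_offset, with `add_offset` taken as 0 because the dataset above only carries a `scale_factor`. The helper names `decode` and `encode` are hypothetical.

```python
import math


def decode(raw, scale_factor=1.0, add_offset=0.0):
    """Unpack a stored integer into a physical value (linear packing)."""
    return raw * scale_factor + add_offset


def encode(value, scale_factor=1.0, add_offset=0.0):
    """Pack a physical value into the nearest stored integer."""
    return round((value - add_offset) / scale_factor)


# First three int16 values of one row printed above, scale_factor from the attributes.
raw_row = [29379, 29369, 29509]
kelvin = [decode(r, scale_factor=0.01) for r in raw_row]
print(kelvin)

# Packing the decoded values again recovers the stored integers.
print([encode(k, scale_factor=0.01) for k in kelvin])
```

The bit patterns of `raw_row` are what the bitinformation should be calculated on; the floats in `kelvin` are only a derived view of them.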

I would suggest reading up on quantisation and learning about xbitinfo through the material linked on the README page.

observingClouds, Feb 24 '24 23:02