compression algorithm for floating point and >32bit integer

Open zack-vii opened this issue 7 years ago • 9 comments

The current compression algorithm is optimized for integers of up to 32 bits. For wider integers or floating point values the compression ratio is rather poor. We should implement a shuffle filter that separates the values into chunks of 32 bits, maybe even 16.

Basically, one would cast the data array [1,len] into an array of 16-bit integers of the same memory size [a,len] and transpose it to [len,a], where a is the ratio (data_bits/16).

For large integers the high bits would rarely change, while for the low bits the usual compression ratio is expected. For floating point, HDF5 apparently does a good job.
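The cast-and-transpose step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `shuffle16` and `unshuffle16` and the test signal are made up here, and nothing in it is part of MDSplus.

```python
import zlib
import numpy as np

def shuffle16(values):
    """Split each element into 16-bit words and group equal-position words together."""
    a = values.dtype.itemsize // 2                      # words per element (data_bits / 16)
    words = np.ascontiguousarray(values).view(np.uint16).reshape(len(values), a)
    return np.ascontiguousarray(words.T)                # one row per word position

def unshuffle16(words, dtype):
    """Inverse of shuffle16: restore the original element layout."""
    return np.ascontiguousarray(words.T).view(dtype).reshape(-1)

# Hypothetical test signal: a slowly growing 64-bit counter with small jitter.
rng = np.random.default_rng(0)
data = (10**15 + np.cumsum(rng.integers(900, 1100, 100_000))).astype(np.int64)

shuffled = shuffle16(data)
assert np.array_equal(unshuffle16(shuffled, data.dtype), data)  # the filter is lossless

# Compare zlib level 1 on the plain and on the shuffled byte stream.
plain_size = len(zlib.compress(data.tobytes(), 1))
shuffled_size = len(zlib.compress(shuffled.tobytes(), 1))
print(plain_size, shuffled_size)
```

The filter itself does not compress anything; it only reorders the bytes so that a general-purpose compressor sees the rarely changing high words as long runs.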

zack-vii avatar Feb 22 '18 18:02 zack-vii

It would probably be worthwhile to explore better floating point compression. In theory a large percentage of the data stored is in the form of raw transient recorder counts, which tend to be 8 to 32 bit integers. The floating point compression was indeed better with the original VMS floating point bit layout, but now with IEEE floating point the current compression is poor. It is not yet clear whether there is sufficient use of > 32-bit integers to warrant much work on improved compression for those. There are also better options for compressing images that might be worth investigating. It is imperative that any enhancement to the compression algorithms does not invalidate decompression of existing data stored in existing MDSplus data files. All default compression should continue to be lossless. At one point we explored adding mechanisms to select alternate compression routines via enhanced node characteristic properties, and the representation of compressed data includes the possibility of storing an image and routine to use for decompressing the data, but to my knowledge this feature was never utilized.

tfredian avatar Feb 26 '18 13:02 tfredian

Well, the point of this is that when going to longer discharges you cannot get around segments, which at the moment do not support the raw=expression approach. Hence one would either store floats or lose the 'you don't need to know anything' approach to reading data, e.g. by having a RAW node with the raw data and a scaled node with the expression :RAW*:SLOPE+:OFFSET

compressing int64 is a valuable part of compressing timestamps.

The decompression would work the same way as it does now. The "COMPRESSED_DATA_DESCRIPTOR" contains the image and method of the compression used. Based on the metadata of the data one could select different compression methods, which would decompress correctly by adjusting the image and method fields accordingly. I agree that the automated compression should be lossless, at least for integers. For floating point I would allow a relative error < precision if it improved the compression a lot, but this is open for debate. One could let the user decide whether the data in the node in question is compressed losslessly or not by setting a node_flag.
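One way to picture the "relative error < precision" idea for floats is to zero the low mantissa bits before compressing. The sketch below is an assumption-laden illustration, not a proposal from this thread: `truncate_mantissa`, the choice of 12 kept bits, and the test signal are all hypothetical.

```python
import zlib
import numpy as np

def truncate_mantissa(values, keep_bits):
    """Zero the low (23 - keep_bits) mantissa bits of float32 values."""
    drop = 23 - keep_bits
    mask = np.uint32((0xFFFFFFFF << drop) & 0xFFFFFFFF)
    bits = np.asarray(values, dtype=np.float32).view(np.uint32)
    return (bits & mask).view(np.float32)

# Hypothetical test signal: a positive sine wave with a little noise.
rng = np.random.default_rng(0)
signal = (np.sin(np.linspace(0, 20, 100_000)) + 2.0
          + rng.normal(0, 1e-3, 100_000)).astype(np.float32)

lossy = truncate_mantissa(signal, keep_bits=12)
rel_err = np.max(np.abs(lossy - signal) / np.abs(signal))
assert rel_err < 2.0 ** -12          # error is bounded by the kept precision

# Compressed sizes of the original and the truncated stream (zlib level 1).
print(len(zlib.compress(signal.tobytes(), 1)),
      len(zlib.compress(lossy.tobytes(), 1)))
```

The bound holds for normal (non-denormal) values; because this path is lossy, it would have to be opt-in, for example through the node flag mentioned above.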

Possibly it is also worth looking into support for the (value,raw,dim) triplet for segmented data -> implemented via xnci

zack-vii avatar Feb 26 '18 15:02 zack-vii

After fixing the float compression issue for segments by introducing SegmentScale, I think a valuable candidate for compressing int64 could be zlib with level 1. The input stream could be

[ t[0], t[1]-t[0], t[2]-t[1], ..., t[N-1]-t[N-2] ]

For a uniform clock, e.g. dt =1000ns, this results in a compression ratio of less than 1% (N >>10000). A noisy uniform clock still achieves ratios of better than 20% (N >>10000, sigma = 10ns), 26% @ sigma = 100ns.

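import numpy
import zlib

def test(dt=1000, sigma=0, n=10000, t0=1576543210123456789):
    # Uniform int64 clock: n timestamps starting at t0 with step dt (ns).
    t = numpy.arange(t0, t0 + n * dt, dt, 'int64')
    if sigma > 0:
        # Add Gaussian jitter, truncated to whole nanoseconds.
        t = t + numpy.random.normal(0, sigma, n).astype('int64')
    # Input stream: first timestamp followed by successive differences.
    # tobytes() replaces tostring(), which newer NumPy releases no longer provide.
    ds = t[0].tobytes() + (t[1:] - t[:-1]).tobytes()
    cs = zlib.compress(ds, 1)
    # Compressed size in bytes and as a fraction of the raw 8*n bytes.
    print(len(cs), len(cs) / 8. / n)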
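
The test above only measures the compressed size. A matching decoder, sketched here under the assumption that both sides use native byte order, shows that the delta stream is lossless; the function names `encode_timestamps` and `decode_timestamps` are hypothetical.

```python
import zlib
import numpy as np

def encode_timestamps(t, level=1):
    """Delta-encode int64 timestamps (first value kept absolute), then deflate."""
    t = np.asarray(t, dtype=np.int64)
    deltas = np.diff(t, prepend=np.int64(0))   # [t[0], t[1]-t[0], ...]
    return zlib.compress(deltas.tobytes(), level)

def decode_timestamps(blob):
    """Inverse: inflate, then a running sum rebuilds the absolute times."""
    deltas = np.frombuffer(zlib.decompress(blob), dtype=np.int64)
    return np.cumsum(deltas)

t0, dt, n = 1576543210123456789, 1000, 10000
t = t0 + dt * np.arange(n, dtype=np.int64)
blob = encode_timestamps(t)
assert np.array_equal(decode_timestamps(blob), t)   # exact round trip
print(len(blob), len(blob) / t.nbytes)
```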

zack-vii avatar Dec 16 '19 00:12 zack-vii

Calling zlib as suggested here works very well on all of our data.
This issue was re-raised by @kgerickson, who has lots of floating point values he wants to put in trees in Korea.

@WhoBrokeTheBuild and @santorofer set up tests where zlib is the 'user supplied' compressor/decompressor in the call to MdsCompress. The results are good: a bit slower to compress, about the same or faster to decompress, and in almost all cases a smaller size on disk.

So: How do we control the compression algorithm used ?

  • always use zlib
  • have an environment var to say what to use
    • only want to ask once, or once per tree, or something
  • Default to not compress-on-put
    • add switches to TCL compress
  • Make the choice a DBI
  • Make the choice an NCI

How and when do we decide ?

joshStillerman avatar May 03 '21 19:05 joshStillerman

It would be nice to be able to embed the compression in the model tree somehow. This makes it a lot more transparent to the user and configurable by the admin. Maintaining environment variables consistently is comparatively harder to get right.

zlib is pretty universal, but it would also be a good idea to support other libraries: bz2, xz, etc. I can picture a future where somebody wants to put custom opaque data type X in the tree, and it doesn't compress well with zlib, and it's giant...

There's also the question of compressor settings. Most of them allow you to choose a compression level that gives you a tradeoff between speed and space. Would making this configurable in the model tree be difficult?
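
To get a feel for the speed/space tradeoff between libraries and levels, a quick comparison with the Python standard library could look like the sketch below. The test signal and the selection of levels are arbitrary assumptions, and the timings will vary by machine.

```python
import bz2
import lzma
import math
import struct
import time
import zlib

# Hypothetical test signal: 50,000 float64 samples of a sine wave.
samples = [math.sin(i / 100.0) for i in range(50_000)]
raw = struct.pack(f"<{len(samples)}d", *samples)

# Each entry pairs a compressor at a fixed level with its decompressor.
codecs = {
    "zlib-1": (lambda b: zlib.compress(b, 1), zlib.decompress),
    "zlib-9": (lambda b: zlib.compress(b, 9), zlib.decompress),
    "bz2-9": (lambda b: bz2.compress(b, 9), bz2.decompress),
    "xz-6": (lambda b: lzma.compress(b, preset=6), lzma.decompress),
}

results = {}
for name, (compress, decompress) in codecs.items():
    start = time.perf_counter()
    packed = compress(raw)
    elapsed = time.perf_counter() - start
    assert decompress(packed) == raw           # every codec here is lossless
    results[name] = (len(packed), elapsed)
    print(f"{name:7s} {len(packed):8d} bytes  {elapsed * 1e3:7.2f} ms")
```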

kgerickson avatar May 09 '21 04:05 kgerickson

Well, compressed data already has the algorithm in the header of the data, no? If a node is empty/uncompressed and is to be compressed, it could traverse up until it reaches the top or finds an xnci COMPRESSION specifying the image and algorithm to use?
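
The lookup described above amounts to walking the parent chain until a setting is found. In the sketch below the `Node` class is a hypothetical stand-in, not the MDSplus API, and the fallback name `STANDARD` is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Node:
    """Hypothetical stand-in for a tree node; not the MDSplus API."""
    name: str
    parent: Optional["Node"] = None
    xnci: Dict[str, str] = field(default_factory=dict)

def find_compression(node: Node, default: str = "STANDARD") -> str:
    """Walk towards the top until a COMPRESSION attribute is found."""
    current: Optional[Node] = node
    while current is not None:
        if "COMPRESSION" in current.xnci:
            return current.xnci["COMPRESSION"]
        current = current.parent
    return default

top = Node("TOP")
diag = Node("DIAG", parent=top, xnci={"COMPRESSION": "ZLIB"})
signal = Node("SIGNAL", parent=diag)
other = Node("OTHER", parent=top)

print(find_compression(signal))  # ZLIB, inherited from DIAG
print(find_compression(other))   # STANDARD, nothing set up to TOP
```

Each step up the chain is a potential extra read, which is the cost to weigh against the convenience of inherited settings.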

zack-vii avatar May 09 '21 06:05 zack-vii

I like both your and Keith's suggestions. I would propose that we use one extra bit in the NCI to tell us whether to do this. Otherwise it will incur an extra I/O for every put, or potentially several.

joshStillerman avatar May 09 '21 21:05 joshStillerman

We will use spare2 of the NCI to store the compression ID of the well-known compression algorithms that we ship with MDSplus.

zack-vii avatar May 19 '21 16:05 zack-vii

How about an NCI called COMPRESSION_ALGORITHM (8 bits) and a TCL command:

set node compression=STANDARD (0)
set node compression=ZLIB (1)

and an addition to dir/full
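
An 8-bit method id like the one proposed above maps naturally onto a dispatch table. The sketch below reuses the ids STANDARD (0) and ZLIB (1) from the proposal; everything else is an assumption for illustration: the one-byte prefix framing is not how MDSplus stores the method, and the STANDARD codec is a pass-through placeholder rather than the existing algorithm.

```python
import zlib
from enum import IntEnum

class Compression(IntEnum):
    """Ids taken from the proposal above; they must fit in 8 bits."""
    STANDARD = 0
    ZLIB = 1

def _standard_compress(data: bytes) -> bytes:
    # Placeholder only: the existing MDSplus algorithm is not reproduced here.
    return data

def _standard_decompress(data: bytes) -> bytes:
    return data

CODECS = {
    Compression.STANDARD: (_standard_compress, _standard_decompress),
    Compression.ZLIB: (lambda d: zlib.compress(d, 1), zlib.decompress),
}

def pack(method_id: int, data: bytes) -> bytes:
    """Prefix the payload with its one-byte method id (illustrative framing)."""
    method = Compression(method_id)          # raises ValueError for unknown ids
    compress, _ = CODECS[method]
    return bytes([method]) + compress(data)

def unpack(blob: bytes) -> bytes:
    """Read the method id back from the first byte and dispatch on it."""
    _, decompress = CODECS[Compression(blob[0])]
    return decompress(blob[1:])

payload = b"\x00" * 1000
assert unpack(pack(Compression.ZLIB, payload)) == payload
assert unpack(pack(Compression.STANDARD, payload)) == payload
```

Because the reader dispatches on the stored id, data written with an older method stays readable after new ids are added.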

joshStillerman avatar May 28 '21 13:05 joshStillerman