
[Pitch] Apache Avro serialization

Open jspaezp opened this issue 1 year ago • 4 comments

Hi y'all!

I started a (VERY EARLY) prototype that implements serialization to Apache Avro. I think it would be a good alternative to JSON, with more efficient disk usage.

https://github.com/jspaezp/avrospeclib

I am still implementing the schema using pydantic and deriving the Avro schema from it.
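To make the "derive the Avro schema from a model" idea concrete, here is a rough sketch. It uses stdlib dataclasses as a stand-in for pydantic, and the `Peak` and `Spectrum` models, their field names, and the helper functions are hypothetical illustrations, not the actual avrospeclib schema. It only covers a handful of types (primitives, lists, nested records).

```python
import dataclasses
import json
import typing


# Hypothetical models for illustration; not the real avrospeclib schema.
@dataclasses.dataclass
class Peak:
    mz: float
    intensity: float


@dataclasses.dataclass
class Spectrum:
    name: str
    precursor_charge: int
    peaks: list[Peak]


# Python primitive -> Avro primitive type name.
_PRIMITIVES = {str: "string", int: "long", float: "double", bool: "boolean", bytes: "bytes"}


def avro_type(py_type):
    """Map a Python type annotation to an Avro schema fragment."""
    if py_type in _PRIMITIVES:
        return _PRIMITIVES[py_type]
    if typing.get_origin(py_type) is list:
        (item_type,) = typing.get_args(py_type)
        return {"type": "array", "items": avro_type(item_type)}
    if dataclasses.is_dataclass(py_type):
        return record_schema(py_type)
    raise TypeError(f"no Avro mapping for {py_type!r}")


def record_schema(model):
    """Build an Avro record schema (as a dict) from a dataclass."""
    hints = typing.get_type_hints(model)
    return {
        "type": "record",
        "name": model.__name__,
        "fields": [
            {"name": f.name, "type": avro_type(hints[f.name])}
            for f in dataclasses.fields(model)
        ],
    }


schema = record_schema(Spectrum)
print(json.dumps(schema, indent=2))
```

The resulting dict is plain Avro schema JSON, so in principle it could be handed to any Avro library. Optional fields, unions, and defaults are left out here.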

Some disk usage metrics on a reasonably large speclib I have:

    # ~ 50MB  binary speclib file from diann
    #  552M   tmp/speclib_out.tsv
    #  448M   tmp/speclib_out.mzlib.json # using mzspeclib
    #  148M   tests/data/test.mzlib.avro
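For intuition on where the size difference comes from, here is a toy sketch of how Avro's binary encoding lays out a record: field names live in the schema rather than in every record, longs are zigzag varints, and doubles are 8 raw bytes. The record layout (name, precursor charge, list of mz/intensity peaks) and the function names are assumptions made for illustration. This hand-rolled encoder is not what avrospeclib does, and it skips the container file header, sync markers, and compression codecs, so it will not reproduce the numbers above.

```python
import json
import struct


def encode_long(n: int) -> bytes:
    """Avro int/long: zigzag encoding, then a base-128 varint (low bits first)."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps small negatives to small positives
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # set the continuation bit
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Avro string: byte length as a long, followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_double(x: float) -> bytes:
    """Avro double: 8 bytes, little-endian IEEE 754."""
    return struct.pack("<d", x)


def encode_spectrum(spec: dict) -> bytes:
    """Encode a hypothetical spectrum record: fields in schema order, no field names."""
    out = bytearray()
    out += encode_string(spec["name"])
    out += encode_long(spec["precursor_charge"])
    peaks = spec["peaks"]
    if peaks:
        out += encode_long(len(peaks))  # one array block holding all items
        for peak in peaks:
            out += encode_double(peak["mz"])
            out += encode_double(peak["intensity"])
    out += encode_long(0)  # a zero-count block terminates the array
    return bytes(out)


# Made-up example values.
spectrum = {
    "name": "PEPTIDEK/2",
    "precursor_charge": 2,
    "peaks": [
        {"mz": 147.1128, "intensity": 0.42},
        {"mz": 276.1554, "intensity": 1.0},
        {"mz": 391.1823, "intensity": 0.17},
    ],
}

avro_bytes = encode_spectrum(spectrum)
json_bytes = json.dumps(spectrum).encode("utf-8")
print(len(avro_bytes), "bytes as Avro-style binary")
print(len(json_bytes), "bytes as JSON")
```

Most of the saving in this toy case comes from not repeating `"mz"` and `"intensity"` for every peak, which is the part that adds up quickly in a large spectral library.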

Read/write speeds:

    avro write: 4.832904
    avro read: 6.133625
    json write: 6.304285
    json read: 4.992042
    pydantic validation: 19.415933 # Not needed for avro because the schema is enforced on write.

Let me know if there is any interest in adopting it! Best!

jspaezp · Jan 05 '24 05:01