mzSpecLib
[Pitch] Apache avro serialization
Hi y'all!
I started a (VERY EARLY) prototype that implements serialization to Apache Avro. I think it would be a good alternative to JSON, with more efficient disk usage.
https://github.com/jspaezp/avrospeclib
I am still implementing the schema using pydantic and deriving the Avro schema from it.
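To illustrate the idea of deriving an Avro schema from a Python model, here is a minimal sketch. It uses stdlib dataclasses in place of pydantic so that it runs without dependencies, and it is not how avrospeclib does it. The `Spectrum` model, its field names, and the helpers `avro_type` and `avro_schema` are all hypothetical, and only a tiny subset of types is handled.

```python
import json
from dataclasses import dataclass, fields
from typing import List, get_args, get_origin

# Mapping from Python primitives to Avro primitive type names.
PRIMITIVES = {str: "string", int: "long", float: "double", bool: "boolean"}


def avro_type(py_type):
    """Translate a (very small) subset of Python type hints to Avro types."""
    if py_type in PRIMITIVES:
        return PRIMITIVES[py_type]
    if get_origin(py_type) is list:
        (item_type,) = get_args(py_type)
        return {"type": "array", "items": avro_type(item_type)}
    raise TypeError(f"unsupported type: {py_type!r}")


def avro_schema(model):
    """Build an Avro record schema (as a dict) from a dataclass."""
    return {
        "type": "record",
        "name": model.__name__,
        "fields": [
            {"name": f.name, "type": avro_type(f.type)} for f in fields(model)
        ],
    }


# Hypothetical model; the field names are illustrative, not mzSpecLib's.
@dataclass
class Spectrum:
    name: str
    precursor_mz: float
    charge: int
    mzs: List[float]
    intensities: List[float]


schema = avro_schema(Spectrum)
print(json.dumps(schema, indent=2))
```

The resulting dict is plain JSON, which is the form Avro schemas are written in, so it could be handed to an Avro library as the writer schema.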
Some disk usage metrics on a reasonably large speclib I have:
# ~ 50MB binary speclib file from diann
# 552M tmp/speclib_out.tsv
# 448M tmp/speclib_out.mzlib.json # using mzspeclib
# 148M tests/data/test.mzlib.avro
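One plausible contributor to the smaller file is Avro's binary encoding: integers are written as zig-zag varints and doubles as 8 little-endian bytes, with no field names or punctuation repeated per record. The stdlib-only sketch below hand-rolls that encoding for a small array to compare against its JSON text. The m/z values are made up, the helper names are hypothetical, and this does not claim to explain the exact numbers above.

```python
import json
import struct


def encode_long(n):
    """Avro int/long: zig-zag the sign, then emit 7 bits per byte, low bits first."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set means "more bytes follow"
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_double(x):
    """Avro double: 8 bytes, little-endian IEEE 754."""
    return struct.pack("<d", x)


def encode_double_array(values):
    """Avro array: one block (item count, then items), closed by a zero count."""
    if not values:
        return encode_long(0)
    body = b"".join(encode_double(v) for v in values)
    return encode_long(len(values)) + body + encode_long(0)


# Made-up fragment m/z values, for illustration only.
mzs = [175.118952, 288.203016, 401.287080]

avro_bytes = encode_double_array(mzs)
json_bytes = json.dumps(mzs).encode("utf-8")
print(len(avro_bytes), "bytes as Avro vs", len(json_bytes), "bytes as JSON")
```

The gap grows with record count, since JSON also repeats every key name per object while Avro stores the schema once in the file header.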
Read-write speeds
avro write: 4.832904
avro read: 6.133625
json write: 6.304285
json read: 4.992042
pydantic validation: 19.415933 # Not needed for Avro because it is schema-on-write.
Let me know if there is any interest in adopting it! Best!