mzSpecLib icon indicating copy to clipboard operation
mzSpecLib copied to clipboard

[Pitch] Apache avro serialization

Open jspaezp opened this issue 1 year ago • 4 comments
trafficstars

Hi y'all!

I started a (VERY EARLY PROTOTYPE) that implements serialization to apache avro. I think it would be a good alternative to json with more efficient disk usage.

https://github.com/jspaezp/avrospeclib

I am still implementing the schema using pydantic and deriving form it the avro schema.

Some disk usage metrics on a reasonably large speclib I have

    # ~ 50MB  binary speclib file from diann
    #  552M   tmp/speclib_out.tsv
    #  448M   tmp/speclib_out.mzlib.json # using mzspeclib
    #  148M   tests/data/test.mzlib.avro

Read-write speeds

avro write: 4.832904
avro read: 6.133625
json write: 6.304285
json read: 4.992042
pydantic validation: 19.415933 # Not needed for avro because schema is on-write.

let me know if there is any interest in adopting it! best!

jspaezp avatar Jan 05 '24 05:01 jspaezp

Also ... I f you could point me out to an example of a json-resialized clusters and spectrum_attribute_sets I would really appreciate it!

jspaezp avatar Jan 05 '24 16:01 jspaezp

other random thoughts...

Questions:

  1. Why is the 'id' key in the dict object if the key is also that same identifier? ('analytes': {'1': {id = '1', ...}}). Reading the specification file, the rationale is that one could query by id but once read the element 'knows' where it came from ... I am not 100% sure if i aggree with the rationale, since it could be an implementation detail of the reader rather than a constrain on the serialization.
  2. Why are cvparam lists not dicts of cvparams? Maybe dict[id:(values|list[values])] or even dict[id:list[values]] is more efficient than list[dict]? (this follows a similar rationale as the ID)
{
  "1": {
    "id": "1",
    "attributes": [
      {
        "accession": "MS:1003169",
        "name": "proforma peptidoform sequence",
        "value": "AAAAAAAAAAAAAAAASAGGK"
      },
      {
        "accession": "MS:1001117",
        "name": "theoretical mass",
        "value": 1554.8114178831797
      },
      {
        "accession": "MS:1000041",
        "name": "charge state",
        "value": 2.0
      }
    ]
  }
}

vs

{
  "1": {
    "id": "1",
    "attributes": {
      "MS:1003169": {
        "name": "proforma peptidoform sequence",
        "value": "AAAAAAAAAAAAAAAASAGGK"
      },
      "MS:1001117": {
        "name": "theoretical mass",
        "value": 1554.8114178831797
      },
      "MS:1000041": {
        "name": "charge state",
        "value": 2.0
      }
    }
  }
}

OR EVEN

.... This would be very efficient in terms of space

{
    "CVParamNames": {
        "MS:1003169": "proforma peptidoform sequence",
        "MS:1001117": "theoretical mass",
        "MS:1000041": "charge state"
    },
... {
  "1": {
    "id": "1",
    "attributes": {
      "MS:1003169": {
        "value": "AAAAAAAAAAAAAAAASAGGK"
      },
      "MS:1001117": {
        "value": 1554.8114178831797
      },
      "MS:1000041": {
        "value": 2.0
      }
    }
  }
}
}

  1. Why are accession ids AND names required? Why not just accession ids? (I can see how names are useful for humans, but I can imagine a 'minized' version of the file that only has accession ids).

  2. Is there any place where there is a suggestion on how to deal with cvparams that encode redundant pieces of information? For example ... the proforma sequence "ASDASD/4" over "ASDASD" + "MS:1000041 = 4"? or is this on a per-use-case basis?.

jspaezp avatar Jan 05 '24 18:01 jspaezp

Hello, I wanted to make sure you knew this was seen, but haven't had a chance to write a full response, will hopefully respond by this Friday.

mobiusklein avatar Jan 16 '24 15:01 mobiusklein

Why is the 'id' key in the dict object if the key is also that same identifier? ('analytes': {'1': {id = '1', ...}}). Reading the specification file, the rationale is that one could query by id but once read the element 'knows' where it came from ... I am not 100% sure if i aggree with the rationale, since it could be an implementation detail of the reader rather than a constrain on the serialization.

The JSON scheme was made for ergonomics of use, not for size, and you're correct, it's that the Analyte might need to know its own identity value when it's being used, e.g. when figuring out which interpretation members correspond to it. The idea that the user will want to be able to find an analyte by its id is more convenient than saving a few bytes and forcing any consumer to write a linear probing procedure. If you are writing a new serialization, you don't need to preserve the pattern as long as the reader puts the id value in the Analyte object.

Why are cvparam lists not dicts of cvparams? Maybe dict[id:(values|list[values])] or even dict[id:list[values]] is more efficient than list[dict]? (this follows a similar rationale as the ID)

The outer-most container isn't a dict so that we can repeat the same term multiple times. The elements themselves are dict objects because it lets us represent all the parts of the CVParam structure conveniently, with name, accession, value, and group ID visible.

Why are accession ids AND names required? Why not just accession ids? (I can see how names are useful for humans, but I can imagine a 'minized' version of the file that only has accession ids).

The name is again kept for human readability. If you were writing a binary format you wouldn't need to store it, just the accession or something like it to be able to get back to the full CVParam at read time.

Is there any place where there is a suggestion on how to deal with cvparams that encode redundant pieces of information? For example ... the proforma sequence "ASDASD/4" over "ASDASD" + "MS:1000041 = 4"? or is this on a per-use-case basis?.

I don't know if this is explicitly called out in the specification, but I think (and may be wrong) that readers should support and recognize reading both separate and compound values like proforma peptidoform ion notation, and it is an implementation detail of the reader which is given precedence. Charge state in particular is complicated because it is part of both analytes and spectra, and means different things in different contexts, especially because of chimeric spectra.

We don't have any example libraries with clusters in them yet, but I converted one of the SpectraST libraries to JSON. It includes several attribute sets. I think there are still problems with the conversion though, so I'll work on this more in the coming week.

fetal_brain_tiny.mzlib.json

mobiusklein avatar Jan 26 '24 03:01 mobiusklein