handle serialization of structs?

Open tshauck opened this issue 2 years ago • 5 comments

Hi -- for my line of work I often have to (de-)serialize bioinformatic file formats into more common formats, and I'm curious if there are any recommendations for how to do that with noodles or if someone else has done it... like in an ideal world I could:

use noodles::fasta;

fn main() {
    let d = fasta::record::Definition::new(String::from("seq1"), None);

    let seq = "ATCG".as_bytes().to_vec();
    let s = fasta::record::Record::new(d, seq);

    // Wishful: assumes `Record` implements `serde::Serialize`.
    let s_json = serde_json::to_string(&s).unwrap();
    println!("{}", s_json);
}

I know it's possible to add serialization to structs in external packages, but it's a non-trivial amount of work, so I thought I'd ask a) whether there is a good path to take, and b) whether there are any thoughts/plans on supporting serialization a la rust-bio.

Thanks!

tshauck, Nov 14 '21 17:11

I'm so glad to not be the only one really interested in this feature, as discussed a few months ago with @zaeleus:

https://github.com/brainstorm/s3-rust-htslib-bam/commit/9e7a2002e3d31ac40c87bdad59a4af371b26518f#commitcomment-48795221

TL;DR: Seems unlikely to see support in Noodles itself, but most probably as an external (BioSerDe) crate?

That being said, I'd also like to hear about how Michael would architect such a third party crate so that it integrates/performs best with Noodles.

brainstorm, Nov 15 '21 03:11

I still think serialization tends to be an application-specific output format, particularly in the two examples given thus far. I'm not even sure if it's viable to generalize, so I'm trying to understand the use-case at the library-level. Following bio::io::bed::Record, is this just wanting struct-level serialization, i.e., a 1-to-1 mapping of the Rust struct fields to the serialization format?

For example, what's the expected (JSON) serialization for a fasta::Record?

{ "name": "sq1 LN:8", "sequence": "ACGT" }
{ "name": "sq1", "description": "LN:8", "sequence": "ACGT" }
{ "definition": { "name": "sq1", "description": "LN:8" }, "sequence": "ACGT" }
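
A minimal sketch of how the three shapes above differ in practice. It assumes a hypothetical `Record` stand-in (not the noodles type) and formats the JSON by hand to stay within the standard library; with Serde, each shape would instead be a different set of derive attributes on an application-side struct.

```rust
/// Hypothetical stand-in for a FASTA record; NOT the noodles type.
struct Record {
    name: String,
    description: Option<String>,
    sequence: Vec<u8>,
}

/// Quotes a string for JSON. Only `\` and `"` are escaped to keep the sketch short.
fn quote(s: &str) -> String {
    format!("\"{}\"", s.replace('\\', "\\\\").replace('"', "\\\""))
}

fn sequence_json(r: &Record) -> String {
    quote(&String::from_utf8_lossy(&r.sequence))
}

fn description_json(r: &Record) -> String {
    r.description
        .as_deref()
        .map(quote)
        .unwrap_or_else(|| "null".to_string())
}

/// Shape 1: the whole definition line becomes a single "name" field.
fn single_field(r: &Record) -> String {
    let line = match &r.description {
        Some(d) => format!("{} {}", r.name, d),
        None => r.name.clone(),
    };
    format!("{{\"name\":{},\"sequence\":{}}}", quote(&line), sequence_json(r))
}

/// Shape 2: name and description are flat, top-level fields.
fn flat(r: &Record) -> String {
    format!(
        "{{\"name\":{},\"description\":{},\"sequence\":{}}}",
        quote(&r.name),
        description_json(r),
        sequence_json(r)
    )
}

/// Shape 3: a 1-to-1 mirror of a struct that nests a definition.
fn nested(r: &Record) -> String {
    format!(
        "{{\"definition\":{{\"name\":{},\"description\":{}}},\"sequence\":{}}}",
        quote(&r.name),
        description_json(r),
        sequence_json(r)
    )
}
```

All three functions read the same record; only the output shape changes, which is the choice a library-level implementation would have to make on behalf of every application.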

How granular does the serialization go for each field and with what vocabulary? E.g., (JSON) serialization possibilities for a sam::record::Cigar:

{ "cigar": "36M4D8S" }
{ "cigar": ["36M", "4D", "8S"] }
{ "cigar": [{ "kind": "M", "len": 36 }, { "kind": "D", "len": 4 }, { "kind": "S", "len": 8 }] }
{ "cigar": [{ "kind": "Match", "len": 36 }, { "kind": "Deletion", "len": 4 }, { "kind": "SoftClip", "len": 8 }] }
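
To make the granularity question concrete, here is a rough sketch that parses a CIGAR string once and renders it at two of the levels listed above. The function names are made up for illustration, and only the three operations from the example (`M`, `D`, `S`) get long names; this is not how noodles represents CIGARs.

```rust
/// Splits a CIGAR string such as "36M4D8S" into (length, op code) pairs.
/// Returns None if an op has no length, a length overflows, or the string
/// ends in the middle of a number. Op codes are not validated here.
fn parse_cigar(s: &str) -> Option<Vec<(u32, char)>> {
    let mut ops = Vec::new();
    let mut len: Option<u32> = None;
    for c in s.chars() {
        if let Some(d) = c.to_digit(10) {
            len = Some(len.unwrap_or(0).checked_mul(10)?.checked_add(d)?);
        } else {
            ops.push((len.take()?, c));
        }
    }
    if len.is_some() { None } else { Some(ops) }
}

/// Long names for the op codes in the example; other codes are left out of this sketch.
fn kind_name(op: char) -> Option<&'static str> {
    match op {
        'M' => Some("Match"),
        'D' => Some("Deletion"),
        'S' => Some("SoftClip"),
        _ => None,
    }
}

/// Granularity 2: one string per operation.
fn as_strings(ops: &[(u32, char)]) -> String {
    let items: Vec<String> = ops.iter().map(|(n, c)| format!("\"{}{}\"", n, c)).collect();
    format!("[{}]", items.join(","))
}

/// Granularity 4: one object per operation, using long kind names.
fn as_objects(ops: &[(u32, char)]) -> Option<String> {
    let mut items = Vec::new();
    for &(n, c) in ops {
        items.push(format!("{{\"kind\":\"{}\",\"len\":{}}}", kind_name(c)?, n));
    }
    Some(format!("[{}]", items.join(",")))
}
```

The parsed form is identical in both cases; the vocabulary (`"M"` versus `"Match"`) and the nesting are decided only at rendering time.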

Do field names use the spec names/values or noodles API names/values? E.g., (JSON) serialization possibilities for a partial bam::Record:

{ "refId": 0, "mapq": 255 }
{ "reference_sequence_id": 0, "mapping_quality": 255 }
{ "reference_sequence_id": 0, "mapping_quality": null }
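
The difference between these three lines is both naming and value interpretation: in the BAM encoding, a MAPQ of 255 means the mapping quality is not available, so an API-level view can turn it into a missing value. A small sketch, with a hypothetical `RawAlignment` struct standing in for a decoded record:

```rust
use std::fmt::Display;

/// Hypothetical raw fields as stored in BAM; NOT the noodles type.
struct RawAlignment {
    ref_id: i32, // spec name: refID
    mapq: u8,    // spec name: mapq; 255 means "not available"
}

/// Renders an optional value as a JSON number or `null`.
fn opt_json<T: Display>(v: Option<T>) -> String {
    v.map(|x| x.to_string()).unwrap_or_else(|| "null".to_string())
}

/// Interprets the 255 sentinel as a missing mapping quality.
fn mapping_quality(raw: u8) -> Option<u8> {
    if raw == 255 { None } else { Some(raw) }
}

/// Spec-style names and raw values, as in the first line above.
fn spec_style(r: &RawAlignment) -> String {
    format!("{{\"refId\":{},\"mapq\":{}}}", r.ref_id, r.mapq)
}

/// API-style names with the sentinel interpreted, as in the third line above.
fn api_style(r: &RawAlignment) -> String {
    format!(
        "{{\"reference_sequence_id\":{},\"mapping_quality\":{}}}",
        r.ref_id,
        opt_json(mapping_quality(r.mapq))
    )
}
```

A consumer reading the spec-style output has to know about the sentinel; a consumer reading the API-style output has to know the longer field names.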

This would cause the most problems with interoperability, as most external applications and libraries don't practice the same discipline. If there were a noodles_bed::Record and it serialized to { "reference_sequence_name": "sq0" } instead of { "chrom": "sq0" } like in bio::io::bed::Record, would that be an issue?

I don't have a good solution to the generalization of this problem. There are a lot of open questions that would have to be discussed before moving forward on a decision. It would be helpful to see more concrete examples to better understand the context.

zaeleus, Nov 18 '21 20:11

I understand the concerns about not having a consensus through committee standards, but for instance Google has interesting protobuf definitions for most of the bioinfo formats that they are ingesting into their systems and it's fairly straightforward to read and grok as-is:

https://github.com/google/nucleus/tree/v0.6.0/nucleus/protos

If fields are defined on a relatively easy to read "internal representation" schema, the final representation on disk is a bit up to the specific application area (database, parquet, .ORA, etc...) and/or particular use case. In short, a flexible/general internal representation can help out in (de)serializing and match the intended needs.

brainstorm, Nov 19 '21 01:11

Alright, let's re-kindle this issue and discussion, since BioSerDe needs it. Also, as @GabrielSimonetto pointed out in his draft PR:

 // In the specific case of Bed, it asks for specifically a bed::Record type.
 //      which is a problem, because it forces our hand to conform
 //      to a specific implementation
 //      whereas being able to receive a generic trait of some sorts
 //      would enable us to pass the IR main struct directly
 //      (maybe we can already rehearse making a noodles PR in that sense
 //          but that would require us to already have the IR definition)

But first, let's address your questions above, Michael:

> (...) is this just wanting struct-level serialization, i.e., a 1-to-1 mapping of the Rust struct fields to the serialization format?

Yes!... with some minor intermediate convenient conversions perhaps, but ideally: yes.

> For example, what's the expected (JSON) serialization for a fasta::Record?

From a simplicity standpoint and picking from your alternatives, I think that { "name": "sq1", "description": "LN:8", "sequence": "ACGT" } avoids unnecessary data nesting and has good enough granularity to access individual fields without post processing. Also the labels/names are human readable and not too long.

> How granular does the serialization go for each field and with what vocabulary? Do field names use the spec names/values or noodles API names/values?

For simplicity's sake, I would:

  1. Choose from your CIGAR alternatives: { "cigar": [{ "kind": "Match", "len": 36 }, { "kind": "Deletion", "len": 4 }, { "kind": "SoftClip", "len": 8 }] }. This is consistent with the previous point: keep reasonable granularity to avoid having to post-process fields later. Output format compression should deal with the file size bloat induced by labels.
  2. Adopt Noodles's (structs?) field naming as a de-facto vocabulary, avoiding any complex ontology work at this stage since it's entirely out of scope.

> If there were a noodles_bed::Record and it serialized to { "reference_sequence_name": "sq0" } instead of { "chrom": "sq0" } like in bio::io::bed::Record, would that be an issue?

I'd prefer chrom, shorter and more readable. When in doubt, choosing short (yet still human-readable) names should be the preferred way, as we name variables in code. So consumers (apps) shouldn't find themselves in that situation, I reckon?

> It would be helpful to see more concrete examples to better understand the context.

Ok, a fairly straightforward use case would be to serialize BED to Parquet so that it can be queried by Presto, a distributed SQL query engine that reads Parquet among other formats that are not hts-spec compliant. Then proceed to SerDe the rest of the bio (*AM/VCF) formats to allow this mode of scalable data exploration on cloud providers or other emerging Rust data science frameworks.
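
As a rough illustration of what the BED-to-columnar step involves, here is a std-only sketch that regroups BED3 rows into per-field columns, which is the arrangement a columnar format such as Parquet stores. The `Bed3` and `Bed3Columns` types and both functions are hypothetical; nothing here writes an actual Parquet file or uses the noodles BED types.

```rust
/// Hypothetical BED3 row (chrom, start, end); NOT the noodles type.
struct Bed3 {
    chrom: String,
    start: u64,
    end: u64,
}

/// Struct-of-arrays layout: one vector per field, all of equal length.
#[derive(Default)]
struct Bed3Columns {
    chrom: Vec<String>,
    start: Vec<u64>,
    end: Vec<u64>,
}

/// Parses one tab-separated BED3 line; extra fields are ignored.
fn parse_bed3(line: &str) -> Option<Bed3> {
    let mut fields = line.split('\t');
    let chrom = fields.next()?.to_string();
    let start = fields.next()?.parse().ok()?;
    let end = fields.next()?.parse().ok()?;
    Some(Bed3 { chrom, start, end })
}

/// Regroups row-oriented records into column vectors.
fn to_columns(rows: impl IntoIterator<Item = Bed3>) -> Bed3Columns {
    let mut cols = Bed3Columns::default();
    for row in rows {
        cols.chrom.push(row.chrom);
        cols.start.push(row.start);
        cols.end.push(row.end);
    }
    cols
}
```

Even in this small case, the column names and types (`chrom` as a string, `start`/`end` as unsigned integers) form a schema that has to be chosen somewhere, either in the library or in the application.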

Ultimately, a public interface for BioSerDe should be as usable as the serde crate: implement the two serialize/deserialize methods on the custom serializer trait and have an output format that contains the same information as the input but in a different arrangement.

I hope the objective, overall idea, and direction are clearer now?

/cc @multimeric @GabrielSimonetto @E-Allie @mmalenic @ohofmann.

brainstorm, Jun 24 '22 05:06

> as @GabrielSimonetto pointed out in his draft PR:

I think this is more of a misunderstanding of the format. BED is deceptively simple and doesn't generalize without a tag (e.g., differentiating between BED3+1 and BED4, etc.). The BED implementation in noodles is perhaps unusual and up for a different discussion.

> Ok, a fairly straightforward use case...

Again, this is why I think complex serialization is better suited to the application, not tied to the library, especially when there is no standard. Your example still requires mapping the Parquet schema to the record representation and vice-versa.

I would really like to see a wrapper and its usage, or an implementation of how BioSerDe makes use of Serde.

zaeleus, Jun 27 '22 21:06