gcp-variant-transforms
Write Avro files.
Besides writing to BigQuery, we also need to support serializing variant records into a binary format. This is useful when the variants are needed in contexts other than BigQuery. Avro seems to be a good choice.
I don't know much about this, but what about BCF? https://samtools.github.io/bcftools/bcftools.html
Thanks @smrgit for the note. The main idea here is that when a downstream pipeline needs to use Variant Transforms output, it can either read from the output BigQuery table or use these Avro files directly. One of the reasons (among others) for choosing Avro is that it mimics the BigQuery row format when reading or writing. In other words, a pipeline can easily switch between reading a BigQuery table and reading its equivalent Avro output; the same applies to writing (the main difference being the schema).
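To illustrate the point about the schema being the main difference, here is a minimal sketch of deriving an Avro record schema from a BigQuery-style field list. It is not the code on the branch. The helper names (`bq_field_to_avro`, `bq_schema_to_avro`), the type mapping, and the sample fields are assumptions made for illustration, and only a few BigQuery types are covered.

```python
import json

# Assumed mapping from BigQuery column types to Avro primitive types.
# Only a handful of types are covered in this sketch.
_BQ_TO_AVRO = {
    "STRING": "string",
    "INTEGER": "long",
    "FLOAT": "double",
    "BOOLEAN": "boolean",
}


def bq_field_to_avro(field, parent):
    """Convert one BigQuery-style field dict into an Avro field dict."""
    name = field["name"]
    if field["type"] == "RECORD":
        # Nested records need a unique Avro name, so prefix with the parent.
        record_name = f"{parent}_{name}"
        avro_type = {
            "type": "record",
            "name": record_name,
            "fields": [bq_field_to_avro(f, record_name) for f in field["fields"]],
        }
    else:
        avro_type = _BQ_TO_AVRO[field["type"]]

    mode = field.get("mode", "NULLABLE")
    if mode == "REPEATED":
        return {"name": name, "type": {"type": "array", "items": avro_type}}
    if mode == "NULLABLE":
        # Nullable columns become a union with "null" and default to null.
        return {"name": name, "type": ["null", avro_type], "default": None}
    return {"name": name, "type": avro_type}  # REQUIRED


def bq_schema_to_avro(fields, record_name="Variant"):
    """Build a top-level Avro record schema from a list of field dicts."""
    return {
        "type": "record",
        "name": record_name,
        "fields": [bq_field_to_avro(f, record_name) for f in fields],
    }


# Illustrative subset of a variant table schema (hypothetical, not the full schema).
BQ_FIELDS = [
    {"name": "reference_name", "type": "STRING", "mode": "REQUIRED"},
    {"name": "start_position", "type": "INTEGER", "mode": "NULLABLE"},
    {"name": "names", "type": "STRING", "mode": "REPEATED"},
    {"name": "call", "type": "RECORD", "mode": "REPEATED", "fields": [
        {"name": "name", "type": "STRING", "mode": "NULLABLE"},
        {"name": "genotype", "type": "INTEGER", "mode": "REPEATED"},
    ]},
]

avro_schema = bq_schema_to_avro(BQ_FIELDS)
print(json.dumps(avro_schema, indent=2))
```

With a schema like this, the same row dictionaries could in principle be handed to either a BigQuery sink or an Avro sink, which is what makes switching between the two cheap for downstream pipelines.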
I have a working version on this branch which shows the sink part.