
Write Avro files.

Open bashir2 opened this issue 5 years ago • 2 comments

Besides writing to BigQuery, we also need to support serializing variant records into a binary format. This is useful when the variants are needed in contexts other than BigQuery. Avro seems to be a good choice.

bashir2 avatar Oct 30 '18 00:10 bashir2

I don't know much about this, but what about BCF? https://samtools.github.io/bcftools/bcftools.html

smrgit avatar Oct 30 '18 00:10 smrgit

Thanks @smrgit for the note. The main idea here is that when a downstream pipeline needs to use Variant Transforms output, it can either read from the output BigQuery table or use these Avro files directly. One reason for choosing Avro, among others, is that it mimics the BigQuery row format when reading or writing. In other words, a pipeline can easily switch between reading a BigQuery table and reading its equivalent Avro output; the same applies to writing (the main difference being the schema).

I have a working version on this branch which shows the sink part.
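
To make the schema point above concrete, here is a minimal sketch of how a BigQuery-style table schema might be translated into an Avro record schema. It is not the code on the branch: the helper names (`bq_field_to_avro`, `bq_schema_to_avro`), the type mapping, and the three example fields are assumptions chosen for illustration, and only a handful of BigQuery types are covered.

```python
import json

# Assumed mapping from BigQuery column types to Avro primitive types
# (illustrative; only a few types are covered).
_BQ_TO_AVRO = {
    "STRING": "string",
    "INTEGER": "long",
    "FLOAT": "double",
    "BOOLEAN": "boolean",
    "BYTES": "bytes",
}


def bq_field_to_avro(field, namespace="variant"):
    """Convert one BigQuery-style field description into an Avro field (hypothetical helper)."""
    if field["type"] == "RECORD":
        # Nested records become named Avro records; the namespace keeps names unique.
        avro_type = {
            "type": "record",
            "name": field["name"],
            "namespace": namespace,
            "fields": [
                bq_field_to_avro(sub, namespace + "." + field["name"])
                for sub in field["fields"]
            ],
        }
    else:
        avro_type = _BQ_TO_AVRO[field["type"]]

    mode = field.get("mode", "NULLABLE")
    if mode == "REPEATED":
        # REPEATED columns map to Avro arrays.
        avro_type = {"type": "array", "items": avro_type}
    elif mode == "NULLABLE":
        # NULLABLE columns map to a union with "null".
        avro_type = ["null", avro_type]
    return {"name": field["name"], "type": avro_type}


def bq_schema_to_avro(fields, record_name="Variant"):
    """Wrap the converted fields in a top-level Avro record (hypothetical helper)."""
    return {
        "type": "record",
        "name": record_name,
        "fields": [bq_field_to_avro(f) for f in fields],
    }


# Illustrative subset of a variant table schema; the field names are assumptions.
EXAMPLE_BQ_SCHEMA = [
    {"name": "reference_name", "type": "STRING", "mode": "NULLABLE"},
    {"name": "start_position", "type": "INTEGER", "mode": "NULLABLE"},
    {
        "name": "alternate_bases",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [{"name": "alt", "type": "STRING", "mode": "NULLABLE"}],
    },
]

avro_schema = bq_schema_to_avro(EXAMPLE_BQ_SCHEMA)

if __name__ == "__main__":
    print(json.dumps(avro_schema, indent=2))
```

Since the row dictionaries themselves stay the same shape, a schema like this could then be handed to an Avro writer (for example Apache Beam's `WriteToAvro`); that step is not shown here.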

bashir2 avatar Nov 03 '18 05:11 bashir2