dataflow-java
Google Cloud Dataflow pipelines such as Identity-By-State as well as useful utility classes.
When the genomic region we want to examine is divided into shards, we use a STRICT shard boundary to remove duplicate data that would occur at the end of the...
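The effect of a strict boundary can be sketched in a few lines. The summary above is truncated, so this is only an illustration under one assumption: that STRICT keeps a record only when it starts inside the shard, while a looser overlap rule keeps any record that touches the shard. The `StrictShardSketch` class, `Interval` type, and method names are hypothetical and are not the project's own classes.

```java
import java.util.List;

/** Hypothetical illustration; not the project's shard boundary implementation. */
class StrictShardSketch {

  /** A half-open genomic interval [start, end) on a single reference. */
  static final class Interval {
    final long start;
    final long end;

    Interval(long start, long end) {
      this.start = start;
      this.end = end;
    }
  }

  /** Overlap rule: keep any record that touches the shard. */
  static boolean overlaps(Interval record, Interval shard) {
    return record.start < shard.end && record.end > shard.start;
  }

  /** Assumed strict rule: keep a record only if it starts inside the shard. */
  static boolean startsWithin(Interval record, Interval shard) {
    return record.start >= shard.start && record.start < shard.end;
  }

  /** Counts how many (record, shard) pairs survive the chosen rule. */
  static int countKept(List<Interval> records, List<Interval> shards, boolean strict) {
    int kept = 0;
    for (Interval shard : shards) {
      for (Interval record : records) {
        if (strict ? startsWithin(record, shard) : overlaps(record, shard)) {
          kept++;
        }
      }
    }
    return kept;
  }
}
```

With shards [0, 100) and [100, 200), a record spanning [90, 110) is emitted by both shards under the overlap rule but only by the first shard under the strict rule, so each record is counted once.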
https://github.com/googlegenomics/dataflow-java/blob/master/src/main/java/com/google/cloud/genomics/dataflow/utils/AnnotationUtils.java has no Dataflow dependencies. Move it and any other code with no Dataflow dependencies down into utils-java.
We [manually create data shards](https://github.com/googlegenomics/utils-java/blob/master/src/main/java/com/google/cloud/genomics/utils/ShardUtils.java#L32) right now via `--references` (or `--allReferences`) and `--basesPerShard` from [ShardOptions](https://github.com/googlegenomics/dataflow-java/blob/master/src/main/java/com/google/cloud/genomics/dataflow/utils/ShardOptions.java). Updating to custom sources will allow the data shards to be not only created but...
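For readers unfamiliar with manual sharding, the idea of cutting a region into fixed-size pieces might look roughly like the sketch below. It is not the code in `ShardUtils`; the class name `ManualShardSketch`, the method `shardRange`, and the use of half-open `[start, end)` coordinates are assumptions made for illustration, with the `basesPerShard` parameter standing in for the `--basesPerShard` option.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of fixed-size sharding; not the ShardUtils implementation. */
class ManualShardSketch {

  /**
   * Splits the half-open range [start, end) into consecutive shards of at most
   * basesPerShard bases. Each shard is returned as a {shardStart, shardEnd} pair.
   */
  static List<long[]> shardRange(long start, long end, long basesPerShard) {
    if (basesPerShard <= 0) {
      throw new IllegalArgumentException("basesPerShard must be positive");
    }
    List<long[]> shards = new ArrayList<>();
    for (long shardStart = start; shardStart < end; shardStart += basesPerShard) {
      // The final shard is clipped so it never runs past the requested region.
      long shardEnd = Math.min(shardStart + basesPerShard, end);
      shards.add(new long[] {shardStart, shardEnd});
    }
    return shards;
  }
}
```

Because the shard size is fixed up front, the shards cannot adapt to uneven data density while the pipeline runs, which is one limitation of creating them manually.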
The integration tests currently take 32 minutes to complete. Of that:
- 8 minutes for the tests that run locally on tiny data
- 24 minutes for the tests that...
This class is useful not only for Dataflow pipelines but for genomics tools in general, so it should be moved to utils-java after https://github.com/googlegenomics/dataflow-java/pull/165 is submitted.
**Before:** _Receiving objects: 100% (6743/6743), 121.52 MiB | 210.00 KiB/s, done._

**After:** _Receiving objects: 100% (6421/6421), 36.37 MiB | 210.00 KiB/s, done._

This change has to be force-pushed. Merging does...
When https://github.com/googlegenomics/api-client-java/issues/66 is complete, add a check to VerifyBamId to WARN with a yes/no to proceed if the referenceSets for the reads and the variants do not match.