dataflow-java
add support for STRICT intra-shard boundary
When the genomic region we want to examine is divided into shards, we use a STRICT shard boundary so that a record spanning two shards is not returned twice: once at the end of the current shard and again at the beginning of the next shard.
- This works fine when we are working over an entire chromosome.
- However, when we shard only a subset of a chromosome, records that overlap the beginning of the very first shard are filtered out, even though they would not be duplicated in any other shard.
- Sometimes we do want to make use of the records that overlap the beginning of that first shard boundary.
- We need a way to use OVERLAPS for the first shard and STRICT for all subsequent shards.
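
The proposed behavior could be sketched roughly as follows. This is only an illustration, not the dataflow-java API: `ShardBoundarySketch`, `Boundary`, `Shard`, `shardRegion`, and `countAccepting` are hypothetical names, and half-open `[start, end)` coordinates are an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: OVERLAPS for the first shard, STRICT for the rest.
class ShardBoundarySketch {

    // Illustrative stand-in for the two boundary requirements.
    enum Boundary { OVERLAPS, STRICT }

    // A shard covering the half-open interval [start, end).
    static final class Shard {
        final long start;
        final long end;
        final Boundary boundary;

        Shard(long start, long end, Boundary boundary) {
            this.start = start;
            this.end = end;
            this.boundary = boundary;
        }

        // Decide whether a record [recordStart, recordEnd) belongs to this shard.
        boolean accepts(long recordStart, long recordEnd) {
            boolean overlaps = recordStart < end && recordEnd > start;
            if (boundary == Boundary.OVERLAPS) {
                return overlaps;
            }
            // STRICT: the record must also start inside the shard.
            return overlaps && recordStart >= start;
        }
    }

    // Split [start, end) into shards of at most shardSize bases.
    static List<Shard> shardRegion(long start, long end, long shardSize) {
        List<Shard> shards = new ArrayList<>();
        for (long s = start; s < end; s += shardSize) {
            Boundary b = (s == start) ? Boundary.OVERLAPS : Boundary.STRICT;
            shards.add(new Shard(s, Math.min(s + shardSize, end), b));
        }
        return shards;
    }

    // Count how many shards would emit the given record.
    static int countAccepting(List<Shard> shards, long recordStart, long recordEnd) {
        int count = 0;
        for (Shard shard : shards) {
            if (shard.accepts(recordStart, recordEnd)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Shard> shards = shardRegion(1000, 3000, 1000);
        // Overlaps the start of the region: kept by the first shard only.
        System.out.println(countAccepting(shards, 900, 1100));
        // Spans the boundary between shards: still emitted exactly once.
        System.out.println(countAccepting(shards, 1900, 2100));
    }
}
```

With this rule, a record that overlaps the start of the requested region is kept by the first shard, while a record that spans an interior shard boundary is still emitted by exactly one shard.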
Confirm this functionality with a JoinNonVariantSegmentsWithVariants integration test that operates over a small genomic region specified both by normal sharding and by SitesToShards.