dataflow-java

add support for STRICT intra-shard boundary

Open deflaux opened this issue 9 years ago • 0 comments

When the genomic region we want to examine is divided into shards, we use a STRICT shard boundary to remove duplicate data that would occur at the end of the current shard and also at the beginning of the next shard.

  • This works fine when we are working over an entire chromosome.
  • But when we shard only a subset of a chromosome, we filter out the records at the beginning of the very first shard, even though they would not be duplicated in any other shard.
    • Sometimes we do want to make use of those records that overlap the beginning of the shard boundary.
    • We need a way to use OVERLAPS for the first shard and STRICT for all subsequent shards.
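
A minimal sketch of the requested behavior, to make the OVERLAPS-then-STRICT idea concrete. Everything here is hypothetical and self-contained: the `Requirement` enum, the `Interval` class, and the helper methods are stand-ins written for illustration, not the library's actual types or signatures. It also assumes half-open intervals and that STRICT means "the record starts inside the shard".

```java
import java.util.ArrayList;
import java.util.List;

class ShardBoundarySketch {
    // Hypothetical stand-in for the boundary requirement discussed above.
    enum Requirement { STRICT, OVERLAPS }

    // Half-open genomic interval [start, end); used for both shards and records.
    static final class Interval {
        final long start;
        final long end;
        Interval(long start, long end) { this.start = start; this.end = end; }
    }

    // Split [start, end) into consecutive shards of at most `size` bases.
    static List<Interval> shard(long start, long end, long size) {
        List<Interval> shards = new ArrayList<>();
        for (long s = start; s < end; s += size) {
            shards.add(new Interval(s, Math.min(s + size, end)));
        }
        return shards;
    }

    // Proposed policy: OVERLAPS for the first shard, STRICT for all later ones.
    static Requirement requirementFor(int shardIndex) {
        return shardIndex == 0 ? Requirement.OVERLAPS : Requirement.STRICT;
    }

    // Assumed semantics: OVERLAPS keeps any record touching the shard;
    // STRICT keeps only records whose start lies inside the shard.
    static boolean keep(Interval record, Interval shard, Requirement req) {
        boolean overlaps = record.start < shard.end && record.end > shard.start;
        if (req == Requirement.OVERLAPS) {
            return overlaps;
        }
        return record.start >= shard.start && record.start < shard.end;
    }

    // Number of shards that would emit the record under the mixed policy.
    static int countShardsKeeping(Interval record, List<Interval> shards) {
        int count = 0;
        for (int i = 0; i < shards.size(); i++) {
            if (keep(record, shards.get(i), requirementFor(i))) {
                count++;
            }
        }
        return count;
    }
}
```

With this policy, a record that begins before the requested region but overlaps its start is kept by the first shard, while a record that straddles an interior shard boundary is still emitted only by the shard it starts in.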

Confirm this functionality with a JoinNonVariantSegmentsWithVariants integration test that operates over a small genomic region specified by both normal sharding and SitesToShards.

deflaux • Jun 14 '16 20:06