Can a callset be in multiple variant sets?

Open david4096 opened this issue 8 years ago • 6 comments

Callsets can be a member of multiple variant sets according to the schema, yet the reference server is currently underspecified for this case. Is there an example of when a callset is in multiple variant sets?

  /** The IDs of the variant sets this call set has calls in. */
  array<string> variantSetIds = [];

https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/variants.avdl#L90

SearchCallSetsRequest requires a single variant set ID to be specified, making the above semantics even more strange. If a callset can be a member of multiple variant sets, why do we specify a single variant set ID when performing search?

record SearchCallSetsRequest {
  /**
  The VariantSet to search.
  */
  string variantSetId;

https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/variantmethods.avdl#L158

david4096 avatar Mar 21 '16 16:03 david4096

@david4096 Logically, a CallSet should refer to only one VariantSet, since it can be thought of as an ordered list whose length equals the number of variants described. The VariantSets would therefore have to be identical for a CallSet to refer to more than one of them.

mbaudis avatar Mar 21 '16 16:03 mbaudis

Our group has this use case. Without starting a conversation on what the correct definition of a VariantSet is/should be, the CallSet variantSetId list makes sense if you follow exactly what the definitions suggest:

VariantSet definition:

A VariantSet is a collection of variants and variant calls intended to be analyzed together.

CallSet definition:

A CallSet is a collection of calls that were generated by the same analysis of the same sample.

Use case: compare CallSetX to the other CallSets belonging to VariantSetA, and compare CallSetX to the other CallSets belonging to VariantSetB, but do not compare all CallSets from both VariantSetA and VariantSetB, since by definition VariantSetA and VariantSetB are not meant to be analyzed together. The CallSet can belong to both VariantSetA and VariantSetB so that it does not have to be duplicated.

By this reasoning, I would say that restricting a CallSet to a single VariantSet in the reference server is a bug.
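
The use case above can be sketched roughly as follows. This is a minimal illustration in Python, not the reference server's implementation: the `CallSet` class and the `search_call_sets` helper are hypothetical, and only the idea of a list of variant set IDs on the call set comes from the schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CallSet:
    # Hypothetical stand-in for the schema's CallSet record.
    id: str
    # Mirrors the schema's `variantSetIds` list: one call set, many variant sets.
    variant_set_ids: List[str] = field(default_factory=list)


def search_call_sets(call_sets: List[CallSet], variant_set_id: str) -> List[CallSet]:
    """Return the call sets that list `variant_set_id` among their variant sets."""
    return [cs for cs in call_sets if variant_set_id in cs.variant_set_ids]


call_sets = [
    CallSet("CallSetX", ["VariantSetA", "VariantSetB"]),  # stored once, shared
    CallSet("CallSetA1", ["VariantSetA"]),
    CallSet("CallSetB1", ["VariantSetB"]),
]

# Searching with a single variant set ID still works with multiple membership.
in_a = [cs.id for cs in search_call_sets(call_sets, "VariantSetA")]
in_b = [cs.id for cs in search_call_sets(call_sets, "VariantSetB")]
```

Here CallSetX appears in both result lists without being duplicated, while CallSetA1 and CallSetB1 are never returned together.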

jacmarjorie avatar Mar 21 '16 17:03 jacmarjorie

Also,

record SearchCallSetsRequest {
  /**
  The VariantSet to search.
  */
  string variantSetId;

If the above is the agreed-upon definition, then variantSetId in SearchCallSetsRequest should be a variantSetIds list, not a string.

jacmarjorie avatar Mar 21 '16 17:03 jacmarjorie

@jacmarjorie

I believe your use case is what was imagined. However, there was a multi-month discussion, never resolved, on whether this should be supported:

https://github.com/ga4gh/schemas/pull/395

We would love contributions to the documentation on variants, including documenting use cases justifying the design:

https://github.com/ga4gh/schemas/issues/408 https://github.com/ga4gh/schemas/issues/379

The variants API is suffering because no one with a deep understanding of variants and VCF analysis has taken ownership of finishing the work.

Mark


diekhans avatar Mar 22 '16 01:03 diekhans

Related to https://github.com/ga4gh/schemas/pull/395 and https://github.com/ga4gh/schemas/pull/412

diekhans avatar Apr 05 '16 06:04 diekhans

https://github.com/ga4gh/ga4gh-schemas/blob/master/src/main/proto/ga4gh/variants.proto#L75

Callsets are still allowed to be in multiple variant sets. We should remove this. The biosample ID tag on callsets is what allows you to compare calls in multiple variant sets.
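
A rough sketch of the alternative, assuming each call set belongs to exactly one variant set and carries a biosample ID. The `CallSet` class, the `biosample_id` field name, and the `group_by_biosample` helper are assumptions for illustration, not the schema's or the reference server's actual API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CallSet:
    # Hypothetical stand-in for the schema's CallSet message.
    id: str
    variant_set_id: str  # exactly one variant set per call set (proposed restriction)
    biosample_id: str    # assumed name for the biosample ID tag


def group_by_biosample(call_sets: List[CallSet]) -> Dict[str, List[CallSet]]:
    """Group call sets by biosample so calls can be compared across variant sets."""
    groups: Dict[str, List[CallSet]] = defaultdict(list)
    for cs in call_sets:
        groups[cs.biosample_id].append(cs)
    return dict(groups)


call_sets = [
    CallSet("cs-a-1", "VariantSetA", "sample-1"),
    CallSet("cs-b-1", "VariantSetB", "sample-1"),
    CallSet("cs-a-2", "VariantSetA", "sample-2"),
]

groups = group_by_biosample(call_sets)
# Variant sets in which sample-1 has calls, found without multi-membership.
linked = sorted(cs.variant_set_id for cs in groups["sample-1"])
```

The trade-off is that the same sample's calls are stored as separate call sets, one per variant set, and the biosample ID is what ties them together.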

david4096 avatar Mar 10 '17 00:03 david4096