
VCFHeader class should allow multiple "other metadata" lines with the same key to be added

Open droazen opened this issue 9 years ago • 1 comment

The proposed VCF 4.3 spec (https://github.com/samtools/hts-specs/pull/88) adds some clarifications regarding the uniqueness of header lines. Specifically, "structured" lines, whose values are enclosed in "<>", must have an ID attribute that is unique within their type (contig, filter, etc.). The keys of header lines, however, are not required to be unique. The key is the part before the first "=" sign, e.g., "comment" in "##comment=X" or "contig" in "##contig=X".
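To make the key/structured distinction concrete, here is a minimal sketch in plain Java. `HeaderLineParts` is a hypothetical helper written for this issue, not an htsjdk class; it only illustrates "key is everything before the first `=`" and "structured means the value is wrapped in `<>`".

```java
// Hypothetical helper (not part of htsjdk): splits a "##key=value" metadata line.
final class HeaderLineParts {
    final String key;
    final String value;

    private HeaderLineParts(String key, String value) {
        this.key = key;
        this.value = value;
    }

    static HeaderLineParts parse(String line) {
        if (!line.startsWith("##")) {
            throw new IllegalArgumentException("Not a metadata line: " + line);
        }
        // Only the FIRST '=' separates key from value; later ones belong to the value.
        int eq = line.indexOf('=');
        if (eq < 0) {
            throw new IllegalArgumentException("No '=' in: " + line);
        }
        return new HeaderLineParts(line.substring(2, eq), line.substring(eq + 1));
    }

    // A line is treated as "structured" when its value is enclosed in "<>".
    boolean isStructured() {
        return value.length() >= 2 && value.startsWith("<") && value.endsWith(">");
    }
}
```

With this sketch, `##contig=<ID=chr1,length=100>` has key `contig` and is structured, while `##comment=X` has key `comment` and is not.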

The current VCFHeader class in htsjdk is not fully compliant with the above. It does correctly enforce unique IDs within each type of "structured" line, and it allows multiple lines with the same key but different IDs (e.g., multiple "##contig=" lines with different ID values). For unstructured lines with no ID attribute (called "other" lines in the code), however, it incorrectly forbids adding a new line with a duplicate key. This is a consequence of using a Map to store the "other" header lines:

    private final Map<String, VCFHeaderLine> mOtherMetaData = new LinkedHashMap<String, VCFHeaderLine>();

Calling addMetaDataLine() with an "other" header line whose key matches an existing line silently drops the new line, because the method does a map lookup by key before allowing the line to be added.
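The effect can be reproduced without htsjdk. The sketch below is a simplified stand-in for the key-indexed storage, not the actual VCFHeader code: `SimpleHeaderLine` and `KeyedHeaderStore` are hypothetical names, and the boolean return value is added only so the dropped line is visible.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for an unstructured "##key=value" header line.
final class SimpleHeaderLine {
    final String key;
    final String value;

    SimpleHeaderLine(String key, String value) {
        this.key = key;
        this.value = value;
    }
}

// Hypothetical stand-in for storage that indexes "other" lines by key alone.
final class KeyedHeaderStore {
    private final Map<String, SimpleHeaderLine> otherMetaData = new LinkedHashMap<>();

    // Looks the key up first and skips the add when it is already present.
    // Returns false when the line was dropped (the real method gives no such signal).
    boolean add(SimpleHeaderLine line) {
        if (otherMetaData.containsKey(line.key)) {
            return false;
        }
        otherMetaData.put(line.key, line);
        return true;
    }

    SimpleHeaderLine get(String key) {
        return otherMetaData.get(key);
    }

    int size() {
        return otherMetaData.size();
    }
}
```

Adding `##comment=first` and then `##comment=second` to this store leaves a single entry holding `first`; the second line is lost without any error.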

The situation is complicated by the fact that in addition to the mOtherMetaData map (used for lookup operations on "other" header lines), a separate copy of all header lines is also stored in a set, and this set is used when the header is actually written out:

    private final Set<VCFHeaderLine> mMetaData = new LinkedHashSet<VCFHeaderLine>();

So a VCFHeader could have (for example) multiple "##comment=" lines, and these will be correctly read into mMetaData and written out again, but lookup operations via getOtherHeaderLine() will return only a single "##comment=" line, and add operations via addMetaDataLine() will silently fail to add additional "##comment=" lines, since these operations use the map rather than the set.
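A rough model of the mismatch between the two collections, in plain Java. `DualStoreHeader` and its methods are hypothetical, and keeping the first line per key in the map is an assumption of this sketch; the issue only establishes that a single line per key is returned.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of a header that keeps a set for writing and a map for lookup.
final class DualStoreHeader {
    private final Set<String> metaData = new LinkedHashSet<>();             // used when writing
    private final Map<String, String> otherMetaData = new LinkedHashMap<>(); // used for lookup

    // Models reading a line from a file: the set keeps every distinct line,
    // while the map can hold only one value per key (first one wins here, by assumption).
    void load(String key, String value) {
        metaData.add("##" + key + "=" + value);
        otherMetaData.putIfAbsent(key, value);
    }

    // What would be written out: every line, in insertion order.
    List<String> linesToWrite() {
        return new ArrayList<>(metaData);
    }

    // What a key-based lookup can see: at most one value.
    String lookup(String key) {
        return otherMetaData.get(key);
    }
}
```

After loading two `##comment=` lines, `linesToWrite()` returns both while `lookup("comment")` returns only one, which is the inconsistency described above.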

The class should be patched so that:

  1. addMetaDataLine() allows "other metadata" header lines with the same key as existing lines to be added (while still enforcing the requirement of unique ID attributes within each type of "structured" header line)
  2. getOtherHeaderLine() is capable of returning multiple values for a particular key.
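One possible shape for the patched storage, sketched in plain Java rather than against the real class. All names here (`MultiValueHeaderSketch`, `addOtherLine`, `addStructuredLine`, `getOtherLines`) are hypothetical, and rejecting a duplicate ID (rather than replacing the earlier line) is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: "other" lines keep a list per key; structured lines keep unique IDs per type.
final class MultiValueHeaderSketch {
    private final Map<String, List<String>> otherLines = new LinkedHashMap<>();
    private final Map<String, Map<String, String>> structuredLines = new LinkedHashMap<>();

    // Requirement 1: duplicate keys are allowed for unstructured lines.
    void addOtherLine(String key, String value) {
        otherLines.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // IDs must stay unique within a type; returns false if the ID is already taken.
    boolean addStructuredLine(String type, String id, String value) {
        Map<String, String> byId = structuredLines.computeIfAbsent(type, k -> new LinkedHashMap<>());
        return byId.putIfAbsent(id, value) == null;
    }

    // Requirement 2: lookup returns every value stored under the key, in insertion order.
    List<String> getOtherLines(String key) {
        return Collections.unmodifiableList(
                otherLines.getOrDefault(key, Collections.emptyList()));
    }
}
```

In this sketch, two `addOtherLine("comment", ...)` calls are both retained and returned by `getOtherLines("comment")`, while a second `addStructuredLine("contig", "chr1", ...)` is rejected. Whether the real API should return a list or some other collection is left open here.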

Unit tests should also be added to confirm the above behavior.

droazen • Jul 02 '15 19:07

This ticket is for @nenewell (unfortunately GitHub won't let me assign it to you explicitly).

droazen • Jul 02 '15 19:07