seqan3 icon indicating copy to clipboard operation
seqan3 copied to clipboard

What is the desired behavior when writing an empty BAM file?

Open tloka opened this issue 3 years ago • 4 comments

Platform

  • SeqAn version: latest
  • Operating system: Ubuntu
  • Compiler: gcc

Question

Just a quick question:
I am using SeqAn3's BAM output. I need to create a seqan3::sam_file_output object even before I know whether there will be valid alignments to write. Thus, in some cases, it can happen that there is no entry being written.

However, the resulting file is actually invalid (at least for samtools), as the header is only written with the first entry. Thus, while the file exists and has a valid EOF marker, samtools returns an error when trying to view such an empty BAM file:

> samtools view [filename]
[main_samview] fail to read the header from "[filename]".

I am using two different samtools version, the error is actually only showing in the newer version (1.10 vs. 1.07). I didn't test the most recent version, but I guess the error is produced by the new HTSLib header which was introduced in samtools 1.10 and performs additional validity checks for the header (see here; Samtools now uses the new HTSlib header API. As this adds more checks for invalid headers, it is possible that some illegal files will now be rejected when they would have been allowed by earlier versions.)

I am wondering whether this is the expected behavior, or whether the header should also be written for empty files to get a valid file as an output!?

Minimal example:

using sam_fields_t = seqan3::fields<seqan3::field::id>;
using sam_format_type_list_t = seqan3::type_list<seqan3::format_sam, seqan3::format_bam>;
using sam_file_output_t = seqan3::sam_file_output<sam_fields_t, sam_format_type_list_t, std::vector<std::string>>;

std::vector<std::string> names {"SEQ1", "SEQ2"};
std::vector<uint16_t> lengths {2463, 8273};
sam_file_output_t out{"/home/tobias/livetools_test/test.bam", names, lengths};
//out.emplace_back("testseq1");  <<< Only including this line would produce the header in the output file

Writing SAM accordingly also produces an empty file without header.

I am wondering if this is expected behavior, as the resulting file seems to not be valid according to samtools. However, I think it should be possible to create a valid empty file, which would also be a nice indicator that the resulting empty file is not due to some error, but does really not contain any alignments.

tloka avatar Mar 25 '21 11:03 tloka

Hi @tloka,

this is currently not possible. Thank you for this use case. When re-discussing the design, I had the feeling that it would make sense to explicitly allow writing the alignment header.

Something along the lines:

seqan3::sam_file_output fout {...};
seqan3::sam_file_header header{...};
seqan3::sam_file_record record1{...}
seqan3::sam_file_record record1{...};

fout.push_back(header); // write out header
fout.push_back(record1); // write out record1
fout.push_back(record2); // write out record2

// this will throw, because header can only be written once
fout.push_back(header);

It seems to be a bam specification violation on our side, because BAM requires a header.

The bottom-line is that in any case we should write a valid empty header at deconstruction if no records were provided.

marehr avatar Mar 25 '21 11:03 marehr

The bottom-line is that in any case we should write a valid empty header at deconstruction if no records were provided.

I agree.

For the rest, I am actually not sure it would be necessary/intuitive to construct the header separately and specify the push_back() function for this. For me using push_back() for something that can only be pushed at the beginning feels a bit like a misuse.

I would rather think of something like a function seqan3::sam_file_output::flush_header() to explicitly write the header to the output file. You could even handle the case that the header was already written internally instead of throwing an exception when trying to push back after records were already written.

Somehing like this:

// Example 1: Header is implicitely written with the first record.
{
seqan3::sam_file_output fout{"/home/tobias/livetools_test/test.bam", names, lengths};
seqan3::sam_file_record record1{...};
fout.push_back(record1);
}

// Example 2: Header is explicitely flushed. Same result as Example 1.
{
seqan3::sam_file_output fout{"/home/tobias/livetools_test/test.bam", names, lengths};
fout.flush_header();
seqan3::sam_file_record record1{...};
fout.push_back(record1);
}

// Example 3: Empty output file. Valid empty header needs to be written on destruction.
{
seqan3::sam_file_output fout{"/home/tobias/livetools_test/test.bam", names, lengths};
}

// Example4: Empty output file. However, the full header is included as it was explicitely flushed.
{
seqan3::sam_file_output fout{"/home/tobias/livetools_test/test.bam", names, lengths};
fout.flush_header();
}

// Example5: Invalid call of flush_header() does nothing. Same result as in Example 1 and 2
{
seqan3::sam_file_output fout{"/home/tobias/livetools_test/test.bam", names, lengths};
seqan3::sam_file_record record1{...};
fout.push_back(record1);
fout.flush_header(); // no effect; Header is already flushed with the first record.
}

tloka avatar Mar 25 '21 15:03 tloka

(Agree! We still need some way to modify a header that wasn't constructed before.)

marehr avatar Mar 25 '21 18:03 marehr

Hello, can I ask about an update to this? I'd be happy to work on this myself if nobody else is currently doing so. I'd also like "valid" BAM files written when no records are passed.

joshuak94 avatar Dec 01 '21 09:12 joshuak94

Great to see this resolved, really appreciate it!

tloka avatar Nov 02 '22 09:11 tloka