htsjdk
Question: Getting SAMFileHeader from partial files or byte chunks.
Description of the issue:
This is a request or question rather than a bug report: is it possible to create a file header from an arbitrary chunk of bytes (i.e. a partial file being downloaded from another location)? Any comments or suggestions are welcome.
Your environment:
- version of htsjdk: 2.23
- version of java: 11
- which OS: Ubuntu 22.10
Steps to reproduce
When we have large files and only need the sequence dictionary from the header (a SAMFileHeader), is there a way to get it without having the entire file available?
Example:
Suppose we have a 200GB file in a remote location and need to access the file header for some processing. Can I transfer only a range of bytes from my remote location and then use those bytes to get a SAMFileHeader? Using samtools to split the data on the remote server is not possible, since that storage can't execute commands; it can only serve entire files or byte ranges of them.
Expected behaviour
File header is retrieved by providing a partial file.
Actual behaviour
The file header is retrieved after loading the entire file.
Can I transfer only a range of bytes from my remote location and then use those bytes to get a SAMFileHeader?
Yes. Just open the URL as a SamInputResource and open a SamReader on it using https://www.javadoc.io/doc/com.github.samtools/htsjdk/1.132/htsjdk/samtools/SamReaderFactory.html#open(htsjdk.samtools.SamInputResource)
// srf is a SamReaderFactory, is is a SamInputResource (e.g. built from the URL)
try (SamReader sr = srf.open(is)) {
    SAMFileHeader h = sr.getFileHeader();
}
Thanks @lindenb
I forgot to mention one limitation: the remote file is encrypted, so the most efficient option I have is to download a chunk of bytes and decrypt them before trying to read/build the header object (downloading everything is not ideal for large files, as I said). Otherwise, the recommendation would have been perfect.
@gariem well, you'll only download the first bytes with the method above, not the whole BAM.
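For the encrypted case, one possible approach is to decrypt the downloaded prefix and decode the BAM header from it directly, following the header layout in the SAM/BAM specification (magic, header text, then reference names and lengths, all inside BGZF/gzip blocks). The following is only a sketch and not htsjdk API: the class `BamHeaderFromPrefix` and its `Header` holder are hypothetical names, and it assumes the file is BAM (not CRAM), that the chunk starts at byte 0 of the file, that decryption yields the plain BGZF bytes, and that the chunk is long enough to cover the whole header.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of htsjdk.
public class BamHeaderFromPrefix {

    /** Header text plus reference names and lengths, in file order. */
    public static final class Header {
        public final String text;
        public final Map<String, Integer> references = new LinkedHashMap<>();
        Header(String text) { this.text = text; }
    }

    // BAM integers are little-endian; DataInputStream.readInt is big-endian.
    private static int readIntLE(DataInputStream in) throws IOException {
        return Integer.reverseBytes(in.readInt());
    }

    /** Parses the header from an already-decrypted prefix of a BAM file. */
    public static Header parse(byte[] decryptedPrefix) throws IOException {
        // BGZF blocks are concatenated gzip members, which GZIPInputStream can read.
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new ByteArrayInputStream(decryptedPrefix)))) {
            byte[] magic = new byte[4];
            in.readFully(magic);
            if (magic[0] != 'B' || magic[1] != 'A' || magic[2] != 'M' || magic[3] != 1) {
                throw new IOException("Not a BAM file: bad magic");
            }
            byte[] text = new byte[readIntLE(in)];      // l_text, then the SAM header text
            in.readFully(text);
            Header header = new Header(new String(text, StandardCharsets.UTF_8));
            int nRef = readIntLE(in);                   // number of reference sequences
            for (int i = 0; i < nRef; i++) {
                byte[] name = new byte[readIntLE(in)];  // l_name counts the trailing NUL
                in.readFully(name);
                int length = readIntLE(in);             // l_ref
                header.references.put(
                        new String(name, 0, name.length - 1, StandardCharsets.UTF_8), length);
            }
            return header;
        }
    }
}
```

If the prefix does not cover the whole header, `readFully` or the gzip layer should fail with an `EOFException`, which can be taken as a signal to request a larger byte range and retry. If a real `SAMFileHeader` object is needed, wrapping the decrypted bytes in an `InputStream` and passing it to the factory via `SamInputResource.of(InputStream)` should also work, since `getFileHeader()` only needs the header, though I have not tested that against a truncated stream.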