Salmon auxiliary output
Hi there, I'm wondering how to view the *.gz files (e.g. the fld and bias files) in the "aux_info" folder after running Salmon. It isn't obvious to me how to explore the contents of these files. After unzipping the .gz files, they are still not in a human-readable format. Is another tool required to read them? Any suggestions are appreciated. Guan.
Hi @Guan-Wang-123, did you find a way to view the aux_info files? I am using Salmon and would like to visualize that data.
Best regards, Luka
Hi Luka,
Here is a link to an answer to this question, which I also asked on BioStars: https://www.biostars.org/p/403647/
I haven't been able to try the solution suggested by Ahill on BioStars yet. Please feel free to go ahead.
Guan
I've been meaning to take some time to write parsers for these binary files so that I can create QC visualizations of the positional and GC bias models and fragment length distribution. Subscribing here in case there's a chance to work on this together...
Adding this work to MultiQC might be a natural starting point. I see there's already code there to extract the FLD, but I have to admit I do not use MultiQC myself.
Hi Matt, I think such development would be very helpful for the wider research community. I'd be happy to contribute, although I have to admit that so far I've mainly used Python-based software, R, and Bioconductor packages to work with microarray and RNA-seq data. If you don't mind a novice contributor, I'd be happy to hear more from you on this. Guan.
@mdshw5 it looks like multiqc is parsing the text file salmonOutputDir/libParams/flenDist.txt with that FLD code and not the gzipped binary at salmonOutputDir/aux_info/fld.gz.
I would also like to create parsers for these Salmon auxiliary files, but I haven't found any description of them apart from the Salmon docs, which only state that they are in binary format.
@Guan-Wang-123 I followed the suggestion in the response to your BioStars post and tried encoding the binary fld file as ASCII with the Python base64 module. The encoding did not fail, but the result did not contain integer counts, which is what the file is supposed to contain according to the Salmon docs.
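A note on why base64 probably gives this result: it only re-expresses the same bytes as ASCII characters and never interprets them as numbers. The sketch below illustrates this on made-up data, not on a real Salmon file. The three counts and the little-endian 32-bit layout are assumptions chosen for the example.

```python
import base64
import struct

# Hypothetical stand-in for decompressed file contents:
# three 32-bit integers packed as raw little-endian bytes.
raw = struct.pack('<3i', 5, 0, 12)

# base64 maps the bytes to ASCII text; the numbers stay hidden in it.
encoded = base64.b64encode(raw)

# Interpreting the raw bytes as integers recovers the counts.
counts = struct.unpack('<3i', raw)

print(encoded)
print(counts)
```

Decoding the base64 text just returns the original bytes, so it adds nothing over reading the file directly.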
@mcsimenc I think the documentation is describing a binary representation of 32-bit signed integers, which can be unpacked using the Python struct module. You'd need to know how many integers to expect (1001 by default) and provide this as a format string such as:
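    import gzip
    import struct

    # fld.gz holds 1001 integers by default, one per fragment length bin.
    with gzip.open('salmonOutputDir/aux_info/fld.gz', 'rb') as fld_file:
        # One 'i' (4-byte int) per bin; the byte count must match exactly.
        fld = struct.unpack('i' * 1001, fld_file.read())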
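If the number of bins was changed from the default, the hard-coded 1001 will make the unpack fail. A minimal sketch of a more forgiving reader follows, assuming the file holds nothing but 4-byte signed integers in little-endian byte order (an assumption about the machine that wrote it, not something the docs quoted here confirm). The names read_int32_counts and normalize are hypothetical helpers, not part of Salmon or MultiQC.

```python
import gzip
import struct


def read_int32_counts(path):
    """Read a gzipped file of raw 32-bit signed integers into a tuple."""
    with gzip.open(path, 'rb') as handle:
        data = handle.read()
    # Derive the number of integers from the byte length instead of
    # assuming the default of 1001 bins.
    n_values, remainder = divmod(len(data), 4)
    if remainder:
        raise ValueError(f"{path}: {len(data)} bytes is not a multiple of 4")
    # '<' = little-endian (assumed), 'i' = 4-byte signed integer.
    return struct.unpack(f'<{n_values}i', data)


def normalize(counts):
    """Convert raw counts to fractions that sum to 1 (all zeros if empty)."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]
```

Calling read_int32_counts('salmonOutputDir/aux_info/fld.gz') should then return one count per bin, and normalize turns those counts into a distribution that is easier to plot.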
Thank you, this is exactly what I was missing!
@mdshw5 Let me know if I can help out in any way. The idea of getting more metadata from salmon runs for QC purposes is definitely on my radar.