
Salmon auxiliary output

Open Guan-Wang-123 opened this issue 5 years ago • 9 comments

Hi there, I'm wondering how to view the *.gz files (e.g. the fld and bias files) in the "aux_info" folder after running Salmon. It isn't obvious to me how to explore the contents of these files: after unzipping the .gz files, they are still not in a human-readable format. Is another tool required to read them? Any suggestions are appreciated. Guan.

Guan-Wang-123 avatar Oct 27 '19 17:10 Guan-Wang-123

Hi @Guan-Wang-123 did you find a way to view aux_info? I am using Salmon and would like to visualize that data.

Best regards, Luka

ljudevitluka avatar Apr 26 '20 15:04 ljudevitluka

Hi Luka,

I'm pasting a link here that contains the answer to this question, which I also asked previously on BioStars: https://www.biostars.org/p/403647/

I've not been able to try out the solution as suggested by Ahill on BioStars. Please feel free to go ahead.

Guan

Guan-Wang-123 avatar Apr 27 '20 21:04 Guan-Wang-123

I've been meaning to take some time to write parsers for these binary files so that I can create QC visualizations of the positional and GC bias models and fragment length distribution. Subscribing here in case there's a chance to work on this together...

mdshw5 avatar May 04 '20 15:05 mdshw5

Adding this work to multiqc might be a natural starting point. I see there's also code to extract the FLD there right now, but have to admit I do not use multiqc myself.

mdshw5 avatar May 04 '20 15:05 mdshw5

Hi Matt, I think such development will be very helpful for the wider research community. I'd be happy to contribute, although have to admit that I've mainly been using python-based software, R, and bioconductor packages so far to deal with microarray and RNAseq data. If you don't mind a novice contributor, I'd be happy to hear from you on this further. Guan.

Guan-Wang-123 avatar May 15 '20 22:05 Guan-Wang-123

@mdshw5 it looks like multiqc is parsing the text file salmonOutputDir/libParams/flenDist.txt with that FLD code and not the gzipped binary at salmonOutputDir/aux_info/fld.gz.
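
For comparison, a plain-text distribution needs no binary unpacking. A minimal sketch, assuming `flenDist.txt` holds whitespace-separated numeric values whose position stands for the fragment length (an assumption about the layout, not something stated in this thread); `parse_flen_dist` is a hypothetical helper name.

```python
def parse_flen_dist(text):
    """Map position (assumed to be fragment length) to the value at that position.

    Assumes the text is nothing but whitespace-separated numbers.
    """
    return {length: float(value) for length, value in enumerate(text.split())}


# Made-up content for illustration, not taken from a real Salmon run.
example = parse_flen_dist("0.0 0.25\t0.75\n")
print(example)
```

To use it on a real run, the file contents could be passed in with something like `parse_flen_dist(open(path).read())`.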

I would also like to create parsers for these salmon auxiliary files, but I haven't found any description of them except in the salmon docs, which only state that they are in binary format.

@Guan-Wang-123 I followed the suggestion from the response to your biostars post and tried to encode the binary fld file as ASCII using the Python module base64. The encoding did not fail, but the result did not contain integer counts, which is what the file is supposed to contain according to the salmon docs.
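
A small illustration of why base64 gives this result, using made-up counts rather than a real Salmon file: base64 only re-spells the same bytes as printable characters, while `struct.unpack` interprets them as integers. The 32-bit little-endian layout is an assumption made for the sake of the example.

```python
import base64
import struct

# Hypothetical counts, packed the way a binary count file might store them
# (assumed here: 32-bit little-endian integers).
counts = (0, 5, 12, 7)
raw = struct.pack("<4i", *counts)

# base64 turns the bytes into printable text but does not interpret them.
as_text = base64.b64encode(raw).decode("ascii")

# struct.unpack reads each group of 4 bytes as one integer.
recovered = struct.unpack("<4i", raw)

print(as_text)
print(recovered)
```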

mcsimenc avatar Jul 16 '20 21:07 mcsimenc

@mcsimenc I think the documentation is describing a binary representation of 32-bit signed integers that can be unpacked using the Python struct module. You'd need to know how many integers to expect (1001 by default) and provide this as a format string such as:

import gzip
import struct

# 'i' is a 32-bit integer in struct notation; 1001 values are expected by default
with gzip.open('salmonOutputDir/aux_info/fld.gz', 'rb') as fld_file:
    fld = struct.unpack('i' * 1001, fld_file.read())
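
A slightly more defensive variant of the snippet above, offered as a sketch: it derives the number of values from the decompressed size instead of hard-coding 1001, and fails loudly if the size is not a multiple of 4 bytes. The names `read_fld` and `mean_length` are hypothetical, and the little-endian byte order (`<`) is an assumption, not something confirmed by the Salmon docs.

```python
import gzip
import struct


def read_fld(path):
    """Return the counts stored in a gzipped file of packed 32-bit integers.

    Assumptions: values are little-endian and the file holds only the counts.
    """
    with gzip.open(path, "rb") as handle:
        payload = handle.read()
    if len(payload) % 4 != 0:
        raise ValueError(f"{path}: {len(payload)} bytes is not a multiple of 4")
    n_values = len(payload) // 4
    return struct.unpack(f"<{n_values}i", payload)


def mean_length(counts):
    """Mean fragment length, treating each index as the fragment length."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no observations in distribution")
    return sum(length * count for length, count in enumerate(counts)) / total
```

With a real run this might be called as `mean_length(read_fld('salmonOutputDir/aux_info/fld.gz'))`; checking that `len(read_fld(...))` equals 1001 is a quick sanity test against the default.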

mdshw5 avatar Jul 17 '20 14:07 mdshw5

Thank you, this is exactly what I was missing!

mcsimenc avatar Jul 17 '20 16:07 mcsimenc

@mdshw5 Let me know if I can help out in any way. The idea of getting more metadata from salmon runs for QC purposes is definitely on my radar.

mdshw5 avatar Jul 17 '20 18:07 mdshw5