stats in machine readable format?
seqkit stat:
File.fastq.gz FASTQ DNA 1,091,110 162,522,654 35 149 151
The numbers have commas (,) in them and the results are space formatted to be easy for humans to read.
Do you have an --option to write in TSV without the commas?
So scripts can safely use cut,paste,mlr etc?
I'll add this option
https://github.com/shenwei356/seqkit/commit/a6aae0fc79092c2ab9eb10b5afac803b44b49246
available in v0.7.1-dev
Thank you
available in v0.7.1-dev
Hello!
It would be really nice if you make -T option to be the default behaviour, and add some other option for the output to be formatted in human-readable style (default behaviour currently), or just state more explicitly in the stats -h option that -T also removes commas from the numbers output.
Here is the issue I had and it was exactly due to these commas: For raw reads (in ~1300 fastq.gz files):
seqkit stats data/*fastq.gz | awk 'BEGIN{readcount=0} {readcount+=$4} END{print readcount}'
312000
Strange round number, isn't it? But couldn't read count sum up to round number? "Of course it could. Why not?" - I thought.
Then I merged them together using fastcat (or it could be done with seqkit, doesn't really matter, but using two different tools really made me think that the problem was on fastcat side.., it did not):
fastcat --file=data/summary.txt --histograms=data/hist/ data/ > data/merged.fastq
I ran sekqit one more time and the number of reads miraculously increased!
seqkit stats data/merged.fastq
file format type num_seqs sum_len min_len avg_len max_len
data/merged.fastq.gz FASTQ DNA 455,359 647,101,583 1 1,421.1 41,105
455359 reads instead of 312000! I became more and more suspicious about that round number of 312000, especially since every other tool i used to double check the results, gave me 455359 count. And then manually inspecting the output of the first command I noticed that read count value for some fastq.gz files was three digits and for some was four digits and also had commas inserted in it! Bingo!
Finally it turned out that seqkit produces output with commas inserted every three digits in the values as can be seen in the example earlier. awk processes only the part of the number before the first comma, e.g. 4,333,267 will be treated as simple 4! Hence the decreased number of read counts in the data before merging.
What I needed was option -T, but currently the description states -T, --tabular output in machine-friendly tabular format and it is not clear that it also removes commas.
My personal opinion is that it is much better to produce the machine-friendly results by default and have human-readable format to be toggled on if necessary, though having such option is also not that important since we have the column command to align everything in a way we like.
Please, consider these thoughts.
Also thanks for great tool! It helps a lot!
Kind regards,
Kirill.
Dear Kirill, in the very beginning, the stats command was designed to show a simple summary of fasta/q files in a human-readable way (with commas), and later -T was added for downstream analysis. I'm sorry I can't reverse the way it does, as the compatibility needs to be kept for thousands of users.
If you did a small test with a few files before piping it to awk, the problem could be avoided very soon by adding -T.
I see, maybe you did, but with files less than 1000 sequences, so you didn't notice the commas :( .
I'll add some notice in the help message. Sorry for the trouble it caused you.
No problem at all, just thought it could save some time for future users) Thanks!