seqkit icon indicating copy to clipboard operation
seqkit copied to clipboard

stats in machine readable format?

Open tseemann opened this issue 8 years ago • 7 comments

seqkit stat:
File.fastq.gz  FASTQ   DNA   1,091,110  162,522,654       35      149      151

The numbers have commas (,) in them and the results are space formatted to be easy for humans to read.

Do you have an --option to write in TSV without the commas? So scripts can safely use cut,paste,mlr etc?

tseemann avatar Sep 16 '17 00:09 tseemann

I'll add this option

shenwei356 avatar Sep 16 '17 00:09 shenwei356

https://github.com/shenwei356/seqkit/commit/a6aae0fc79092c2ab9eb10b5afac803b44b49246

shenwei356 avatar Sep 16 '17 01:09 shenwei356

available in v0.7.1-dev

shenwei356 avatar Sep 16 '17 02:09 shenwei356

Thank you

tseemann avatar Sep 16 '17 03:09 tseemann

available in v0.7.1-dev

Hello! It would be really nice if you make -T option to be the default behaviour, and add some other option for the output to be formatted in human-readable style (default behaviour currently), or just state more explicitly in the stats -h option that -T also removes commas from the numbers output.

Here is the issue I had and it was exactly due to these commas: For raw reads (in ~1300 fastq.gz files):

seqkit stats data/*fastq.gz | awk 'BEGIN{readcount=0} {readcount+=$4} END{print readcount}'
312000

Strange round number, isn't it? But couldn't read count sum up to round number? "Of course it could. Why not?" - I thought.

Then I merged them together using fastcat (or it could be done with seqkit, doesn't really matter, but using two different tools really made me think that the problem was on fastcat side.., it did not):

fastcat --file=data/summary.txt --histograms=data/hist/ data/ > data/merged.fastq

I ran sekqit one more time and the number of reads miraculously increased!

seqkit stats data/merged.fastq
file                             format  type  num_seqs      sum_len  min_len  avg_len  max_len
data/merged.fastq.gz             FASTQ   DNA    455,359  647,101,583        1  1,421.1   41,105

455359 reads instead of 312000! I became more and more suspicious about that round number of 312000, especially since every other tool i used to double check the results, gave me 455359 count. And then manually inspecting the output of the first command I noticed that read count value for some fastq.gz files was three digits and for some was four digits and also had commas inserted in it! Bingo! Finally it turned out that seqkit produces output with commas inserted every three digits in the values as can be seen in the example earlier. awk processes only the part of the number before the first comma, e.g. 4,333,267 will be treated as simple 4! Hence the decreased number of read counts in the data before merging. What I needed was option -T, but currently the description states -T, --tabular output in machine-friendly tabular format and it is not clear that it also removes commas. My personal opinion is that it is much better to produce the machine-friendly results by default and have human-readable format to be toggled on if necessary, though having such option is also not that important since we have the column command to align everything in a way we like. Please, consider these thoughts. Also thanks for great tool! It helps a lot! Kind regards, Kirill.

Tarasovk49 avatar Jul 03 '25 13:07 Tarasovk49

Dear Kirill, in the very beginning, the stats command was designed to show a simple summary of fasta/q files in a human-readable way (with commas), and later -T was added for downstream analysis. I'm sorry I can't reverse the way it does, as the compatibility needs to be kept for thousands of users.

If you did a small test with a few files before piping it to awk, the problem could be avoided very soon by adding -T.

I see, maybe you did, but with files less than 1000 sequences, so you didn't notice the commas :( .

I'll add some notice in the help message. Sorry for the trouble it caused you.

shenwei356 avatar Jul 03 '25 14:07 shenwei356

No problem at all, just thought it could save some time for future users) Thanks!

Tarasovk49 avatar Jul 03 '25 15:07 Tarasovk49