ngless
Include file grouping information in fqstats
Using only the information contained in an fqstats file, it is currently impossible to tell whether pair.1, pair.2 and singles were processed together using pairing information, as in paired(..., singles=...), or whether each file was loaded independently with fastq(...).
Adding file grouping information could alleviate this issue. Example:
SAMPLE
0:file pair.1.fq.gz
0:encoding Sanger (33 offset)
0:numSeqs 737216
0:numBasepairs 73654175
0:minSeqLen 50
0:maxSeqLen 101
0:gcContent 0.41184101240696
0:filegroup 0 <---
1:file pair.2.fq.gz
... ...
1:filegroup 0 <---
2:file singles.fq.gz
... ...
2:filegroup 0 <--- all above = paired(..., singles=...)
3:file processed.pair.1.fq.gz
... ...
3:filegroup 1 <--- new group
... ...
A similar situation is seen when using load_mocat_sample(...) on a folder that includes multiple pairs/lanes. Here, a variable number of inputs makes parsing the stats file non-trivial.
In this case, and related to https://github.com/ngless-toolkit/ngless/issues/55#issuecomment-358085413, we could treat all the inputs of a sample as belonging to the same filegroup.
SAMPLE
0:file SAMPLE/pairA.1.fq.gz
... ...
0:filegroup 0 <---
1:file SAMPLE/pairA.2.fq.gz
... ...
1:filegroup 0 <---
2:file SAMPLE/singlesA.fq.gz
... ...
2:filegroup 0 <---
3:file SAMPLE/pairB.1.fq.gz
... ...
3:filegroup 0 <---
4:file SAMPLE/pairB.2.fq.gz
... ...
4:filegroup 0 <---
5:file SAMPLE/singlesB.fq.gz
... ...
5:filegroup 0 <---
... ...
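To illustrate how a downstream consumer might use the proposed field, here is a minimal Python sketch. It assumes the stats file keeps the `index:key value` layout shown in the examples above, with whitespace between key and value, and that the `<---` arrows are annotations for this issue rather than part of the file. The `filegroup` key is the proposal itself, not something ngless currently emits, and the function names `parse_fqstats` and `group_files` are hypothetical.

```python
from collections import defaultdict


def parse_fqstats(text):
    """Parse 'index:key value' lines into {index: {key: value}}.

    Assumes the layout shown in this issue; the real fqstats
    format may differ in its separators.
    """
    entries = defaultdict(dict)
    for line in text.splitlines():
        head, sep, rest = line.strip().partition(":")
        if not sep or not head.isdigit():
            continue  # sample header or anything that is not 'N:key value'
        parts = rest.split(None, 1)
        if len(parts) != 2:
            continue  # key without a value
        key, value = parts
        entries[int(head)][key] = value
    return dict(entries)


def group_files(entries):
    """Map each (proposed) filegroup id to its file names, in index order."""
    groups = defaultdict(list)
    for index in sorted(entries):
        record = entries[index]
        if "file" in record and "filegroup" in record:
            groups[int(record["filegroup"])].append(record["file"])
    return dict(groups)


example = """SAMPLE
0:file pair.1.fq.gz
0:encoding Sanger (33 offset)
0:filegroup 0
1:file pair.2.fq.gz
1:filegroup 0
2:file singles.fq.gz
2:filegroup 0
3:file processed.pair.1.fq.gz
3:filegroup 1
"""

for group, files in group_files(parse_fqstats(example)).items():
    print(group, files)
```

With grouping available, the number of files per sample no longer needs to be known in advance: a load_mocat_sample(...) folder with several pairs/lanes simply yields one group with more members.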