dna-seq-varlociraptor
Adjust get_read_group for multi sample config.
I have projects where I have to use the same sample in multiple groups. For example, I have a lot of single-case samples, but the parents are sequenced chunkwise in pools. In that case I write a config which looks like this:
CSDN21 index CSDN21 ILLUMINA NA
CSDN47 motherpool CSDN21 ILLUMINA NA
CSDN52 fatherpool CSDN21 ILLUMINA NA
CSDN22 index CSDN22 ILLUMINA NA
CSDN47 motherpool CSDN22 ILLUMINA NA
CSDN52 fatherpool CSDN22 ILLUMINA NA
CSDN23 index CSDN23 ILLUMINA NA
CSDN47 motherpool CSDN23 ILLUMINA NA
CSDN52 fatherpool CSDN23 ILLUMINA NA
This seems to work just fine for the calling, but for the mapping we have to slightly modify the read_group string generation.
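To illustrate the problem: once a sample name occurs in several rows, a lookup of its platform yields several values instead of one, so the read group string has to be built from the de-duplicated value. Below is a minimal sketch of that idea, not the actual change in this PR. It uses only the standard library so it runs standalone; the column names, the omitted last column, and the two-argument signature of `get_read_group` are assumptions made for the example.

```python
import csv
import io

# Hypothetical excerpt of the sample sheet; column names are assumed and the
# last column of the real sheet is left out.
SAMPLES_TSV = """sample_name\talias\tgroup\tplatform
CSDN21\tindex\tCSDN21\tILLUMINA
CSDN47\tmotherpool\tCSDN21\tILLUMINA
CSDN52\tfatherpool\tCSDN21\tILLUMINA
CSDN22\tindex\tCSDN22\tILLUMINA
CSDN47\tmotherpool\tCSDN22\tILLUMINA
CSDN52\tfatherpool\tCSDN22\tILLUMINA
"""


def load_rows(text):
    """Parse a tab-separated sample sheet into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def get_read_group(sample, rows):
    """Build the read group argument for a sample that may occur in several rows."""
    # Collect the distinct platforms over all rows of this sample.
    platforms = {row["platform"] for row in rows if row["sample_name"] == sample}
    if not platforms:
        raise KeyError(f"unknown sample: {sample}")
    if len(platforms) > 1:
        raise ValueError(f"conflicting platforms for {sample}: {sorted(platforms)}")
    # Raw string: the mapper expects a literal backslash-t, not a tab character.
    return r"-R '@RG\tID:{sample}\tSM:{sample}\tPL:{platform}'".format(
        sample=sample, platform=platforms.pop()
    )


rows = load_rows(SAMPLES_TSV)
print(get_read_group("CSDN47", rows))
# prints: -R '@RG\tID:CSDN47\tSM:CSDN47\tPL:ILLUMINA'
```

Raising on conflicting platforms is a design choice of this sketch: a sample listed in several groups should still describe the same sequencing data.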
Yes, I've also thought about a comma-separated list. But a single sample might have a different role in different groups. A comma-separated list would not be enough in this case; you would also need a comma-separated alias list. ... I don't know, I don't know ...
That's a very good point.
What if we instead add another file, groups.tsv, for group assignment (while removing the alias and group columns from samples.tsv)?
group sample_name alias
CSDN21 CSDN21 index
CSDN21 CSDN47 motherpool
CSDN21 CSDN52 fatherpool
CSDN22 CSDN22 index
CSDN22 CSDN47 motherpool
CSDN22 CSDN52 fatherpool
I think that would better capture the relational nature of such constructs, and maybe also be cleaner, because the tables become less crowded and redundant.
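A rough sketch of how the two tables could be combined, assuming samples.tsv keeps one row per sample and groups.tsv holds one row per group membership as proposed. The `platform` column, the helper names `group_members` and `groups_of`, and the use of the standard library instead of the workflow's own table handling are assumptions for illustration only.

```python
import csv
import io

# Hypothetical samples.tsv after the split: one row per sample, no alias/group.
SAMPLES_TSV = """sample_name\tplatform
CSDN21\tILLUMINA
CSDN22\tILLUMINA
CSDN47\tILLUMINA
CSDN52\tILLUMINA
"""

# groups.tsv as proposed: one row per (group, sample) membership.
GROUPS_TSV = """group\tsample_name\talias
CSDN21\tCSDN21\tindex
CSDN21\tCSDN47\tmotherpool
CSDN21\tCSDN52\tfatherpool
CSDN22\tCSDN22\tindex
CSDN22\tCSDN47\tmotherpool
CSDN22\tCSDN52\tfatherpool
"""


def read_tsv(text):
    """Parse tab-separated text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def group_members(group, groups, samples):
    """Return (sample_name, alias, platform) for every member of a group."""
    platform_of = {row["sample_name"]: row["platform"] for row in samples}
    return [
        (row["sample_name"], row["alias"], platform_of[row["sample_name"]])
        for row in groups
        if row["group"] == group
    ]


def groups_of(sample, groups):
    """Return (group, alias) for every group a sample belongs to."""
    return [
        (row["group"], row["alias"])
        for row in groups
        if row["sample_name"] == sample
    ]


samples = read_tsv(SAMPLES_TSV)
groups = read_tsv(GROUPS_TSV)
print(group_members("CSDN22", groups, samples))
print(groups_of("CSDN47", groups))
```

Because the alias lives on the membership row, the same sample can take a different role in each group without repeating its per-sample attributes.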