distiller-nf
distiller-nf copied to clipboard
Filter stats option.
When binning coolers, the user might provide a set of filters to include certain types of pairs into final matrices. However, for stats, there was no such option. In this PR, we propose two additional distiller processes that will allow collecting and merging the stats for the same set of filters.
Disclaimer: this is an option that does not affect the rest of distiller. The addition of two filter+stats processes is controlled by params.stats.use_filters==True, which is set to False by default. The addon is backward-compatible with previous versions of params files.
Ran this on one of my real projects. Here are the lines of two files for the same library until the by-distance cis counts, old file and mapq_30 filtered new.
total 740052127
total_unmapped 43208972
total_single_sided_mapped 99919934
total_mapped 596923221
total_dups 52495969
total_nodups 544427252
cis 419698314
trans 124728938
pair_types/MM 22105219
pair_types/WW 18023654
pair_types/NM 1943706
pair_types/RU 487212
pair_types/MR 2706565
pair_types/NU 17540428
pair_types/NN 1136393
pair_types/DD 52495969
pair_types/UU 543450198
pair_types/MU 68737651
pair_types/NR 10935290
pair_types/UR 489842
total 559048104
total_unmapped 12601339
total_single_sided_mapped 0
total_mapped 546446765
total_dups 48264298
total_nodups 498182467
cis 387357306
trans 110825161
pair_types/DD 48264298
pair_types/WW 12601339
pair_types/RU 438586
pair_types/UU 497302503
pair_types/UR 441378
Some strange differences: different total number and different total_unmapped. 0 total_single_sided_mapped in the new file! Very strange. And I guess related, a lot of pair types are missing... I guess is makes sense with the mapq_30 filter if you think about it, but it's a little confusing that even the total number is affected.
@Phlya Can you provide no_filter stats and full stats file, please?
I suggest checking G1_DMSO_rep2.no_filter.hg38.stats.txt against G1_DMSO_rep2.hg38.stats.txt, not G1_DMSO_rep2.mapq_30.hg38.stats.txt
Ah I didn't have the no_filter in this pipeline, actually! Running it
by the way, the total number of reads is inevitably reduced with mapq30 filter, because pairtools select '(mapq1>=30) and (mapq2>=30)'
will remove the pairs that have at least one segment with lower mapq than the threshold. I guess that's the filter that you run.
Yes. Technically it makes sense, but actually that's not what you'd expect... Or at least not what I would expect.
Can you think of the script or pairtools command that will do what you expect to have in filtered stats then? Maybe marking all the filtered read instead of filtering them out? I'm not sure whether there are existing tools for that, though.
I guess just for the stats purposes we can modify the filter to include all unmapped or ss mapped pairs too? I think then everything above the total_mapped will be the same as without the filter, and that is the most confusing part for me.
Marking would be the best though. Or stats could be modified to apply a filter on the fly and accumulate, in this case, stats at different mapq filtering levels? But all this requires modifying pairtools.
So the cluster was super busy, but finally here is an example of no_filter vs mapq_30 G1_DMSO_2.hg38.mapq_30.stats.txt G1_DMSO_2.hg38.no_filter.stats.txt
Hi, @Phlya Thanks I think the only important thing to check is convergence of no_filter stats against the default output of distiller stats. Then it means that the pairs merging filters run correctly and as expected by design of pipeline. I don't think comparison of mapq_30 filter against no_filter gives us any additional information despite that mapq_30 filter indeed filters out some contacts, which is good, but not enough to test the pipeline update
G1_DMSO_rep2.hg38.stats.txt G1_DMSO_rep2.no_filter.hg38.stats.txt G1_DMSO_rep2.mapq_30.hg38.stats.txt
No filter and no_filter are identical
Thanks, cool!
Interesting counterintuitive thing from your amazing stats: mapq_30 filter results in non-zero "total_unmapped". How come the read can have mapq>30 if it is not mapped?
mapq is bwa mem
output, while the pair type is an output of pairs parser. Parser in some regimes cannot parse complex walks (reporting WW types instead), which are counted as "unmapped" by pairtools stats
So, with filter stats one should interpret only the pair types that are mapped and actually ended up as meaningful contact pairs in coolers. Any other thoughts?
For individual libraries and library groups the file names are inconsistent: one has assembly name before filter, the other filter before assembly.
I didn't realize WW counted as unmapped, btw! That's a little misleading actually, if one wanted to compare different pipelines, for example... Perhaps any pair types that are not rescued should be counted in another category, "unresolved" or something like that. That would also make the output of this new stats a bit more sensible. Not suggesting it as part of this PR of course, just a thought...
This can be removed now, @Phlya ?
I think all this branch can be deleted now as we modified the pairtools stats
to accept filters. @Phlya @golobor , is it okay with you?