distiller-nf Filter stats option.

When binning coolers, the user might provide a set of filters to include certain types of pairs into final matrices. However, for stats, there was no such option. In this PR, we propose two additional distiller processes that will allow collecting and merging the stats for the same set of filters.

Disclaimer: this is an option that does not affect the rest of distiller. The addition of two filter+stats processes is controlled by params.stats.use_filters==True, which is set to False by default. The addon is backward-compatible with previous versions of params files.

Jan 27 '22 11:01 agalitsyna

Ran this on one of my real projects. Here are the lines of two files for the same library until the by-distance cis counts, old file and mapq_30 filtered new.

total	740052127
total_unmapped	43208972
total_single_sided_mapped	99919934
total_mapped	596923221
total_dups	52495969
total_nodups	544427252
cis	419698314
trans	124728938
pair_types/MM	22105219
pair_types/WW	18023654
pair_types/NM	1943706
pair_types/RU	487212
pair_types/MR	2706565
pair_types/NU	17540428
pair_types/NN	1136393
pair_types/DD	52495969
pair_types/UU	543450198
pair_types/MU	68737651
pair_types/NR	10935290
pair_types/UR	489842

total	559048104
total_unmapped	12601339
total_single_sided_mapped	0
total_mapped	546446765
total_dups	48264298
total_nodups	498182467
cis	387357306
trans	110825161
pair_types/DD	48264298
pair_types/WW	12601339
pair_types/RU	438586
pair_types/UU	497302503
pair_types/UR	441378

Some strange differences: different total number and different total_unmapped. 0 total_single_sided_mapped in the new file! Very strange. And I guess related, a lot of pair types are missing... I guess is makes sense with the mapq_30 filter if you think about it, but it's a little confusing that even the total number is affected.

Jan 28 '22 08:01 Phlya

@Phlya Can you provide no_filter stats and full stats file, please?

Jan 28 '22 08:01 agalitsyna

G1_DMSO_rep2.hg38.stats.txt G1_DMSO_rep2.mapq_30.hg38.stats.txt Sure!

Jan 28 '22 08:01 Phlya

I suggest checking G1_DMSO_rep2.no_filter.hg38.stats.txt against G1_DMSO_rep2.hg38.stats.txt, not G1_DMSO_rep2.mapq_30.hg38.stats.txt

Jan 28 '22 08:01 agalitsyna

Ah I didn't have the no_filter in this pipeline, actually! Running it

Jan 28 '22 08:01 Phlya

by the way, the total number of reads is inevitably reduced with mapq30 filter, because pairtools select '(mapq1>=30) and (mapq2>=30)' will remove the pairs that have at least one segment with lower mapq than the threshold. I guess that's the filter that you run.

Jan 28 '22 08:01 agalitsyna

Yes. Technically it makes sense, but actually that's not what you'd expect... Or at least not what I would expect.

Jan 28 '22 08:01 Phlya

Can you think of the script or pairtools command that will do what you expect to have in filtered stats then? Maybe marking all the filtered read instead of filtering them out? I'm not sure whether there are existing tools for that, though.

Jan 28 '22 09:01 agalitsyna

I guess just for the stats purposes we can modify the filter to include all unmapped or ss mapped pairs too? I think then everything above the total_mapped will be the same as without the filter, and that is the most confusing part for me.

Marking would be the best though. Or stats could be modified to apply a filter on the fly and accumulate, in this case, stats at different mapq filtering levels? But all this requires modifying pairtools.

Jan 28 '22 09:01 Phlya

So the cluster was super busy, but finally here is an example of no_filter vs mapq_30 G1_DMSO_2.hg38.mapq_30.stats.txt G1_DMSO_2.hg38.no_filter.stats.txt

Jan 29 '22 08:01 Phlya

Hi, @Phlya Thanks I think the only important thing to check is convergence of no_filter stats against the default output of distiller stats. Then it means that the pairs merging filters run correctly and as expected by design of pipeline. I don't think comparison of mapq_30 filter against no_filter gives us any additional information despite that mapq_30 filter indeed filters out some contacts, which is good, but not enough to test the pipeline update

Jan 29 '22 09:01 agalitsyna

G1_DMSO_rep2.hg38.stats.txt G1_DMSO_rep2.no_filter.hg38.stats.txt G1_DMSO_rep2.mapq_30.hg38.stats.txt

Jan 29 '22 09:01 Phlya

No filter and no_filter are identical

Jan 29 '22 09:01 Phlya

Thanks, cool! Interesting counterintuitive thing from your amazing stats: mapq_30 filter results in non-zero "total_unmapped". How come the read can have mapq>30 if it is not mapped? mapq is bwa mem output, while the pair type is an output of pairs parser. Parser in some regimes cannot parse complex walks (reporting WW types instead), which are counted as "unmapped" by pairtools stats

So, with filter stats one should interpret only the pair types that are mapped and actually ended up as meaningful contact pairs in coolers. Any other thoughts?

Jan 29 '22 09:01 agalitsyna

For individual libraries and library groups the file names are inconsistent: one has assembly name before filter, the other filter before assembly.

Jan 29 '22 18:01 Phlya

I didn't realize WW counted as unmapped, btw! That's a little misleading actually, if one wanted to compare different pipelines, for example... Perhaps any pair types that are not rescued should be counted in another category, "unresolved" or something like that. That would also make the output of this new stats a bit more sensible. Not suggesting it as part of this PR of course, just a thought...

Jan 31 '22 10:01 Phlya

This can be removed now, @Phlya ?

Nov 08 '22 00:11 agalitsyna

I think all this branch can be deleted now as we modified the pairtools stats to accept filters. @Phlya @golobor , is it okay with you?

Feb 10 '23 22:02 agalitsyna

distiller-nf distiller-nf copied to clipboard

Filter stats option.

distiller-nf
distiller-nf copied to clipboard