basenji icon indicating copy to clipboard operation
basenji copied to clipboard

Clarifications about unmappable genomic regions

Open GrossTor opened this issue 2 years ago • 7 comments

Hello, thanks for developing and maintaining this fascinating project.

I am trying to reproduce some of data pre-processing steps and couldn't figure out how exactly un-mappable genomic regions are handled. I'd like to understand how this is done so I could repeat these steps for other species (e.g. Mouse). The manuscript mentions:

First, we collected blacklist regions from ENCODE and added all RepeatMasker satellite repeats [54], which we found to frequently collect large false positive signal [55]. We further defined unmappable regions of >32 bp where 24-mers align to >10 genomic sites using Umap mappability tracks [56].

In the basenji barnjard I find two files (for human) hg38.blacklist.rep.bed and umap_k24_t10_l32.bed. I am assuming that the first is the combined ENCODE blacklist + RepeatMasker satellite repeats. How was this file created or which external source was it downloaded from? I can only find the "pure" blacklist file. The second file seems to be the file derived from the Umap mappability tracks. But also here I am not sure how it was created. There are two versions of Umap mappability tracks (single-read mappability and multi-read mappability). The first indicates whether a position in the genome overlaps with any uniquely mappable read of length k. The second provides for each position the fraction of uniquely mappable k-mers that overlap this position. But I don't understand what it means to have "24-mers align to >10 genomic sites". Could you explain this or point to code that generated the umap_k24_t10_l32.bed file?

Furthermore, I found this line in the basenji pipeline, which suggests that the two files hg38.blacklist.rep.bed and umap_k24_t10_l32.bed get merged into umap_human.bed. Is this the file whose path you'd pass on to as the '-unmap_bed' argument in basenji_data.py? But that function also has a '-blacklist_bed' argument. Which bed file did you intent to be used here?

Thank you!

GrossTor avatar Jul 25 '22 23:07 GrossTor

Hi, is your objective to perform human or mouse training and understand better how these files were created? Or is your objective to move to a new species and create analogous files for that species?

How was this file created or which external source was it downloaded from? The file is a bedtools merge of the ENCODE blacklist, my own manually constructed blacklist over the years, and a grep of the satellite repeats from UCSC's repeatmasker table.

Could you explain this or point to code that generated the umap_k24_t10_l32.bed file? I began with this https://bismap.hoffmanlab.org/raw/hg38.umap.tar.gz. I then identified intervals with mappability value < 0.1 with length greater than 32 and wrote them out as a BED file. I can send you the script if you'd like.

Is this the file whose path you'd pass on to as the '-unmap_bed' argument in basenji_data.py? Yes. I assume blacklisted regions are also unmappable (due to misasemblies or some other reason). The regions in this BED file will be blocked from having large values. Sequences with a very high proportion of unmappable sequence will be tossed out.

But that function also has a '-blacklist_bed' argument. Which bed file did you intent to be used here? Only the blacklist BED. However, as I'm looking this over, I don't think this one is necessary anymore. These regions are getting clipped, but that will happen anyway later on due to their inclusion in the unmap_bed. Sorry for the confusion; the workflow here has evolved over time with poor software engineering practices.

davek44 avatar Jul 29 '22 19:07 davek44

Thank you for the reply. Everything is much clearer now. Just a few more comments:

is your objective to perform human or mouse training and understand better how these files were created? Or is your objective to move to a new species and create analogous files for that species?

My goal is to do human and mouse training.

I can send you the script if you'd like.

The procedure you described makes sense to me and it is easy enough to reproduce. But it could be a good idea to add that script to the repository in any case. I assume the same procedure can be done for mouse starting from https://bismap.hoffmanlab.org/raw/mm10.umap.tar.gz . But the basenji_barnjard doesn't seem to contain the resulting umap bed file. Is there a specific reason for it? Would you not recommend using Umap for mouse?

GrossTor avatar Aug 01 '22 13:08 GrossTor

No, I recommend the same strategy for mouse. I must've forgotten to upload that file. You can find updated versions of both now here gs://basenji_barnyard2/. I also uploaded my final BED file after merging with the blacklist, too.

(We had to switch gs://basenji_barnyard/ to requester pays due to the cost of training data downloads.)

davek44 avatar Aug 10 '22 23:08 davek44

I then identified intervals with mappability value < 0.1 with length greater than 32 and wrote them out as a BED file. I can send you the script if you'd like.

Hello!Could you please sand me the script? Thank you in advance!

luohaitaofujian avatar Aug 12 '22 06:08 luohaitaofujian

Added the script here https://github.com/calico/basenji/blob/master/bin/unmappable_bed.py

davek44 avatar Sep 03 '22 17:09 davek44

Thank you.

Minor point: the original basenji barnyard included a file called umap_k24_t10_l32.bed, while barnyard2 now contains umap_k36_t10_l32_hg38.bed. I assume that the change from k=24 to k=36 was an intended update and doesn't make a big difference.

GrossTor avatar Sep 21 '22 01:09 GrossTor

Right, should be a little better, but pretty minimal difference.

davek44 avatar Oct 02 '22 19:10 davek44