basenji
basenji copied to clipboard
Clarifications about unmappable genomic regions
Hello, thanks for developing and maintaining this fascinating project.
I am trying to reproduce some of data pre-processing steps and couldn't figure out how exactly un-mappable genomic regions are handled. I'd like to understand how this is done so I could repeat these steps for other species (e.g. Mouse). The manuscript mentions:
First, we collected blacklist regions from ENCODE and added all RepeatMasker satellite repeats [54], which we found to frequently collect large false positive signal [55]. We further defined unmappable regions of >32 bp where 24-mers align to >10 genomic sites using Umap mappability tracks [56].
In the basenji barnjard I find two files (for human) hg38.blacklist.rep.bed and umap_k24_t10_l32.bed. I am assuming that the first is the combined ENCODE blacklist + RepeatMasker satellite repeats. How was this file created or which external source was it downloaded from? I can only find the "pure" blacklist file. The second file seems to be the file derived from the Umap mappability tracks. But also here I am not sure how it was created. There are two versions of Umap mappability tracks (single-read mappability and multi-read mappability). The first indicates whether a position in the genome overlaps with any uniquely mappable read of length k. The second provides for each position the fraction of uniquely mappable k-mers that overlap this position. But I don't understand what it means to have "24-mers align to >10 genomic sites". Could you explain this or point to code that generated the umap_k24_t10_l32.bed file?
Furthermore, I found this line in the basenji pipeline, which suggests that the two files hg38.blacklist.rep.bed and umap_k24_t10_l32.bed get merged into umap_human.bed. Is this the file whose path you'd pass on to as the '-unmap_bed' argument in basenji_data.py? But that function also has a '-blacklist_bed' argument. Which bed file did you intent to be used here?
Thank you!
Hi, is your objective to perform human or mouse training and understand better how these files were created? Or is your objective to move to a new species and create analogous files for that species?
How was this file created or which external source was it downloaded from? The file is a bedtools merge of the ENCODE blacklist, my own manually constructed blacklist over the years, and a grep of the satellite repeats from UCSC's repeatmasker table.
Could you explain this or point to code that generated the umap_k24_t10_l32.bed file? I began with this https://bismap.hoffmanlab.org/raw/hg38.umap.tar.gz. I then identified intervals with mappability value < 0.1 with length greater than 32 and wrote them out as a BED file. I can send you the script if you'd like.
Is this the file whose path you'd pass on to as the '-unmap_bed' argument in basenji_data.py? Yes. I assume blacklisted regions are also unmappable (due to misasemblies or some other reason). The regions in this BED file will be blocked from having large values. Sequences with a very high proportion of unmappable sequence will be tossed out.
But that function also has a '-blacklist_bed' argument. Which bed file did you intent to be used here? Only the blacklist BED. However, as I'm looking this over, I don't think this one is necessary anymore. These regions are getting clipped, but that will happen anyway later on due to their inclusion in the unmap_bed. Sorry for the confusion; the workflow here has evolved over time with poor software engineering practices.
Thank you for the reply. Everything is much clearer now. Just a few more comments:
is your objective to perform human or mouse training and understand better how these files were created? Or is your objective to move to a new species and create analogous files for that species?
My goal is to do human and mouse training.
I can send you the script if you'd like.
The procedure you described makes sense to me and it is easy enough to reproduce. But it could be a good idea to add that script to the repository in any case. I assume the same procedure can be done for mouse starting from https://bismap.hoffmanlab.org/raw/mm10.umap.tar.gz . But the basenji_barnjard doesn't seem to contain the resulting umap bed file. Is there a specific reason for it? Would you not recommend using Umap for mouse?
No, I recommend the same strategy for mouse. I must've forgotten to upload that file. You can find updated versions of both now here gs://basenji_barnyard2/. I also uploaded my final BED file after merging with the blacklist, too.
(We had to switch gs://basenji_barnyard/ to requester pays due to the cost of training data downloads.)
I then identified intervals with mappability value < 0.1 with length greater than 32 and wrote them out as a BED file. I can send you the script if you'd like.
Hello!Could you please sand me the script? Thank you in advance!
Added the script here https://github.com/calico/basenji/blob/master/bin/unmappable_bed.py
Thank you.
Minor point: the original basenji barnyard included a file called umap_k24_t10_l32.bed, while barnyard2 now contains umap_k36_t10_l32_hg38.bed. I assume that the change from k=24 to k=36 was an intended update and doesn't make a big difference.
Right, should be a little better, but pretty minimal difference.