msprime icon indicating copy to clipboard operation
msprime copied to clipboard

Removing centromeres and telomeres from genome simulations

Open LukeAndersonTrocme opened this issue 2 years ago • 16 comments

Summarizing an email chain as an issue to allow for open discussion of the topic.

Mutations in telomeres and centromeres can lead to misleading results. So it makes sense to remove the trees from these sections of the genome to protect downstream users from shooting themselves in the foot (say, if they threw different sets of mutations onto the trees).

In my current pipeline, I output the entire tree sequence to bcf, then apply the strict mask. I'll note that this mask is specific to this recombination map HapmapII_GRCh37 (download)

When I have a moment I'll try to sift through the mask file to see if I can easily identify telomeres and centromeres. Will post an update to this issue when I make progress.

LukeAndersonTrocme avatar Feb 24 '22 20:02 LukeAndersonTrocme

Thanks @LukeAndersonTrocme - this is an important point when working with real genetic maps.

The best way to do this would be to mark the centromeres and telomeres as missing data in the recombination map before simulating. We sort-of do this already for telomeres in read_hapmap.

There is some code here for manually snipping out centromeres, which could be simplified by using more recent APIs like delete_intervals

jeromekelleher avatar Feb 28 '22 10:02 jeromekelleher

Thanks for the links to the code snippets, it looks like we already have everything we need to remove these regions.

It seems like you're describing two possible options:

Option A

  1. generate centromeres.csv with columns "chrom,start,end" described here
  2. after each chromosome is simulated, we identify the centromere intervals (start, end) from centromeres.csv and use delete_intervals to exclude centromeres.

Option B modify read_hapmap to load the centromeres.csv directly and mark them as nan in the rate_map.

One question I have is, when the rate map has missing data, these regions are effectively excluded from the tree sequence and won't have any sites within those intervals? This would mean that in Option B, we exclude the regions before the simulation, and in Option A we exclude the regions after the simulation?

If this is the case, both A and B would result in the same output tree, although presumably option B might be slightly more efficient..

From a practical standpoint, I think I can handle coding up Option A because it uses existing msprime code and would fit nicely in my existing wrapper function.

I'll wait for input from @sgravel and/or @jeromekelleher before moving forward on this.

LukeAndersonTrocme avatar Feb 28 '22 18:02 LukeAndersonTrocme

Luke, this is also my interpretation. I agree with Jerome that specifying before the simulation is best, since msprime is already set up to take this into account. I am not quite clear to me where the best place is to specify the missingness, presumably it would be after reading in the map, and then directly editing the map object.

We may still have to filter out other repetitive regions in any case to avoid people using the data incorrectly, so I expect your strict mask filtering will come in handy

sgravel avatar Feb 28 '22 22:02 sgravel

Pasting over Jeromes comments about the mask file from the email chain which is relevant here:

You're right about the strict mask, but unfortunately if we apply it to the trees​ (i.e., we mark all the bits in the strict mask as missing data) the size of the files explode. The mask is very "bitty", and we therefore lose all the gains from the tskit format. Chopping out the centromeres and teleomeres is a good compromise, and helps protect downstream users from shooting themselves in the foot (say, if they threw different sets of mutations onto the trees).

to which I replied:

Regarding the rest of the mask file, I think we can get away with providing a link to the mask file in the documentation, and a short explanation of why we recommend using the mask file in downstream analyses. In any case, the pipeline I’m using to generate the PCA/UMAP plots will have details on the extra filtration steps (minor allele frequency cutoff, linkage disequilibrium pruning, etc.) which I expect will be useful for others wanting to work with these data.

LukeAndersonTrocme avatar Feb 28 '22 22:02 LukeAndersonTrocme

I think Luke's plan A is the way to go. We figure out some set of intervals to chop out, and we get rid of them after the simulation (which is helpful, if we've already done the simulations!)

How we get those intervals is the tricky bit. We don't want to use the full strict mask, as this will chop up the returned trees to an unacceptable degree. However, it would be nice if we could get rid of any large repetitive regions like @sgravel says. I wonder if we could do something like this:

  • Filter the mask so that we pull out all masked intervals > (say) 100Kb long, and chop these out of the trees using delete_intervals
  • Subsequently filter out the variant sites that we generate so that none are in the strict mask

It's not so clear that the second step is useful, since it would probably be better to use the mask to remove things that your working on from the real data here, but it's something we can do if we like.

Would this work? Does the strict mask contain any large consecutive intervals that we can chop out?

jeromekelleher avatar Mar 01 '22 09:03 jeromekelleher

Let's go with plan A. I'll start writing it up to remove the centromeres. (It should be easy to include other intervals down the line)

As for the repeat regions, I'm less certain about the best way to identify them. Following Jerome's suggestions to identify long masked intervals, here are a few informative one-liners:

luke$ cat 20140520.strict_mask.autosomes.bed | awk '{print $0, $3-$2}' | awk '$5 >= 10000' | wc -l
    4474
luke$ cat 20140520.strict_mask.autosomes.bed | awk '{print $0, $3-$2}' | awk '$5 >= 20000' | wc -l
      53
luke$ cat 20140520.strict_mask.autosomes.bed | awk '{print $0, $3-$2}' | awk '$5 >= 30000' | wc -l
       0
luke$ cat 20140520.strict_mask.autosomes.bed | awk '{print $0, $3-$2}' | awk '$5 >= 20000' | head
chr1	34814172	34837032	strict 22860
chr1	37392288	37415133	strict 22845
chr1	42036308	42059903	strict 23595
chr1	56517461	56537477	strict 20016
chr1	75232974	75253040	strict 20066
chr1	77030655	77050787	strict 20132
chr1	87801779	87822867	strict 21088
chr1	196271167	196296768	strict 25601
chr2	12887673	12909535	strict 21862
chr2	19561862	19581896	strict 20034

Is it worth getting into the weeds of the mask file just to remove a few 20kb regions? I would not expect these regions to cause a whole lot of issues, especially if we clearly explain in the docs that we recommend using a mask file on the genotype data.. this is a fairly common quality control step that most people would already be familiar with.

LukeAndersonTrocme avatar Mar 01 '22 17:03 LukeAndersonTrocme

I am fine with removing just the centromeres, and we could just use a bed file such as the one described here: https://www.biostars.org/p/2349/. We just need to make sure to mat ch the reference.

sgravel avatar Mar 01 '22 20:03 sgravel

Sounds like a plan! I'll post an update when I have it coded up.

The references do match by the way, we're using the GRCh37/hg19 recombination map (which is the same as in the thread)

LukeAndersonTrocme avatar Mar 01 '22 20:03 LukeAndersonTrocme

Just wanted to share my work to be sure we're on the same page

when simulating the genome I load the centromere csv and extract the intervals

    # identify centromere
    centromeres=pd.read_csv('{}/maps/centromeres.csv'.format(args.dir))
    centromere_intervals=centromeres.loc[centromeres['chrom'] == str(args.chromosome),["start","end"]].values

[smash cut to the simulation]

    # simulation within fixed pedigree
    ts = msprime.sim_ancestry(...)

    # simulation beyond fixed pedigree
    ts = msprime.sim_ancestry(...)

    # drop mutations down the tree
    ts = msprime.sim_mutations(...)

    # remove centromere
    ts = ts.delete_intervals(intervals = centromere_intervals)

after we've converted the ts to a bcf, I then check to see if there are any snps in the centromere

luke$ awk -F',' '$1 == 22' centromeres.csv  
22,12200000,17900000
luke$ bcftools view test_remove_centromere_chr_22_sim.bcf |  wc -l
90239
luke$ bcftools view test_remove_centromere_chr_22_sim.bcf | awk '$2 > 12200000 && $2 < 17900000' | wc -l
0

just to convince myself that I did things correctly, I ran the same code with the ts.delete_intervals() commented out. This yields the following:

luke$ bcftools view test_with_centromere_chr_22_sim.bcf |  wc -l
94931
luke$ bcftools view test_with_centromere_chr_22_sim.bcf | awk '$2 > 12200000 && $2 < 17900000' | wc -l
4692

I think this addresses the issue of removing centromeres. I would move to close this issue unless anyone else has any thoughts on the issue.

LukeAndersonTrocme avatar Mar 01 '22 22:03 LukeAndersonTrocme

Looks perfect to me @LukeAndersonTrocme!

Do we want to worry about the telomeres too? I think they're probably handled automatically because the recombination maps don't start at zero (I think?) and they end before the actual reference length of the chr.

Can you check:

  • Is there missing data at the start of each ts corresponding to the start telomere? (tree = ts.first(); print(tree.interval[1], tree.num_roots())). If num_roots is == num_samples then the tree is marked as missing data. The right interval should correspond roughly to the end of the telomere.
  • Check the sequence_length of the ts against the length in GRCh37.

If the sequence length is not the correct value, we can easily tweak this:

tables = ts.dump_tables()
tables.sequence_length = sequence_length_from_GRCh37
ts_final = tables.tree_sequence()

this will, in effect, insert a final empty tree representing the right telomere.

jeromekelleher avatar Mar 02 '22 07:03 jeromekelleher

Ran the sanity checks @jeromekelleher suggested.

I was a little confused by "If num_roots is == num_samples then the tree is marked as missing data." Is the missing data comment just mean that no samples coalesce at this first position (as is evident from there being the same number of roots and samples)

But the sequence_length is in fact the same as the length in GRCh37. And if my interpretation is correct this means that: 1) the tree sequence retains the information about the chromosome size and 2) has no coalescence (missing data) for the telomeres.

If we're all convinced that the telomeres are correctly handled by read_hapmap then I think we'd be clear to proceed as is. (But in any case, with the way it's currently written, it should be easy to add more intervals to deleted if need be.)

In [1]: import tskit

In [2]: import msprime

In [3]: ts=tskit.load("4882_unrelated_ascending_pedigree_22_sim.ts")

In [4]: tree = ts.first(); print(tree.interval[1], tree.num_roots, tree.num_samples())
16051347.0 9748 9748

In [5]: GRCh37_map = msprime.intervals.RateMap.read_hapmap('../../maps/genetic_map_GRCh37_chr22.txt')

In [6]: print(GRCh37_map.right.max(), ts.sequence_length)
51229805.0 51229805.0

LukeAndersonTrocme avatar Mar 02 '22 17:03 LukeAndersonTrocme

This is great, thanks @LukeAndersonTrocme. Some comments:

I was a little confused by "If num_roots is == num_samples then the tree is marked as missing data."

This is just how we encode missing data in tskit. This documentation should help

It looks like the left centromere is being handled fine, but we may want to handle the right centromere differently. Currently the sequence length is 51229805 but the length of chr22 in GRCh37 is 51304566. All we'd need to do is change the sequence_length on the tree sequence and then we'd have missing data for both the left and right hand side.

Is this worth doing do you think? It would be nice if the sequence_length was the actual lenght of the assembly.

jeromekelleher avatar Mar 03 '22 13:03 jeromekelleher

Good catch! I was reading the sequence_length directly from the RateMap which is already removing the telomeres.

So what I'll do is generate a chromosome_length.csv , (similar to how we did for the centromere.csv). When simulating the genomes, I'll specify the sequence length from that input rather than from the RateMap.

LukeAndersonTrocme avatar Mar 03 '22 15:03 LukeAndersonTrocme

When simulating the genomes, I'll specify the sequence length from that input rather than from the RateMap.

I'm not sure that'll work, I think msprime will complain (maybe it shouldn't). I think this just has to be done as an extra post-processing step.

jeromekelleher avatar Mar 03 '22 16:03 jeromekelleher

Ah, I see. I'm not sure how to modify the chromosome length in post-processing, do you have any suggestions? (or documentation I can read)

LukeAndersonTrocme avatar Mar 03 '22 16:03 LukeAndersonTrocme

tables = ts.dump_tables()
tables.sequence_length = sequence_length_from_assembly
ts_updated = tables.tree_sequence()

jeromekelleher avatar Mar 03 '22 16:03 jeromekelleher