Differences in masking %. RepeatMasker vs EDTA using repeat library produced by EDTA.

Open d00bin opened this issue 2 years ago • 10 comments

Hello, Shujun!

I successfully ran EDTA on a genome with a sensitive setting. In the log file EDTA printed:

TE annotation using the EDTA library has finished! Check out:
Whole-genome TE annotation (total TE: 16.73%): some_species.fasta.mod.EDTA.TEanno.gff3

Low-threshold TE masking for MAKER gene annotation (masked: 1.00%): some_species.fasta.mod.MAKER.masked

I thought that 1% is a bit too low, so I decided to re-run RepeatMasker with the library produced by EDTA. I used -xsmall -nolow options to later use the soft-masked genome sequence in BRAKER2.

For some reason though RepeatMasker masked 24.43 % this time.

What is the reason for that? Should I be worried about the output?

d00bin avatar Feb 11 '22 13:02 d00bin

Hello,

That's good news! Did you get through the unfinished RepeatModeler run? How did you make it work?

For masking differences, you may search other issues for similar discussions. Please let me know if you have any other questions.

Best, Shujun

oushujun avatar Feb 11 '22 14:02 oushujun

Did you get through the unfinished RepeatModeler run? How did you make it work?

Actually, I just re-ran the final step with the --step final option and it worked this time.

For masking differences, you may search other issues for similar discussions.

I can't really find related topics in the issues. I saw some where people wonder about the differences between RepeatMasker with RM libraries and EDTA de-novo.

But I first created the repeat library with EDTA, and then used it to mask the genome. For some reason I get different results (16.73% from EDTA sum vs 24.43 % from RepeatMasker out), even though I assumed that EDTA used RepeatMasker to mask the genome in the end. Am I correct about this assumption?

I would just like some general statistics on repeat content, and I don't know which number to trust: 16.73% or 24.43%.

d00bin avatar Feb 14 '22 15:02 d00bin
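
One way to sanity-check a percentage like 24.43% is to count the lowercase bases directly in the soft-masked FASTA that RepeatMasker writes when -xsmall is used. The following is a minimal Python sketch, not part of EDTA or RepeatMasker; the function name `masked_fraction` and the toy sequence are made up for illustration, and whether N bases should be excluded from the denominator is an assumption you may want to change.

```python
import io

def masked_fraction(fasta_handle, ignore_n=True):
    """Return (lowercase_bases, counted_bases) for a soft-masked FASTA stream.

    ignore_n=True leaves N/n out of both counts (an assumption; set it to
    False to count every character of the sequence lines).
    """
    masked = 0
    total = 0
    for line in fasta_handle:
        if line.startswith(">"):
            continue  # header line, no sequence here
        for base in line.strip():
            if ignore_n and base in "Nn":
                continue
            total += 1
            if base.islower():
                masked += 1
    return masked, total

# Toy input standing in for a real soft-masked genome file.
demo = io.StringIO(">chr1\nACGTacgtNN\nacGT\n")
masked, total = masked_fraction(demo)
print(f"{masked}/{total} bases soft-masked ({100 * masked / total:.2f}%)")
```

For a real genome, pass an open file handle (for example `open("genome.masked.fasta")`, a hypothetical path) instead of the `io.StringIO` object.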

Hello @d00bin,

Sorry for the delay. First of all, the line "Low-threshold TE masking for MAKER gene annotation (masked: 1.00%): some_species.fasta.mod.MAKER.masked" says that the file some_species.fasta.mod.MAKER.masked is intended for MAKER gene annotation; it does not represent the actual TE content. Is that confusing?

If everything ran without error, the EDTA sum file represents what the program believes to be the TE content of the genome.

16.73% from EDTA sum vs 24.43 % from RepeatMasker out

This is a significant difference. Can you paste the RepeatMasker command you used here?

Shujun

oushujun avatar Feb 28 '22 05:02 oushujun

@d00bin, has this issue been resolved?

oushujun avatar Apr 06 '22 07:04 oushujun

@oushujun Dear Shujun, I'm terribly sorry for such a delayed response!

Nope the issue is still there.

The command I used for RepeatMasker is:

RepeatMasker \
-a -gff -pa 32 -u \
-dir final_RepeatMasker_out \
-xsmall \
-nolow \
-lib /path/to/EDTA/library.fasta.mod.EDTA.TElib.fa \
genome_chromosomelevel.fasta

"Low-threshold TE masking for MAKER gene annotation (masked: 1.00%): some_species.fasta.mod.MAKER.masked" says that the file some_species.fasta.mod.MAKER.masked is intended for MAKER gene annotation; it does not represent the actual TE content. Is that confusing?

And this I understand, yes.

d00bin avatar Apr 06 '22 07:04 d00bin
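
To compare the two annotations on the same footing, one option is to compute the non-redundant number of bases covered by the features in each GFF file (EDTA's TEanno.gff3 and the GFF that RepeatMasker writes with -gff) and divide by the genome length. The sketch below is an illustration only: `covered_bases` is a hypothetical helper, the demo features and the genome length of 1000 are invented, and it does not claim to reproduce how either tool computes its own summary.

```python
from collections import defaultdict

def covered_bases(gff_lines):
    """Sum the bases covered by GFF features, counting overlaps only once."""
    intervals = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a feature line
        # GFF columns 4 and 5 are 1-based, inclusive start and end.
        intervals[fields[0]].append((int(fields[3]), int(fields[4])))

    total = 0
    for spans in intervals.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                cur_end = max(cur_end, end)  # overlapping: extend current run
            else:
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
    return total

# Invented features: two overlapping hits on chr1, one separate, one on chr2.
demo = [
    "##gff-version 3\n",
    "chr1\tdemo\trepeat\t1\t100\t.\t+\t.\tID=a\n",
    "chr1\tdemo\trepeat\t51\t150\t.\t+\t.\tID=b\n",
    "chr1\tdemo\trepeat\t201\t250\t.\t-\t.\tID=c\n",
    "chr2\tdemo\trepeat\t10\t19\t.\t+\t.\tID=d\n",
]
genome_length = 1000  # hypothetical total assembly size
bases = covered_bases(demo)
print(f"{bases} bp covered ({100 * bases / genome_length:.2f}% of genome)")
```

Running the same function on both GFF files would show whether the gap between 16.73% and 24.43% is already present at the level of annotated intervals.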

Is this a non-plant?

oushujun avatar Apr 06 '22 08:04 oushujun

Is this a non-plant?

Yes. It's a teleost fish. A genome of a sister species from the same genus has been published, and its repeat content is ~23%. Also, I previously used this workflow (https://github.com/uio-cels/Repeats) to produce a repeat library for my genome, and the repeat content ended up around 23%. But EDTA is a much more elegant solution than what I used before.

d00bin avatar Apr 06 '22 08:04 d00bin

That makes sense to me. If EDTA produced no errors, then it's running as expected.

oushujun avatar Apr 06 '22 15:04 oushujun

That makes sense to me. If EDTA produced no errors, then it's running as expected.

So is this difference between 16.73% from the EDTA sum and 24.43% from the RepeatMasker output due to my RepeatMasker settings? And if so, what should I consider the "true" TE content of the genome?

d00bin avatar Apr 06 '22 15:04 d00bin

You may want to manually collect some SINE and LINE sequences and provide them to EDTA. These could have been missed.

Shujun

oushujun avatar Apr 06 '22 16:04 oushujun
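
As a rough illustration of collecting SINE and LINE sequences, the Python sketch below pulls those records out of an existing repeat library in FASTA format. It assumes RepeatMasker-style headers of the form `>name#Class/Superfamily`; the function name `select_records` and the demo records are hypothetical, and a library with a different header convention would need a different parsing rule.

```python
def select_records(fasta_lines, classes=("SINE", "LINE")):
    """Yield the lines of FASTA records whose header class is in `classes`.

    Assumes headers look like '>name#Class/Superfamily' (an assumption about
    the input library, not something guaranteed by any tool).
    """
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            header = parts[0] if parts else ""
            tag = header.split("#", 1)[1] if "#" in header else ""
            keep = tag.split("/")[0] in classes
        if keep:
            yield line

# Invented records standing in for a real library file.
demo = [
    ">TE_001#LINE/L1\n", "ACGTACGT\n",
    ">TE_002#DNA/hAT\n", "TTTT\n",
    ">TE_003#SINE/tRNA\n", "GGCC\n",
]
selected = "".join(select_records(demo))
print(selected, end="")
```

The selected records could then be written to a file and supplied to EDTA as a curated library; the EDTA README describes a --curatedlib option for this purpose, but check the current documentation for its requirements before relying on it.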