Plant Annotation

Open amvarani opened this issue 3 years ago • 18 comments

Hi there,

I have made a guide for annotating plant TEs and generating reports, using a modified version of EDTA (full lineage annotation) together with AnnoSINE, MGEScan-NonLTR, and TEsorter. I hope this will help improve the annotation of plant genomes.

link: https://github.com/amvarani/Plant_Annotation_TEs

Best regards

amvarani avatar Dec 02 '22 11:12 amvarani

Hi @amvarani ,

Thank you for your work! It looks comprehensive, with great effort put into solving the SINE/LINE annotation. Would you be able to benchmark the annotation performance using the benchmarking pipeline in the EDTA package? I am very curious to learn about the overall performance and the performance in each TE category. If you have the de novo annotation of the rice genome, I can help with the benchmarking too. Thank you.

Best regards, Shujun

oushujun avatar Dec 03 '22 18:12 oushujun

Hi @oushujun, yes, let's do it. Should I use the rice genome (TIGR7/MSU7 version)? I will start the annotation and share the results with you.

Best regards

amvarani avatar Dec 05 '22 12:12 amvarani

Yes, that version of the rice genome is what I used for benchmarking. Thanks!

Shujun

oushujun avatar Dec 05 '22 15:12 oushujun

Hi there! Here is the file for benchmarking

rice.zip

amvarani avatar Dec 06 '22 21:12 amvarani

@amvarani Thanks! Can you remind me what version of EDTA you were using?

oushujun avatar Dec 07 '22 04:12 oushujun

version v2.0.1

amvarani avatar Dec 07 '22 10:12 amvarani

Hi there, any considerations or comments regarding these results?

amvarani avatar Dec 11 '22 19:12 amvarani

Sorry for the delay.

I generated the rice annotation using EDTA v2.1.0 without the curated library. I believe your annotation is also free of the curated library. v2.1.0 is basically the same as v2.0.1 with the panEDTA addition. I benchmarked the two for different TE types and levels. The PlantAnnotation_TEs pipeline significantly improved the sensitivity of SINE/LINE annotation, but this appears to come at the cost of lower TIR sensitivity and inflated SINE/LINE FDRs. So AnnoSINE and MGEScan-NonLTR could be helpful, but they will require further filtering to reduce the FDR and to avoid eroding TIR identification.

Category    Method               Sensitivity  Specificity  Accuracy  Precision  FDR    F1
ltr         PlantAnnotation_TEs  0.882        0.993        0.966     0.976      0.024  0.927
ltr         EDTA_denovo          0.919        0.992        0.975     0.974      0.026  0.946
tir         PlantAnnotation_TEs  0.496        0.990        0.891     0.923      0.077  0.645
tir         EDTA_denovo          0.688        0.915        0.872     0.654      0.346  0.670
helitron    PlantAnnotation_TEs  0.890        0.844        0.846     0.179      0.821  0.299
helitron    EDTA_denovo          0.778        0.871        0.868     0.194      0.806  0.311
sine        PlantAnnotation_TEs  0.640        1.000        0.998     0.867      0.133  0.736
sine        EDTA_denovo          0.245        1.000        0.996     1.000      0.000  0.394
line        PlantAnnotation_TEs  0.472        0.998        0.988     0.828      0.172  0.601
line        EDTA_denovo          0.250        1.000        0.984     0.965      0.035  0.397
mite        PlantAnnotation_TEs  0.507        0.970        0.942     0.524      0.476  0.516
mite        EDTA_denovo          0.328        0.983        0.940     0.566      0.434  0.415
nonltr      PlantAnnotation_TEs  0.513        0.998        0.986     0.849      0.151  0.639
nonltr      EDTA_denovo          0.249        1.000        0.980     0.972      0.028  0.396
classified  PlantAnnotation_TEs  0.907        0.903        0.905     0.895      0.105  0.901
classified  EDTA_denovo          0.933        0.856        0.893     0.855      0.145  0.892
total       PlantAnnotation_TEs  0.941        0.892        0.915     0.887      0.113  0.913
total       EDTA_denovo          0.937        0.854        0.893     0.853      0.147  0.893
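
For readers less familiar with the six metric columns (sensitivity, specificity, accuracy, precision, FDR, F1), here is a minimal sketch of how they relate to each other. It assumes the usual confusion-matrix definitions over true-positive, false-positive, true-negative, and false-negative counts; that the counts are base pairs is an assumption here, and the function name and example counts are hypothetical, not taken from the EDTA benchmarking scripts.

```python
def benchmark_metrics(tp, fp, tn, fn):
    """Compute the six table columns from confusion-matrix counts.

    Assumption: tp/fp/tn/fn are counts (for example base pairs) of
    correctly and incorrectly annotated sequence for one TE category.
    """
    sens = tp / (tp + fn)                   # sensitivity (recall)
    spec = tn / (tn + fp)                   # specificity
    accu = (tp + tn) / (tp + fp + tn + fn)  # accuracy
    prec = tp / (tp + fp)                   # precision
    fdr = fp / (tp + fp)                    # false discovery rate = 1 - precision
    f1 = 2 * prec * sens / (prec + sens)    # harmonic mean of precision and sensitivity
    return {"sens": sens, "spec": spec, "accu": accu,
            "prec": prec, "FDR": fdr, "F1": f1}


def f1_from(prec, sens):
    """F1 recomputed from a table row's precision and sensitivity."""
    return 2 * prec * sens / (prec + sens)


# Hypothetical counts, for illustration only.
example = benchmark_metrics(tp=80, fp=20, tn=880, fn=20)

# Consistency check against the ltr / PlantAnnotation_TEs row above.
ltr_f1 = round(f1_from(prec=0.976, sens=0.882), 3)
```

With precision 0.976 and sensitivity 0.882 (the ltr row for PlantAnnotation_TEs), the F1 formula gives 0.927, which matches the table. FDR is simply 1 minus precision in every row.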

oushujun avatar Jan 01 '23 21:01 oushujun

Hi, thanks for the feedback! I will try to filter the SINE/LINE annotation further and verify the TIR annotation, then generate the reports for an additional round of benchmarking.

Best

amvarani avatar Jan 04 '23 12:01 amvarani

@amvarani sounds great! I am happy to iterate.

Best, Shujun

oushujun avatar Jan 05 '23 19:01 oushujun

Dear @oushujun

Here is the rice annotation, free of the curated library, now using new filtering rules that try to improve TIR sensitivity and reduce SINE/LINE FDRs. Could you please check it?

rice-v2.zip

amvarani avatar Jan 09 '23 17:01 amvarani

I notice that several family naming systems are used in this annotation:

BARE-2/Copia -- -- --
  Copia/Ale-like 43 63236 0.02%
  Copia/TAR-like 181 39860 0.01%
BARE-2/Gypsy -- -- --
  Gypsy/Ogre-like 3 201 0.00%
  Gypsy/Retand-like 1768 2891569 0.77%
  Gypsy/Tekay-like 431 322442 0.09%
LTR/Copia -- -- --
  Copia/Ale 1930 1617025 0.43%
  Copia/Ale-like 414 242840 0.07%
  Copia/Angela 137 283380 0.08%
  Copia/Bianca 958 926840 0.25%
  Copia/Bianca-like 1023 1556603 0.42%
  Copia/Ikeros 528 682196 0.18%
  Copia/Ikeros-like 558 821253 0.22%
  Copia/Ivana 1966 1435071 0.38%
  Copia/Ivana-like 105 52823 0.01%
  Copia/SIRE 1979 2072588 0.56%
  Copia/SIRE-like 3576 3414747 0.91%
  Copia/TAR 2290 3356308 0.90%
  Copia/TAR-like 33 1657 0.00%
  Copia/Tork 885 666017 0.18%
  Copia/Tork-like 632 265744 0.07%
LTR/Gypsy -- -- --
  Gypsy/Athila-like 2223 2618749 0.70%
  Gypsy/CRM 2471 1920670 0.51%
  Gypsy/CRM-like 1179 713980 0.19%
  Gypsy/Ogre 4053 3525101 0.94%
  Gypsy/Ogre-like 5056 6671544 1.79%
  Gypsy/Reina 2266 2556588 0.68%
  Gypsy/Reina-like 1279 1386811 0.37%
  Gypsy/Retand 8789 12483725 3.34%
  Gypsy/Retand-like 6806 7812153 2.09%
  Gypsy/Tekay 5788 8062074 2.16%
  Gypsy/Tekay-like 3266 3531791 0.95%
TR_GAG/Copia -- -- --
  Copia/Ale-like 12 7685 0.00%
  Copia/Bianca-like 56 21760 0.01%
  Copia/Ikeros-like 432 266374 0.07%
  Copia/Ivana-like 63 33164 0.01%
  Copia/SIRE-like 479 244089 0.07%
TR_GAG/Gypsy -- -- --
  Gypsy/Athila-like 1422 1110339 0.30%
  Gypsy/CRM-like 1882 1577616 0.42%
  Gypsy/Ogre-like 461 95233 0.03%
  Gypsy/Retand-like 659 418080 0.11%
  Gypsy/Tekay-like 1422 744851 0.20%

Could this be unified?

Shujun

oushujun avatar Jan 11 '23 04:01 oushujun

Hi, yes, it can be simplified. Please take a look at rice-v3.zip

amvarani avatar Jan 11 '23 16:01 amvarani

The v2 annotation shows a marginal recovery of sensitivity in TIR annotation (0.8 percentage points, from 0.496 to 0.504), but it lost 3-4% of sensitivity for SINE and LINE, while their FDR decreased by 2-3%.

Category    Method                  Sensitivity  Specificity  Accuracy  Precision  FDR    F1
ltr         PlantAnnotation_TEs_v2  0.884        0.992        0.966     0.973      0.027  0.926
tir         PlantAnnotation_TEs_v2  0.504        0.986        0.890     0.899      0.101  0.646
helitron    PlantAnnotation_TEs_v2  0.881        0.856        0.857     0.191      0.809  0.314
sine        PlantAnnotation_TEs_v2  0.607        1.000        0.998     0.889      0.111  0.722
line        PlantAnnotation_TEs_v2  0.439        0.999        0.988     0.856      0.144  0.580
mite        PlantAnnotation_TEs_v2  0.526        0.969        0.942     0.519      0.481  0.523
nonltr      PlantAnnotation_TEs_v2  0.479        0.998        0.986     0.876      0.124  0.619
classified  PlantAnnotation_TEs_v2  0.907        0.903        0.905     0.895      0.105  0.901
total       PlantAnnotation_TEs_v2  0.939        0.893        0.915     0.889      0.111  0.913
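
The version-to-version changes discussed here can be recomputed directly from the two benchmark tables in this thread. The sketch below copies the tir, sine, and line sensitivity and FDR values for the first PlantAnnotation_TEs run and for v2, and reports the differences in percentage points; the dictionary layout and the function name are illustrative assumptions.

```python
# Sensitivity and FDR copied from the benchmark tables in this thread.
FIRST_RUN = {
    "tir":  {"sens": 0.496, "FDR": 0.077},
    "sine": {"sens": 0.640, "FDR": 0.133},
    "line": {"sens": 0.472, "FDR": 0.172},
}
V2 = {
    "tir":  {"sens": 0.504, "FDR": 0.101},
    "sine": {"sens": 0.607, "FDR": 0.111},
    "line": {"sens": 0.439, "FDR": 0.144},
}


def delta_points(old, new):
    """Return new minus old for every metric, in percentage points."""
    return {
        category: {
            metric: round((new[category][metric] - old[category][metric]) * 100, 1)
            for metric in old[category]
        }
        for category in old
    }


changes = delta_points(FIRST_RUN, V2)
for category, change in changes.items():
    print(category, change)
```

Reporting the differences in percentage points keeps them on the same scale as the "3-4%" and "2-3%" figures quoted in the comment.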

What was your extra filtering to generate v2?

Thanks, Shujun

oushujun avatar Jan 27 '23 05:01 oushujun

Hi, I'm sorry for the long delay in answering.

Well, for SINEs, I did not modify anything. For LINEs, I increased the stringency for identifying domains by about 20%. For TIRs, the domain identification parameters were relaxed by about 20%. The marginal recovery of sensitivity in TIR annotation is expected since the parameters were changed, but the change for SINEs is not.

I will be working with several different parameter settings to try to find the best set.

amvarani avatar Feb 07 '23 18:02 amvarani

You may need to filter out false SINEs and LINEs in more ways, such as checking the repetitiveness of their flanking sequences and the presence of other TEs. Let me know if you have a new set of results.

Best, Shujun

oushujun avatar Feb 10 '23 16:02 oushujun

Hi there, here is a new set of results. SINEs and LINEs were checked. I noted that many SINEs are nested with well-annotated TIR elements. The same is true of some LINEs. I removed all nested elements and also verified the abundance of the other occurrences. Here is the file:

rice.fasta.mod.out.gz
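
The nested-element removal described above could look roughly like the following. This is only a sketch under stated assumptions, not the filtering actually used for this file: the `Hit` record, the class labels, and the choice of which classes count as hosts are all hypothetical, and "nested" is taken to mean fully contained within a host element on the same sequence.

```python
from collections import namedtuple

# Hypothetical record for one annotated interval.
Hit = namedtuple("Hit", "chrom start end te_class")


def is_nested(inner, outer):
    """True if `inner` lies entirely within `outer` on the same sequence."""
    return (inner.chrom == outer.chrom
            and outer.start <= inner.start
            and inner.end <= outer.end)


def drop_nested(hits, target_prefixes=("SINE", "LINE"),
                host_prefixes=("TIR", "DNA")):
    """Remove SINE/LINE hits that are fully contained in a host element.

    The prefixes are assumptions about how classes are labelled.
    """
    hosts = [h for h in hits if h.te_class.startswith(host_prefixes)]
    kept = []
    for hit in hits:
        if (hit.te_class.startswith(target_prefixes)
                and any(is_nested(hit, host) for host in hosts)):
            continue  # nested SINE/LINE: treat as a likely false positive
        kept.append(hit)
    return kept


# Made-up coordinates, for illustration only.
hits = [
    Hit("Chr1", 100, 900, "TIR/Mutator"),
    Hit("Chr1", 300, 450, "SINE/tRNA"),    # inside the TIR element
    Hit("Chr1", 1200, 1400, "SINE/tRNA"),  # stands alone
    Hit("Chr2", 300, 450, "LINE/L1"),      # same coordinates, other sequence
]
filtered = drop_nested(hits)
```

The pairwise comparison is quadratic in the number of hits, which is fine for a sketch; on a whole-genome annotation an interval tree or a sorted sweep would be the more practical choice.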

amvarani avatar Feb 17 '23 10:02 amvarani

Here's the benchmark:

Category    Method                  Sensitivity  Specificity  Accuracy  Precision  FDR    F1
nonltr      PlantAnnotation_TEs_v4  0.367        0.999        0.983     0.890      0.110  0.520
mite        PlantAnnotation_TEs_v4  0.519        0.968        0.940     0.508      0.492  0.514
helitron    PlantAnnotation_TEs_v4  0.279        0.996        0.963     0.784      0.216  0.412
total       PlantAnnotation_TEs_v4  0.732        0.968        0.852     0.957      0.043  0.829
tir         PlantAnnotation_TEs_v4  0.507        0.986        0.890     0.898      0.102  0.648
sine        PlantAnnotation_TEs_v4  0.476        1.000        0.997     0.911      0.089  0.625
line        PlantAnnotation_TEs_v4  0.336        0.999        0.986     0.874      0.126  0.486
ltr         PlantAnnotation_TEs_v4  0.817        0.993        0.950     0.975      0.025  0.889
classified  PlantAnnotation_TEs_v4  0.704        0.978        0.842     0.969      0.031  0.815

TIR sensitivity is not improved, while SINE and LINE sensitivity decreased by 10-15% and their FDR decreased by only 2-3%. I think you can use the rice standard library to annotate the v2 library, see which TE categories the false SINEs and LINEs come from, and then find a way to remove them.
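
Once the de novo library has been annotated against the standard library, tallying where the false SINEs and LINEs come from is a simple cross-tabulation. The sketch below assumes the annotation step has already produced pairs of (de novo class, standard-library class) for each library sequence; the input format, the class labels, and the function name are hypothetical.

```python
from collections import Counter


def tally_sources(pairs, suspect_classes=("SINE", "LINE")):
    """Count the reference classes behind mislabelled SINE/LINE entries.

    `pairs` is an iterable of (de_novo_class, reference_class) strings,
    for example ("SINE/tRNA", "DNA/MITE"). A pair is counted only when
    the de novo top-level class is a suspect class and disagrees with
    the top-level class of the reference.
    """
    counts = Counter()
    for de_novo, reference in pairs:
        de_novo_top = de_novo.split("/")[0]
        reference_top = reference.split("/")[0]
        if de_novo_top in suspect_classes and de_novo_top != reference_top:
            counts[(de_novo_top, reference)] += 1
    return counts


# Made-up pairs, for illustration only.
pairs = [
    ("SINE/tRNA", "DNA/MITE"),
    ("SINE/tRNA", "SINE/tRNA"),
    ("LINE/L1", "LTR/Gypsy"),
    ("SINE/tRNA", "DNA/MITE"),
    ("LTR/Copia", "LTR/Copia"),
]
sources = tally_sources(pairs)
for (label, origin), n in sources.most_common():
    print(label, "<-", origin, n)
```

The most frequent reference classes in the tally point to the filters worth adding first.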

Thanks! Shujun

oushujun avatar Mar 09 '23 20:03 oushujun