Does --overwrite 0 recover RepeatModeler in progress?

Open kaede0e opened this issue 2 years ago • 12 comments

Hello, thanks for developing this comprehensive TE discovery pipeline. We are aiming to annotate multiple plant genomes with TEs de novo, which has been taking a lot more computational time than we initially expected. I managed to finish one genome (~220 Mb, ~23% TE content, 8 days of CPU time), but I am struggling to finish the pipeline for the others. For most of the genomes, the job times out in the middle of the RepeatModeler step. I tried running RepeatModeler separately to see whether that would finish sooner, and discovered that it has a -recoverDir option to resume from where the previous run left off. So I was wondering whether the EDTA pipeline can recover results from a RepeatModeler run in progress instead of restarting from the beginning (I've been running the command: EDTA.pl --step final --overwrite 0).

Sincerely, Kaede

kaede0e avatar Jan 26 '22 19:01 kaede0e

Hi Kaede,

EDTA will pick up the RepeatModeler result if its final product $genome.RM.consensi.fa is detected. Otherwise it will run ${repeatmodeler}RepeatModeler -engine ncbi -pa $threads -database $genome.masked 2>/dev/null, which does not contain the -recoverDir option and probably won't recycle unfinished runs. You may try to add this parameter to the command in Line 493 of your EDTA.

Best, Shujun

oushujun avatar Jan 31 '22 15:01 oushujun
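
The behavior described in the reply above can be sketched roughly as follows. This is a minimal Python illustration, not EDTA's actual Perl code: the function name `repeatmodeler_command` and its parameters are hypothetical, and only the `$genome.RM.consensi.fa` file name and the RepeatModeler flags come from the thread. It builds the command instead of running it, so it can be inspected without RepeatModeler installed.

```python
from pathlib import Path
from typing import List, Optional


def repeatmodeler_command(genome: str, threads: int,
                          recover_dir: Optional[str] = None) -> Optional[List[str]]:
    """Return the RepeatModeler command to run, or None if a finished result exists.

    Hypothetical helper that mirrors the logic described in the thread.
    """
    # A finished result ($genome.RM.consensi.fa) is reused, so nothing needs to run.
    if Path(f"{genome}.RM.consensi.fa").is_file():
        return None

    cmd = ["RepeatModeler", "-engine", "ncbi", "-pa", str(threads),
           "-database", f"{genome}.masked"]
    # Assumption: appending -recoverDir resumes an unfinished run, as the
    # original poster observed when running RepeatModeler on its own.
    if recover_dir is not None:
        cmd += ["-recoverDir", recover_dir]
    return cmd
```

For example, `repeatmodeler_command("genome.fa", 8, recover_dir="RM_271804.WedFeb21527082022")` returns the argument list with `-recoverDir` appended, or `None` once `genome.fa.RM.consensi.fa` exists.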

Hi Shujun, Oh I see, thanks for the clarification. I'll try adding it.

Sincerely, Kaede

kaede0e avatar Jan 31 '22 18:01 kaede0e

Please let me know if it works!

Shujun

oushujun avatar Jan 31 '22 23:01 oushujun

Hi Shujun,

Hmm, it seems that RepeatModeler didn't run properly after I added the --recoverDir RM_* option to line 493. The earlier run had stopped during round-6, so I was hoping to pick up from there. But the job finished in a few hours (instead of days), and the files in my RM_* directory indicate that round-6 is incomplete. I didn't find consensi.fa and the other files that are supposed to be there:

    drwxr-xr-x 2 kaedeh 408K Jan 29 05:53 round-1
    drwxr-xr-x 4 kaedeh  40K Jan 29 06:01 round-2
    drwxr-xr-x 4 kaedeh 112K Jan 29 06:57 round-3
    drwxr-xr-x 4 kaedeh 412K Jan 29 14:59 round-4
    -rw-r--r-- 1 kaedeh  51M Feb  1 17:40 families.stk
    drwxr-xr-x 4 kaedeh 1.5M Feb  1 17:46 round-5
    drwxr-xr-x 2 kaedeh 796K Feb  6 16:57 round-6

The dates are all odd. Below is the log file from the job; I doubt that it properly completed all 6 rounds of RepeatModeler.

Tue Feb 8 04:04:35 PST 2022 Perform EDTA final steps to generate a non-redundant comprehensive TE library:

                            Use RepeatModeler to identify any remaining TEs that are missed by structure-based methods.

cat: 'RM_*/consensi.fa': No such file or directory
RepeatModeler is finished, but no consensi.fa files found.

I did get the genome.mod.EDTA.TElib.fa and genome.mod.EDTA.intact.gff3 output files (not empty), but I wonder whether they were produced without the RepeatModeler output.

What do you think? I guess the recoverDir extension did not work...

kaede0e avatar Feb 08 '22 17:02 kaede0e

Maybe you want to try this parameter on RepeatModeler first to make sure it will pick up from where it stopped. You can make a copy of the unfinished run and test it. You may find more discussions on their github.

Shujun

oushujun avatar Feb 08 '22 17:02 oushujun
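
Before retrying a recovery, it may help to check what an unfinished RM_* directory actually contains. The sketch below is a hypothetical helper (`summarize_rm_run` is not part of EDTA or RepeatModeler); it relies only on the directory layout shown in the listing earlier in the thread, namely round-N subdirectories and a consensi.fa file that appears when the run completes.

```python
import re
from pathlib import Path
from typing import Dict


def summarize_rm_run(run_dir: str) -> Dict[str, object]:
    """Report which round-N directories exist and whether consensi.fa was written."""
    root = Path(run_dir)
    rounds = []
    for entry in root.iterdir():
        match = re.fullmatch(r"round-(\d+)", entry.name)
        if match and entry.is_dir():
            rounds.append(int(match.group(1)))
    rounds.sort()
    return {
        "rounds": rounds,
        "last_round": rounds[-1] if rounds else None,
        # The thread treats a missing consensi.fa as the sign of an incomplete run.
        "has_consensi": (root / "consensi.fa").is_file(),
    }
```

To test recovery on a copy, as suggested above, the directory could first be duplicated with `shutil.copytree(run_dir, backup_dir)` and the copy inspected or passed to -recoverDir, leaving the original untouched.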

Hi Shujun,

The -recoverDir option does work when I run RepeatModeler separately. The command line looks like this: RepeatModeler -database ${genus}_whole_genome -recoverDir RM_271804.WedFeb21527082022 -pa 1

I was going to make a copy of this unfinished run and test it, but I realized that the intermediate files were deleted by my first attempt (line 494: rm $genome.masked.nhr $genome.masked.nin $genome.masked.nnd $genome.masked.nni $genome.masked.nog $genome.masked.nsq). Without them I have nothing to pass to the -database argument, so I can't redo it unless I restart from the beginning. Is there a way to retrieve these files, or do I need to restart?

Thanks, Kaede

kaede0e avatar Feb 09 '22 17:02 kaede0e

Hi Kaede,

You should be able to regenerate these files with the indexing command: ${repeatmodeler}BuildDatabase -name $genome.masked -engine ncbi $genome.masked;

Best, Shujun

oushujun avatar Feb 10 '22 13:02 oushujun
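
A small check along these lines could confirm which index files are missing before rebuilding. It is only a sketch: the helper names are hypothetical, the extensions are the six listed in the rm command on line 494, and the BuildDatabase arguments are copied from the reply above. The command is returned as a list rather than executed.

```python
from pathlib import Path
from typing import List

# Index file extensions named in the thread (removed by the rm on line 494).
INDEX_EXTENSIONS = ("nhr", "nin", "nnd", "nni", "nog", "nsq")


def missing_index_files(database: str) -> List[str]:
    """List the expected index files that are not present for this database name."""
    return [f"{database}.{ext}" for ext in INDEX_EXTENSIONS
            if not Path(f"{database}.{ext}").is_file()]


def build_database_command(database: str) -> List[str]:
    """BuildDatabase command as quoted in the thread.

    The database name and the input FASTA are the same path ($genome.masked).
    """
    return ["BuildDatabase", "-name", database, "-engine", "ncbi", database]
```

If `missing_index_files("genome.fa.masked")` returns a non-empty list, the list from `build_database_command("genome.fa.masked")` can be passed to `subprocess.run` to rebuild the index before retrying with -recoverDir.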

@kaede0e has this been resolved?

oushujun avatar Apr 06 '22 07:04 oushujun

No, but we decided to move forward with running EDTA chromosome by chromosome to fit our computational resources.

kaede0e avatar Apr 06 '22 16:04 kaede0e

You will need the pan-genome method to combine sublibraries to control false positives. Check out this work: https://github.com/HuffordLab/NAM-genomes/tree/master/te-annotation

Shujun

oushujun avatar Apr 06 '22 16:04 oushujun

Hi Shujun, I don't fully understand why running EDTA chromosome by chromosome would create false positives. Is it that the raw TE library filtering stage could misidentify TEs if I use only one chromosome at a time? If I combine the curated libraries from each chromosome without the additional steps in this pan-genome pipeline, why would there be false positives?

kaede0e avatar Apr 06 '22 16:04 kaede0e

Each EDTA run produces some false positives (FPs) that cannot be fully removed. Most of them are low copy. Combining multiple runs inflates these FPs, and the pan-genome module can effectively control them.

Shujun

oushujun avatar Apr 06 '22 16:04 oushujun
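
The inflation argument can be illustrated with a toy calculation. Everything here is hypothetical: the family names are invented, and the sketch assumes that real TE families recur on every chromosome while each run's false positives are specific to that run. It does not reproduce what the pan-genome module does; it only shows why a naive union of libraries raises the false-positive share.

```python
from typing import Dict, Set


def combine_libraries(runs: Dict[str, Set[str]]) -> Set[str]:
    """Naively merge per-chromosome libraries by taking the union of family names."""
    combined: Set[str] = set()
    for families in runs.values():
        combined |= families
    return combined


# Toy data (entirely hypothetical): two real families recur on every chromosome,
# while each run also reports one spurious, run-specific entry.
runs = {
    "chr1": {"TE_A", "TE_B", "FP_chr1"},
    "chr2": {"TE_A", "TE_B", "FP_chr2"},
    "chr3": {"TE_A", "TE_B", "FP_chr3"},
}

combined = combine_libraries(runs)
false_positives = {name for name in combined if name.startswith("FP_")}
print(len(combined), len(false_positives))  # 5 3
```

In each single run, 1 of 3 entries is a false positive; in the merged library it is 3 of 5, because the shared families collapse into one entry each while the run-specific false positives accumulate.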