How is the annotation performance in the large genome (>10G)

Open haoyongchao opened this issue 1 year ago • 21 comments

I would like to use the pipeline on a large plant genome. Would it be better to run it separately on each chromosome or directly on the entire genome? Are there any CPU and RAM requirements? Have you ever tested it on a large genome? Thanks!!

haoyongchao avatar Jul 09 '24 12:07 haoyongchao

Hi @haoyongchao ,

Thank you for using HiTE. While it is possible to run HiTE on the entire genome or on individual chromosomes separately, I recommend running HiTE on the entire genome. Running it on single chromosomes may miss TEs that are distributed across different chromosomes.

Previously, we ran the older version of HiTE on a 4.9 GB wheat genome using 40 CPU cores, which took 2-3 days. In tests with the new version of HiTE, it took 25 hours to process a 2.1 GB maize genome and 10 hours for a 2.6 GB mouse genome. Memory is generally not a limiting factor, but we suggest having 100 GB or more. We haven't tested HiTE on large plant genomes over 10 GB, but you are welcome to try it out. Additionally, if you encounter any issues during the process, we are happy to assist.

Best, Kang

CSU-KangHu avatar Jul 09 '24 13:07 CSU-KangHu

Thank you for your prompt reply. I am running the pipeline on a 10G plant genome using 100 CPUs.

haoyongchao avatar Jul 09 '24 14:07 haoyongchao

Hi, thanks for developing such great software. I am running it on a 9 GB genome, and it seems to produce no output. It has been running since July 30th and is still at “2024-07-30 02:18:12,685 - main.py[line:389] - INFO: cd /HiTE/module && python3 /HiTE/module/judge_LTR_transposons.py -g /dev/hdd/wangjq/genome/Ago/09.repeat/HiTE/Ago.fasta --ltrharvest_home /HiTE/bin/LTR_HARVEST_parallel --ltrfinder_home /HiTE/bin/LTR_FINDER_parallel-master -t 24 --tmp_output_dir /dev/hdd/genome/Ago/repeat/HiTE --recover 1 --miu 7e-09 --use_NeuralTE 1 --is_wicker 0 --NeuralTE_home /HiTE/bin/NeuralTE --TEClass_home /HiTE/classification”. Can you suggest anything?

wjq1981 avatar Aug 16 '24 02:08 wjq1981

Hi @wjq1981,

Thank you for using HiTE, and I apologize for the long runtime. Previously, I didn’t recommend splitting contigs because I was concerned it might break the TE sequences. However, considering the efficiency needed for extremely large genomes, some trade-offs in performance might be necessary.

From your output, it appears that HiTE is still stuck in the first stage of LTR search. I suspect your genome contains some particularly long contigs. Since the LTR module processes each contig separately, with one process handling one contig, a very long contig could cause that process to run for an extended period, leaving other threads idle.

Therefore, I suggest splitting your genome into contigs with more balanced lengths to ensure the runtime is more evenly distributed across processes. Since LTRs typically span up to around 20 kb, you should aim to make the contigs long enough to avoid breaking LTRs—perhaps around 10 Mb? This should help fully utilize the remaining idle processes.

I hope you find this suggestion helpful.

Best,
Kang

CSU-KangHu avatar Aug 16 '24 03:08 CSU-KangHu
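
The splitting step suggested above could be sketched roughly as follows. This is a minimal illustration, not part of HiTE: the function names, the `name_start-end` chunk naming, and the 10 Mb default are assumptions for the example. Sequences shorter than the chunk size are passed through unchanged, and any element that straddles a chunk boundary will still be cut.

```python
from typing import Iterable, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from a FASTA stream."""
    name, parts = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            # Keep only the first word of the header as the sequence name.
            name, parts = line[1:].split()[0], []
        else:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def split_records(
    records: Iterable[Tuple[str, str]], chunk_size: int = 10_000_000
) -> Iterator[Tuple[str, str]]:
    """Cut every sequence longer than chunk_size into consecutive chunks.

    Chunk names carry 1-based start-end coordinates (a hypothetical naming
    scheme) so hits can be mapped back to the original contig.
    """
    for name, seq in records:
        if len(seq) <= chunk_size:
            yield name, seq
            continue
        for start in range(0, len(seq), chunk_size):
            end = min(start + chunk_size, len(seq))
            yield f"{name}_{start + 1}-{end}", seq[start:end]


def write_fasta(
    records: Iterable[Tuple[str, str]], handle: TextIO, width: int = 60
) -> None:
    """Write records as FASTA with fixed-width sequence lines."""
    for name, seq in records:
        handle.write(f">{name}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")
```

To use it, open the genome and an output file and call `write_fasta(split_records(read_fasta(src)), dst)`; the split file is then passed to HiTE in place of the original assembly.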

Thank you for your prompt response. I will give it a try.

wjq1981 avatar Aug 16 '24 03:08 wjq1981

Hello @haoyongchao and @wjq1981,

I hope you’re well. We’ve noticed that the current version of HiTE might not be ideal for handling large genomes, as it tends to require extensive runtime. To address this, we are optimizing the LTR module to enhance both performance and speed. Could you please share the links to the 10GB and 9GB genomes you used previously? This will help us assess whether the improvements have indeed sped up the LTR module.

Best regards,
Kang

CSU-KangHu avatar Sep 05 '24 08:09 CSU-KangHu

Sorry, school started today and I'm only now seeing this. The link is below.

https://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/Alisma_plantago-aquatica/all_assembly_versions/GCA_963693085.1_laAliPlan1.1/GCA_963693085.1_laAliPlan1.1_genomic.fna.gz

wjq1981 avatar Sep 05 '24 13:09 wjq1981

Hi,

Can I break the genome sequences into small pieces (~10 Mb), use HiTE to annotate each piece independently, and then combine the results?

Best, Kun

xiekunwhy avatar Nov 02 '24 02:11 xiekunwhy

Hi @xiekunwhy,

Yes, that's possible. However, I suggest splitting the genome into somewhat larger segments, such as 200 Mb or 400 Mb. After merging the results, you can cluster the sequences and remove redundancy across the different TE libraries. A straightforward approach is to use cd-hit-est. I recommend the parameters -aS 0.95 -aL 0.95 together with either -c 0.8 or -c 0.95, where 0.8 allows more divergence and 0.95 is stricter.

Best,
Kang

CSU-KangHu avatar Nov 02 '24 10:11 CSU-KangHu
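
As a rough sketch of the redundancy-removal step described above, the snippet below assembles a cd-hit-est command line with the recommended -aS, -aL, and -c values. The helper names and file names are hypothetical. The -n word size is not mentioned in the thread; it is added here because the CD-HIT user guide ties the cd-hit-est word size to the identity threshold, and the default word size is only suitable for thresholds near 0.95. -T and -M are the standard CD-HIT thread and memory options.

```python
import shlex
from typing import List


def word_size_for(identity: float) -> int:
    """Pick a cd-hit-est word size (-n) for a given identity threshold (-c).

    The ranges follow the CD-HIT user guide for cd-hit-est.
    """
    if not 0.75 <= identity <= 1.0:
        raise ValueError("cd-hit-est expects an identity between 0.75 and 1.0")
    if identity >= 0.95:
        return 10
    if identity >= 0.90:
        return 8
    if identity >= 0.88:
        return 7
    if identity >= 0.85:
        return 6
    if identity >= 0.80:
        return 5
    return 4


def build_cdhit_command(
    merged_fasta: str, out_fasta: str, identity: float = 0.95, threads: int = 8
) -> List[str]:
    """Build the cd-hit-est call for a merged TE library."""
    return [
        "cd-hit-est",
        "-i", merged_fasta,        # concatenated TE libraries from all segments
        "-o", out_fasta,           # non-redundant library
        "-c", str(identity),       # 0.8 tolerates more divergence, 0.95 is stricter
        "-aS", "0.95",             # alignment must cover 95% of the shorter sequence
        "-aL", "0.95",             # alignment must cover 95% of the longer sequence
        "-n", str(word_size_for(identity)),
        "-T", str(threads),
        "-M", "0",                 # no memory limit
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_cdhit_command("merged_TE.fa", "merged_TE.nr.fa", 0.8)))
```

The list returned by `build_cdhit_command` can be passed to `subprocess.run(cmd, check=True)` once cd-hit-est is on the PATH.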

Does version 3.3.1 solve this problem? I ran it and found that there are still some differences from 3.2.0. I'm not quite sure, so I'm asking.

wjq1981 avatar Jan 03 '25 21:01 wjq1981

Hi @wjq1981,

In HiTE v3.3.1, we’ve replaced LTR_retriever with our newly developed tool, FiLTR, which offers more accurate detection of LTR-RTs.

Additionally, we’ve introduced panHiTE, a Nextflow-based tool designed to assist with pan-genome TE analysis. panHiTE is currently under development. If you're interested, you can check out the tutorial at https://github.com/CSU-KangHu/HiTE/wiki/panHiTE-tutorial.

Thank you for providing the 9GB genome assembly. I must apologize, as I forgot to test it earlier, LOL! I’ve just started running it on my compute node and will share the results as soon as the run completes.

Best regards, Kang

CSU-KangHu avatar Jan 04 '25 03:01 CSU-KangHu

Looking forward to your results! Thank you very much! Very good software!

wjq1981 avatar Jan 04 '25 04:01 wjq1981

Hi @wjq1981,

I just wanted to provide you with an update on the recent run of HiTE with your 9GB genome. There was a minor hiccup during the process. The program was running smoothly, but after 5 days, it failed due to the /tmp directory reaching its 170GB storage limit.

I’ve recently updated the code in the HiTE develop branch to delete large intermediate temporary files after they are no longer needed. This means the analysis will need to be restarted from the beginning for testing.

Preliminary results indicate a significant abundance of LTR-RTs in the genome. I estimate that completing the analysis for your full genome will take approximately 10 days.

Best regards,
Kang

CSU-KangHu avatar Jan 11 '25 05:01 CSU-KangHu
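
A failure like the one described above, several days into a run, can be made less likely by checking free space before starting. The sketch below is a generic pre-flight check and not a HiTE feature; the function name is hypothetical, and the 170 GB figure is only the limit reported in this thread, not a documented requirement.

```python
import shutil
import tempfile


def has_free_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least required_gb free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3


if __name__ == "__main__":
    tmp_dir = tempfile.gettempdir()  # usually /tmp on Linux
    # 170 GB is the point at which the run in this thread failed.
    if not has_free_space(tmp_dir, 170):
        print(f"Warning: less than 170 GB free in {tmp_dir}")
```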

For such a heavy pipeline, I think a one-command design is not a good idea. Similar pipelines such as EDTA (https://github.com/oushujun/EDTA), EarlGrey (https://github.com/TobyBaril/EarlGrey), and LTR_retriever (https://github.com/oushujun/LTR_retriever) share this design, and none of them is very efficient on medium or large genomes. A design based on independent modules may be a better choice: for example, let users run the programs in the bin directory (https://github.com/CSU-KangHu/HiTE/tree/master/bin) independently and then feed their results into the next step for classification and curation. Independent modules may also remove the limitation of running on a single compute node, so users can use an HPC cluster (Slurm, PBS, or SGE) or a cloud computing system.

Some heavy pipelines, such as HapHiC (https://github.com/zengxiaofei/HapHiC), CPhasing (https://github.com/wangyibin/CPhasing), and my bsa pipeline (https://github.com/xiekunwhy/bsa), have an elegant modular design. The repeat element annotation pipeline I wrote in industry (covering calling of all TE and TR types, classification, consensus building, non-redundant library construction, RepeatMasker runs, and some post-processing) can annotate the wheat genome (14 Gb) in under 2 days on a Slurm HPC cluster with several compute nodes. It does not yet include accurate curation; I am planning to adopt or rewrite some curation tools or algorithms.

These are just some suggestions; I will be happy if they help.

Best, Kun

xiekunwhy avatar Jan 11 '25 14:01 xiekunwhy

Hi @xiekunwhy,

Thank you very much for your suggestions—you’ve done a lot of great work, and I wish you all the best.

Our design goal is to make HiTE as user-friendly as possible, minimizing the need for manual intervention. For large genomes, improving runtime efficiency is indeed a critical challenge. To address this, HiTE currently offers two solutions:

  1. Users can specify --te_type to run HiTE separately on different nodes for detecting various types of TEs. The results can then be merged and processed together to obtain the final output.
  2. A more user-friendly approach is to use the HiTE Nextflow pipeline, which automatically parallelizes tasks on Slurm HPC platforms, significantly reducing runtime. While this method requires no manual intervention, the current pipeline still processes LTR detection and other TE detections sequentially, as other TE detections rely on LTR detection results for pre-masking the genome to reduce computational load.

As a result, both approaches yield similar performance in terms of final results. After the run is completed, I will also provide the runtime for the Nextflow pipeline to allow for a comparison.

Best regards,
Kang

CSU-KangHu avatar Jan 12 '25 08:01 CSU-KangHu
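
The first option above (one HiTE run per TE type, each on its own node) could be scripted roughly as below. This is only a sketch under assumptions: the list of --te_type values, the --out_dir flag name, and the per-type output directory layout are not confirmed anywhere in this thread and should be checked against `python main.py -h` for the installed HiTE version. Only --te_type itself is taken from the reply above.

```python
from typing import Dict, List, Sequence

# Assumed --te_type values; verify them with `python main.py -h`.
TE_TYPES = ("ltr", "tir", "helitron", "non-ltr")


def build_te_type_commands(
    genome: str,
    out_root: str,
    threads: int = 40,
    te_types: Sequence[str] = TE_TYPES,
) -> Dict[str, List[str]]:
    """Return one HiTE command per TE type, each with its own output directory."""
    commands = {}
    for te_type in te_types:
        commands[te_type] = [
            "python", "main.py",
            "--genome", genome,
            "--thread", str(threads),
            "--out_dir", f"{out_root}/{te_type}",  # assumed flag name
            "--te_type", te_type,
        ]
    return commands


if __name__ == "__main__":
    # Print the commands so each can be submitted as a separate cluster job.
    for te_type, cmd in build_te_type_commands("genome.fa", "hite_out").items():
        print(te_type, " ".join(cmd))
```

Each command would then be submitted as a separate job, and the resulting libraries merged afterwards as described in the reply.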

Thank you very much for your help. I noticed that my run has stopped as well, which may be the same problem you described. Looking forward to the develop branch!

wjq1981 avatar Jan 16 '25 08:01 wjq1981

Hi @wjq1981,

I’ve just updated the code. Please download the latest code. You can now follow the tutorial below to run HiTE for large genome annotation:
https://github.com/CSU-KangHu/HiTE/wiki/Running-HiTE-for-Large-Genome-Annotation

Best regards,
Kang

CSU-KangHu avatar Jan 20 '25 09:01 CSU-KangHu

Words cannot express my gratitude! Thank you so much for developing such powerful software!

wjq1981 avatar Jan 20 '25 11:01 wjq1981

Thank you for your kind reply. I’m delighted to hear that HiTE has been helpful to you!

CSU-KangHu avatar Jan 20 '25 12:01 CSU-KangHu

Hi, I’m running the HiTE pipeline using Nextflow and encountered the following error during the OtherTE process:

Traceback (most recent call last):
  File "/path/to/judge_Other_transposons.py", line 108, in <module>
    min_TE_len = int(args.min_TE_len)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

It seems the script requires a --min_TE_len argument, but I could not find this parameter documented or defined in the Nextflow pipeline (main.nf) or the default configuration. To work around this, I tried manually adding min_TE_len = 100 in nextflow.config, which resolved the error. However, I’m not sure if this is the correct approach.

Could you please clarify:

  1. Is --min_TE_len an officially required parameter that should be passed by the pipeline?
  2. Or does this indicate something went wrong in an earlier step that should have generated this parameter?

Thanks a lot for your help!

Best regards

shandows avatar May 23 '25 13:05 shandows

Hi @shandows, You're absolutely right — your approach is correct. We had modified the parameters but forgot to update them in the Nextflow script. Thanks for pointing it out!

CSU-KangHu avatar May 24 '25 02:05 CSU-KangHu
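
The failure above can be reproduced in isolation. The sketch below is not HiTE's code: `parse_min_te_len` is a hypothetical helper that mimics an argparse option without a default, shows why `int()` then raises a TypeError when the wrapper script omits the flag, and shows how a default (100 here, the value from the workaround above) avoids it.

```python
import argparse
from typing import List, Optional


def parse_min_te_len(argv: List[str], default: Optional[int] = None) -> int:
    """Parse --min_TE_len the way a script without a fallback would."""
    parser = argparse.ArgumentParser()
    # With default=None, an omitted flag leaves args.min_TE_len as None.
    parser.add_argument("--min_TE_len", default=default)
    args = parser.parse_args(argv)
    return int(args.min_TE_len)  # int(None) raises TypeError


if __name__ == "__main__":
    try:
        parse_min_te_len([])  # flag not passed, no default
    except TypeError as err:
        print("Without a default:", err)

    # Either a default or an explicit flag makes the conversion succeed.
    print(parse_min_te_len([], default=100))
    print(parse_min_te_len(["--min_TE_len", "80"]))
```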