
panHiTE process issues

Open zbh200937 opened this issue 6 months ago • 2 comments

When running the panHiTE pipeline, the process often keeps running indefinitely after the command java -Xmx2g -jar /mnt/sdb/zhangbh/biosoftware/HiTE-master/bin/HelitronScanner/HelitronScanner.jar scanTail -lcv_filepath appears. The pipeline then gets stuck and eventually terminates due to a timeout. The HiTE version is the latest one, downloaded on May 27, 2025.

zbh200937 avatar May 28 '25 07:05 zbh200937

Hi @zbh200937,

Could you please share the complete output log? Additionally, I’d like to know the size of the genome. If possible, it would be helpful if you could provide the specific genome that failed to complete.

Best regards, Kang

CSU-KangHu avatar May 28 '25 09:05 CSU-KangHu

I am running the panHiTE pipeline on 22 genomes with a size of 2.7G each, downloaded from https://www.tea-pangenome.cn/download/. I restarted the panHiTE pipeline yesterday, and although the run is still ongoing, the same issue has reappeared: the HelitronScanner.jar process runs for an extended period. In previous attempts, only this HelitronScanner.jar process ran for a long time while the panHiTE pipeline made no further progress, and no error report appeared in pipeline_info. I am uploading the current top output and the Nextflow log from the previous run. I hope this will be helpful for your work.

.nextflow.log

Image

zbh200937 avatar May 29 '25 00:05 zbh200937

Hi @zbh200937,

Based on the download link you provided, I tested the largest genome file, ZJ.chrom.fasta (3.1 GB). Since panHiTE internally calls HiTE for each genome, I ran HiTE separately on this genome. After approximately 33 hours, the program successfully generated the final result files.

From the top output and your log, it appears that the process is still running without any obvious errors.

Although panHiTE can run on a single node, we strongly recommend using an HPC environment to accelerate the computation, if one is available to you.

Best regards, Kang

CSU-KangHu avatar Jun 03 '25 01:06 CSU-KangHu

Thank you very much for your attempt. I also obtained results by running HiTE separately and identified the cause of the issue in the panHiTE workflow. Due to the thread setting of 100, a large number of processes attempted to transfer data through the pipe, but the pipe buffer became full, leading to a buffer overflow. This caused the processes to hang and stop working, ultimately resulting in the termination of the Nextflow workflow.
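The hang described above is consistent with general pipe behavior: when a child process writes more than the pipe buffer holds and nothing reads the other end, the writer blocks. The following is a minimal sketch of that idea, not code from panHiTE. It drains a child's output with subprocess.communicate() so the writer is never left waiting. The function name run_and_drain and the CHILD command are hypothetical stand-ins for illustration.

```python
import subprocess
import sys


def run_and_drain(cmd):
    """Run cmd and read stdout/stderr until EOF so the child never
    blocks on a full pipe buffer. Returns (returncode, stdout, stderr)."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # communicate() reads both pipes concurrently, then waits for exit.
    out, err = proc.communicate()
    return proc.returncode, out, err


# Hypothetical child: writes 1 MiB, far more than the 64 KiB default
# capacity of a Linux pipe, so an unread pipe would block it.
CHILD = [sys.executable, "-c", "import sys; sys.stdout.write('x' * (1 << 20))"]

if __name__ == "__main__":
    code, out, _ = run_and_drain(CHILD)
    print(code, len(out))
```

Lowering the thread count, as was done here, reduces the number of concurrent writers. Draining the pipes addresses the blocking itself, assuming the output is actually being consumed through a pipe.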

zbh200937 avatar Jun 05 '25 09:06 zbh200937

@CSU-KangHu Dear Kang, after rerunning the panHiTE pipeline, the previous process-freezing issue has been alleviated, but a new problem has emerged. The HiTE results for individual genomes in the pan_run_hite_single folder are incomplete, as shown in Figure 1, and some are even empty, as shown in Figure 2.

Image

Image

zbh200937 avatar Jun 09 '25 00:06 zbh200937

It's possible that the program did not complete successfully. You can check the .nextflow.log file to locate the work_dir corresponding to the problematic genome process. Then, go into that work_dir and examine the log files inside to see if any errors were reported.

CSU-KangHu avatar Jun 09 '25 02:06 CSU-KangHu

Dear Kang, I have uploaded two problematic log files. One indicates a missing BLAST database (Step 2), while the other shows that during "Splitting genome assembly into chunks" there were 8 partitions but only two were analyzed. Although I identified the issues from the logs, I don't know how to resolve them and would appreciate your help.

log.zip

zbh200937 avatar Jun 09 '25 05:06 zbh200937

One issue indicates that the BLAST database is missing (Step 2).

Normally, this step uses makeblastdb to generate an index for the genome (genome.fa.clean) in your working directory (/mnt/sdb/zhangbh/pangenmics/workdir/pan_run_hite_single_e11dbde7-795e-4fd5-8d37-836d370f29ad). The fact that the index was not found suggests that either the indexing step did not run properly or the generated files were deleted afterward.
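One way to confirm whether the index exists is to look for the files that makeblastdb writes for a nucleotide database (.nhr, .nin, .nsq). This is a minimal sketch, not part of HiTE. The function names are hypothetical, and the glob also accepts the multi-volume layout (for example name.00.nsq) that makeblastdb can produce for large inputs.

```python
import glob


def blast_index_files(fasta_path):
    """Return nucleotide BLAST index files found next to fasta_path.

    Matches single-volume names (name.nsq) and multi-volume names
    (name.00.nsq), because '*' also matches the empty string.
    """
    found = []
    for ext in ("nhr", "nin", "nsq"):
        found.extend(glob.glob(glob.escape(fasta_path) + "*." + ext))
    return sorted(found)


def has_blast_index(fasta_path):
    """True if at least one file of each core extension is present."""
    files = blast_index_files(fasta_path)
    return all(any(f.endswith("." + ext) for f in files)
               for ext in ("nhr", "nin", "nsq"))
```

Calling has_blast_index on the genome.fa.clean path in the working directory right after Step 2 would show whether the index was never written or was removed later.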

Another issue appears during the "Splitting genome assembly into chunks" step: although 8 partitions were expected, only two were detected.

According to the log, the program only found two partition files (genome.cut0.fa, genome.cut1.fa). It determines the number of partitions by matching filenames in the working directory using a regular expression pattern (genome.cut(\d+).fa$). If you're certain that split_genome_chunks.py created 8 partitions (genome.cut0.fa to genome.cut7.fa), it's worth checking whether there was a problem with file creation or file visibility in the directory.
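The filename matching described above can be reproduced outside the pipeline to see which partitions are visible in a directory. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the dots in the pattern are escaped here so they match literally, whereas the pattern quoted above leaves them unescaped.

```python
import os
import re

# Same shape as the quoted pattern, with the dots escaped.
PARTITION_RE = re.compile(r"genome\.cut(\d+)\.fa$")


def partition_indices(directory):
    """Return the sorted partition numbers found in directory."""
    indices = []
    for name in os.listdir(directory):
        m = PARTITION_RE.match(name)
        if m:
            indices.append(int(m.group(1)))
    return sorted(indices)


def missing_partitions(directory, expected):
    """Return partition numbers in range(expected) with no matching file."""
    present = set(partition_indices(directory))
    return [i for i in range(expected) if i not in present]
```

For the case in this thread, missing_partitions(work_dir, 8) would list which of the partitions 0 to 7 the process could not see at the time of the check.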

CSU-KangHu avatar Jun 12 '25 01:06 CSU-KangHu

Although these two tasks were not completed, the panHiTE workflow seemed to consider them finished and deleted the temporary working folders, making it difficult for me to trace the origins of these two issues. However, the final results show that pan_split_genome/LTDC.chrom.fasta did generate more than just files 0 and 1 (Figure 1).

Image

In pan_run_hite_single/ZJ.chrom.fasta, although all files were empty, results were still output (Figure 2).

Image

execution_report_2025-06-07_15-44-35.zip

Additionally, the overall panHiTE workflow was stopped due to a timeout. I also tested panHiTE on 10 genomes with an average size of 250 Mb, and this phenomenon did not occur.

zbh200937 avatar Jun 12 '25 02:06 zbh200937

Perhaps you could modify the code in module/pan_run_hite_single.py responsible for removing the temporary directory to:

if os.path.exists(temp_dir) and debug != 1:
    shutil.rmtree(temp_dir)

Then, when running the command, add the --debug 1 argument. This way, the temporary directory will be retained, which can help with debugging and identifying the source of errors.
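The suggestion above can be sketched as a self-contained script. How panHiTE actually parses and forwards --debug is an assumption here; the helper names cleanup and parse_args are hypothetical and only illustrate the conditional removal.

```python
import argparse
import os
import shutil
import tempfile


def cleanup(temp_dir, debug):
    """Remove temp_dir unless debug mode is on. Returns True if removed."""
    if os.path.exists(temp_dir) and debug != 1:
        shutil.rmtree(temp_dir)
        return True
    return False


def parse_args(argv=None):
    """Parse a --debug flag that takes 0 (default) or 1."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--debug", type=int, default=0, choices=[0, 1])
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    temp_dir = tempfile.mkdtemp(prefix="hite_debug_demo_")
    removed = cleanup(temp_dir, args.debug)
    print("removed" if removed else "kept", temp_dir)
```

With --debug 1 the directory is kept and its path is printed, so the log files inside can be inspected after the run.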

CSU-KangHu avatar Jun 12 '25 02:06 CSU-KangHu

Just now I reran the panHiTE pipeline and noticed an issue: the genome size decreased when it entered the temporary working directory. Figure 1 shows its full size,

Image

while Figure 2 shows its size in the temporary working directory.

Image

zbh200937 avatar Jun 12 '25 02:06 zbh200937

That's really strange. As you can see in line 323 of main.py, we use shutil.copy2 to copy the genome file into the temporary directory. In theory, this operation shouldn't change the size of the genome file.

Image
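To rule the copy step in or out, the copy can be followed by a size comparison. This is a minimal sketch, not the code in main.py. The function name copy_and_verify is hypothetical, and a mismatch would point to something outside shutil.copy2 itself, such as a full disk or a source file that was still being written.

```python
import os
import shutil


def copy_and_verify(src, dst_dir):
    """Copy src into dst_dir with shutil.copy2 and check the byte size.

    Returns the path of the copy; raises OSError on a size mismatch.
    """
    dst = shutil.copy2(src, dst_dir)  # returns the path of the new file
    src_size = os.path.getsize(src)
    dst_size = os.path.getsize(dst)
    if src_size != dst_size:
        raise OSError(
            f"size mismatch after copy: {src} is {src_size} bytes, "
            f"{dst} is {dst_size} bytes"
        )
    return dst
```

Logging the two sizes at this point would also show whether the genome was already smaller when it was read, or only after it reached the temporary directory.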

CSU-KangHu avatar Jun 12 '25 02:06 CSU-KangHu

This is indeed very strange; I didn't modify this piece of code.

Image

I also encountered this issue when checking the panHiTE process that timed out last time.

Image

zbh200937 avatar Jun 12 '25 06:06 zbh200937

A small issue: although work_dir is set in the panHiTE workflow, pan_recover_low_copy_TEs still creates temporary working folders in tmp.

Image

Image
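How pan_recover_low_copy_TEs creates its temporary folders is not shown in this thread, so the following is only a sketch of one common way to keep them under a chosen directory in Python. The helper name make_temp_dir and the prefix are hypothetical. When no dir is given, the tempfile module falls back to the system default location, which can also be redirected with the TMPDIR environment variable.

```python
import os
import tempfile


def make_temp_dir(work_dir, prefix="pan_tmp_"):
    """Create a uniquely named temporary directory inside work_dir."""
    os.makedirs(work_dir, exist_ok=True)
    # dir= overrides the default location (normally taken from TMPDIR).
    return tempfile.mkdtemp(prefix=prefix, dir=work_dir)
```

Passing the workflow's work_dir to a helper like this would keep all intermediate files on the same filesystem as the rest of the run.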

zbh200937 avatar Jun 20 '25 06:06 zbh200937