sift4g

Creating local database is always interrupted at aligning step

Open clavedec opened this issue 1 year ago • 1 comments

Hello,

I have been trying to use make-SIFT-db-all.pl to create a database for chiLan. It was going well, and files were being created in the singleRecords, fasta and subst directories (the others are empty). However, I keep getting an email saying the Slurm job has failed with 'Exit code 255', usually after 11-12 hours of running, at the step "Aligning queries with candidate sequences". The last run got as far as:

** Aligning queries with candidate sequences ** ... processing database part 1 (size ~1.00 GB): 47.50/100.00%

Since all the files had been created, I decided to run:

~/sift4g/bin/sift4g -d /full_path/scripts_to_build_SIFT_db/GCF_009829145.1/protein.faa -q /full_path/scripts_to_build_SIFT_db/all_prot.fasta --subst /full_path/scripts_to_build_SIFT_db/subst --out /full_path/scripts_to_build_SIFT_db/SIFT_predictions --sub-results

But the alignment does not advance beyond 47.50% and stops with 'Segmentation fault (core dumped)'. Although it looks like a memory problem, the job is using less memory than I allocated for it. Any suggestions about what could be happening?

Following a previous issue, I am sharing all_prot.fasta and the config file I used for make-SIFT-db-all.pl at the following link.

Thank you very much for your help!

Best wishes, Clarissa

clavedec avatar Jul 08 '24 13:07 clavedec

Hello,

I encountered the same problem when running the program under Slurm. I had removed all abnormal protein codes (e.g., X) beforehand.

I tried:

  1. Increasing memory to 1 TB (same error)
  2. Removing proteins longer than 35,000 from all_prot.fasta (same error)
  3. Removing proteins longer than 15,000 from all_prot.fasta (no error)
  4. Testing sequences longer than 35,000 individually (same error)
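
For anyone who wants to try the same workaround, here is a minimal Python sketch of the length filter from step 3. It is not part of sift4g or the SIFT database scripts; the function names (`read_fasta`, `split_by_length`, `filter_fasta_file`) and the 60-column line wrap are my own assumptions, and the 15,000 cutoff is only the value that happened to work on my data.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(records, max_len):
    """Split records into (kept, dropped); kept has len(seq) <= max_len."""
    kept, dropped = [], []
    for header, seq in records:
        (kept if len(seq) <= max_len else dropped).append((header, seq))
    return kept, dropped


def filter_fasta_file(in_path, out_path, max_len=15000, width=60):
    """Write sequences no longer than max_len to out_path.

    Returns the headers of the dropped sequences so they can be logged.
    max_len=15000 is an empirical cutoff, not a documented sift4g limit.
    """
    with open(in_path) as src:
        kept, dropped = split_by_length(read_fasta(src), max_len)
    with open(out_path, "w") as dst:
        for header, seq in kept:
            dst.write(header + "\n")
            for i in range(0, len(seq), width):
                dst.write(seq[i:i + width] + "\n")
    return [header for header, _ in dropped]
```

Calling `filter_fasta_file("all_prot.fasta", "all_prot.filtered.fasta")` would write the filtered file (the output name is hypothetical) and return the headers of the proteins that were left out.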

My protein sequence length distribution was:

  Length range     Number of proteins
  0-8,999          67,873
  15,000-15,999    1
  26,000-26,999    2
  35,000-35,999    1
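
A distribution like this can be computed with a few lines of Python. The sketch below is only an illustration under my own assumptions: the helper names are hypothetical, and it uses fixed 1,000-residue bins, so the short sequences would appear as several bins rather than the single 0-8,999 range I reported above.

```python
from collections import Counter


def sequence_lengths(lines):
    """Yield the length of each sequence in an iterable of FASTA lines."""
    length = None
    for line in lines:
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line.strip())
    if length is not None:
        yield length


def length_histogram(lengths, bin_size=1000):
    """Count sequences per bin; keys are the lower bound of each bin."""
    counts = Counter((n // bin_size) * bin_size for n in lengths)
    return dict(sorted(counts.items()))


def format_histogram(hist, bin_size=1000):
    """Render bins as 'start-end: count' strings, one per non-empty bin."""
    return [
        f"{start:,}-{start + bin_size - 1:,}: {count:,}"
        for start, count in hist.items()
    ]
```

With an open file handle `fh` on all_prot.fasta, `format_histogram(length_histogram(sequence_lengths(fh)))` gives one line per non-empty bin, which makes the few very long proteins easy to spot.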

My guess is that a chunk is running out of its allocated memory. I hope this helps the developers suggest a way to handle proteins longer than 15,000, or fix the bug.

Thank you.

Best wishes, Chandler

ChandlerJun avatar Aug 15 '24 05:08 ChandlerJun