
Bug Fix: Preserve full FASTA ID in alignment directory parsing to prevent truncation errors

Open juliocesar-io opened this issue 5 months ago • 0 comments

Background

When running inference with run_pretrained_openfold.py and precomputed alignments, the parse_fasta function extracts only part of the FASTA tag/ID that was used to name the alignments output folder. It truncates the ID at the first special character, such as a hyphen (-) or a period (.), both of which are common in FASTA IDs.

This causes the inference to fail, as the partially extracted ID does not match the alignments folder.

For example, if you have a FASTA file like this:

>my-fasta-sequence
AABBCC

Then, after running the precompute_alignments.py script, the following alignments are generated (as expected):

├── input
│   └── fasta_dir
│       └── my-fasta-sequence.fasta
├── output
│   ├── alignments
│   │   └── my-fasta-sequence
│   │       ├── bfd_uniclust_hits.a3m
│   │       ├── hhsearch_output.hhr
│   │       ├── mgnify_hits.sto
│   │       └── uniref90_hits.sto

However, when you run the run_pretrained_openfold.py script with the --use_precomputed_alignments flag, you will encounter the following error:

Traceback (most recent call last):
  File "/opt/openfold/run_pretrained_openfold.py", line 499, in <module>
    main(args)
  File "/opt/openfold/run_pretrained_openfold.py", line 299, in main
    feature_dict = generate_feature_dict(
  File "/opt/openfold/run_pretrained_openfold.py", line 151, in generate_feature_dict
    feature_dict = data_processor.process_fasta(
  File "/opt/openfold/openfold/data/data_pipeline.py", line 883, in process_fasta
    hits = self._parse_template_hit_files(
  File "/opt/openfold/openfold/data/data_pipeline.py", line 795, in _parse_template_hit_files
    for f in os.listdir(alignment_dir):
FileNotFoundError: [Errno 2] No such file or directory: '/run_path/output/alignments/my'

Fix

The error occurs because of the truncation performed by parse_fasta, causing it to look for "my" instead of the expected "my-fasta-sequence". I have updated the parse_fasta function to fix this issue.

Previously, the tags were split with a regex (re.split('\W|\|', t)), which cut each ID off at the first non-word character. For the precomputed-alignments workflow to function correctly, the full ID must be preserved so that it matches the folder name.
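To make the truncation concrete, here is a minimal reproduction of the regex behavior outside of OpenFold. It assumes the first piece of the split is what ends up being used as the tag, which is consistent with the 'my' path in the traceback above; the variable names are only for illustration.

```python
import re

tag = "my-fasta-sequence"

# \W matches any character that is not a letter, digit, or underscore,
# so the hyphen acts as a split point.
pieces = re.split(r"\W|\|", tag)

# Assumption: only the first piece is kept as the tag.
truncated = pieces[0]

print(pieces)     # ['my', 'fasta', 'sequence']
print(truncated)  # my
```

The same thing happens with a period: an ID like seq.v2 would be cut down to seq. Underscores are word characters, so IDs that only use underscores are unaffected.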

Changes:

  • Each entry is now split into the tag (header) and the sequence, while preserving the entire header.
  • The regex splitting that truncated the header has been removed, so the entire line after > is treated as the ID.
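
As a rough sketch of the behavior described in the two points above (not the exact patch), a parser that keeps the whole header could look like the following. The function name parse_fasta_keep_full_id is hypothetical and chosen only for this example; the return shape (a list of tags and a list of sequences) is an assumption about what callers expect.

```python
from typing import List, Tuple


def parse_fasta_keep_full_id(data: str) -> Tuple[List[str], List[str]]:
    """Hypothetical sketch: parse FASTA text, keeping each full header line as the tag."""
    tags: List[str] = []
    seqs: List[str] = []
    for line in data.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Everything after ">" is the ID; no regex splitting is applied.
            tags.append(line[1:].strip())
            seqs.append("")
        elif tags:
            # Sequence lines may wrap, so append to the current entry.
            seqs[-1] += line
    return tags, seqs


tags, seqs = parse_fasta_keep_full_id(">my-fasta-sequence\nAABBCC\n")
print(tags, seqs)  # ['my-fasta-sequence'] ['AABBCC']
```

With this approach the tag matches the my-fasta-sequence alignments folder. One consequence is that a header containing spaces or path separators would also be used verbatim when looking up the folder.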

juliocesar-io, Sep 07 '24 05:09