Extraction of `FASTA` adds unnecessary `.pdb` extension if absent, which leads to inconsistencies

Open valentynbez opened this issue 1 year ago • 2 comments

I.e., when I extract FASTA from afdb_swissprot_v4:

foldcomp_id = AF-B1YUJ2-F1-model_v4
fasta_header = AF-B1YUJ2-F1-model_v4.pdb

When I extract FASTA from my personal db:

foldcomp_id = MIP_00183643.pdb
fasta_header = MIP_00183643.pdb

I cannot use FASTA headers to query the database consistently.

valentynbez • Dec 27 '23 18:12

I'm sorry for the inconsistency. This happens only with the databases we provide, where I removed ".pdb" from the .lookup files. By default, foldcomp saves the file name as the ID, extension included, as in the example from your personal db. For the afdb databases we provide, we removed the .pdb extensions to save space and make scripting easier, but I think that led to this inconsistency. Appending the ".pdb" extension to the second column of the lookup file, as in the script below, should help.

# Appending pdb extensions
awk -F '\t' 'BEGIN {OFS="\t";} {$2=$2 ".pdb"; print;}' afdb_swissprot_v4.lookup > afdb_swissprot_v4.new.lookup
# Replace afdb_swissprot_v4.lookup with afdb_swissprot_v4.new.lookup

khb7840 • Dec 29 '23 03:12

Proposed behaviour

Remove all file suffixes after the first `.`. Throw a warning that `.` is not allowed in filenames, as it might result in duplicates.
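
To make the proposal concrete, here is a minimal Python sketch of one possible reading of it. It is not foldcomp code: `normalize_id` and `find_collisions` are hypothetical helper names, and the choice to warn only when two filenames collapse to the same ID is an assumption.

```python
import warnings


def normalize_id(filename: str) -> str:
    """Return the part of `filename` before the first '.' (hypothetical helper)."""
    stem, _, _ = filename.partition(".")
    return stem


def find_collisions(filenames):
    """Group filenames by normalized ID and return only the groups with duplicates.

    A warning is emitted for each ID that more than one filename maps to.
    """
    groups = {}
    for name in filenames:
        groups.setdefault(normalize_id(name), []).append(name)

    collisions = {key: names for key, names in groups.items() if len(names) > 1}
    for key, names in collisions.items():
        warnings.warn(f"ID {key!r} is shared by {names}; '.' in filenames causes duplicates")
    return collisions
```

With this rule, both `MIP_00183643.pdb` and `AF-B1YUJ2-F1-model_v4` keep a suffix-free ID, so the FASTA header and the database ID would match, while names such as `a.pdb` and `a.cif` would be reported as a collision.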

valentynbez • Apr 01 '24 11:04