foldcomp
Extraction of `FASTA` adds unnecessary `.pdb` extension if absent, which leads to inconsistencies
For example, when I extract FASTA from afdb_swissprot_v4:
foldcomp_id = AF-B1YUJ2-F1-model_v4
fasta_header = AF-B1YUJ2-F1-model_v4.pdb
When I extract FASTA from my personal db:
foldcomp_id = MIP_00183643.pdb
fasta_header = MIP_00183643.pdb
I cannot use FASTA headers to query the database consistently.
I'm sorry for the inconsistency. This happens only with the databases we provide, where I removed ".pdb" from the .lookup files. By default, foldcomp keeps the extension in the ID, as in your personal db example (it saves the file name). For the afdb databases we provide, we removed the pdb extensions to save space and make scripting easier, but I think that resulted in this inconsistency. Appending the ".pdb" extension to the second column of the lookup file, as in the script below, should help.
# Appending pdb extensions
awk -F '\t' 'BEGIN {OFS="\t";} {$2=$2 ".pdb"; print;}' afdb_swissprot_v4.lookup > afdb_swissprot_v4.new.lookup
# Replace afdb_swissprot_v4.lookup with afdb_swissprot_v4.new.lookup
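If the lookup file might already contain a mix of IDs with and without the extension, an idempotent variant of the same idea may be safer. The following is only a sketch in Python, assuming (as the awk command does) that the lookup file is tab-separated with the name in the second column; `with_pdb_extension` is a hypothetical helper, not part of foldcomp, and the sample lines are made up for illustration.

```python
def with_pdb_extension(lookup_line: str) -> str:
    """Append ".pdb" to the second tab-separated column unless it is already there."""
    fields = lookup_line.rstrip("\n").split("\t")
    # Leave lines with fewer than two columns untouched.
    if len(fields) > 1 and not fields[1].endswith(".pdb"):
        fields[1] += ".pdb"
    return "\t".join(fields)


# Illustrative lines only; the first and third column values are assumptions.
for line in ["0\tAF-B1YUJ2-F1-model_v4\t0", "1\tMIP_00183643.pdb\t0"]:
    print(with_pdb_extension(line))
```

Running it twice over the same file leaves the second pass as a no-op, unlike the plain awk version, which would append `.pdb` again.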
Proposed behaviour
Remove all file suffixes after the first `.`. Throw a warning that `.` is not allowed in filenames, as it might result in duplicates.
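To make the proposal concrete, here is a minimal sketch of how such ID normalization could look. It is not foldcomp's implementation; `normalize_id` is a hypothetical name, and warning only when more than one `.` is present is my reading of the proposal.

```python
import warnings


def normalize_id(filename: str) -> str:
    """Return the part of `filename` before the first '.'."""
    stem, _, rest = filename.partition(".")
    # A second '.' means more than a plain extension was cut off,
    # so two different files could end up with the same ID.
    if "." in rest:
        warnings.warn(
            f"{filename!r} contains '.' in its name; "
            f"the ID is truncated to {stem!r} and may collide with other entries"
        )
    return stem


print(normalize_id("AF-B1YUJ2-F1-model_v4.pdb"))
print(normalize_id("MIP_00183643.pdb"))
```

With this rule, both databases from the report would yield IDs without the extension, so FASTA headers and database IDs would match regardless of how the database was built.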