Filtering with length and pLDDT
Hi,
I want to follow Proteina's processing to extract a subset of afdb_rep:
- minimum average pLDDT of 80, minimum length of 256, and maximum length of 512/768
I used Python to iterate over the database, but did not find pLDDT among the keys returned by the get_data() interface:
- dict_keys(['phi', 'psi', 'omega', 'torsion_angles', 'bond_angles', 'residues', 'b_factors', 'coordinates'])
Since afdb_rep is quite large (2.3M instances), could you provide some example code showing how to efficiently filter by average pLDDT and length?
Thanks
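AlphaFold models store the per-residue pLDDT in the B-factor field, so the 'b_factors' entry listed above is the likely place to read it from. Below is a minimal sketch of the length and average-pLDDT check, assuming 'b_factors' holds one value per residue and 'residues' holds one item per residue. The helper name passes_filter and its default cutoffs are illustrative and not part of foldcomp.

```python
from statistics import fmean


def passes_filter(data, min_plddt=80.0, min_len=256, max_len=512):
    """Return True if an entry meets the length and average-pLDDT cutoffs.

    `data` is assumed to look like the dict returned by get_data():
    'residues' has one item per residue and 'b_factors' carries the
    per-residue pLDDT (an assumption based on AlphaFold's B-factor usage).
    """
    length = len(data["residues"])
    if not (min_len <= length <= max_len):
        return False  # cheap length check first, before averaging
    b_factors = data["b_factors"]
    if len(b_factors) == 0:
        return False  # nothing to average
    return fmean(b_factors) >= min_plddt


# Made-up entry, only to show the call.
example = {"residues": "A" * 300, "b_factors": [85.0] * 300}
print(passes_filter(example))
```

Inside the database loop, the same function would be applied to the get_data() result of each entry. Checking the length before averaging keeps the per-entry cost low, although every entry still has to be read once.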
Hi,
I downloaded the metadata from https://afdb-cluster.steineggerlab.workers.dev/
and filtered it with the following command:
awk -F'\t' '$4 >= 256 && $4 <= 768 && $6 >= 80 {print $1}' \
2-repId_isDark_nMem_repLen_avgLen_repPlddt_avgPlddt_LCAtaxId.tsv \
> selected_rep_ids.txt
This yields a file of representative IDs such as A0A6A4IZ81. I tried to read the database with these IDs, but no data is returned:
with foldcomp.open(f"{data_dir}/afdb_rep_v4", ids=ids[:10]) as db:  # For whole database, use foldcomp.open("afdb_rep_v4")
    for name, pdb in tqdm(db):
        ...
0it [00:00, ?it/s]
Can you help verify the correctness of the code above?
Thanks
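For reference, the same metadata filter can be sketched in Python, which makes the column choice explicit. The column order is read off the file name (repId, isDark, nMem, repLen, avgLen, repPlddt, avgPlddt, LCAtaxId). The function name select_rep_ids is hypothetical, the sample rows are made up, and the bounds are treated as inclusive minimums and maximums.

```python
import csv

# 0-based column positions, taken from the order of fields in the file name:
# repId, isDark, nMem, repLen, avgLen, repPlddt, avgPlddt, LCAtaxId
REP_ID, REP_LEN, REP_PLDDT = 0, 3, 5


def select_rep_ids(lines, min_len=256, max_len=768, min_plddt=80.0):
    """Return representative IDs whose length and pLDDT pass the cutoffs."""
    selected = []
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) <= REP_PLDDT:
            continue  # skip blank or truncated rows
        try:
            length = float(row[REP_LEN])
            plddt = float(row[REP_PLDDT])
        except ValueError:
            continue  # skip a header line or other non-numeric rows
        if min_len <= length <= max_len and plddt >= min_plddt:
            selected.append(row[REP_ID])
    return selected


# Made-up rows in the assumed column layout, only to show the call.
sample = [
    "A0A6A4IZ81\t0\t3\t300\t295.0\t91.5\t88.0\t9606",
    "P12345\t0\t1\t120\t120.0\t95.0\t95.0\t9606",
    "Q99999\t1\t2\t400\t390.0\t62.3\t60.1\t562",
]
print(select_rep_ids(sample))
```

For the real file, an open file handle can be passed as `lines`, since csv.reader accepts any iterable of text lines.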
I also tried this ID format, but it still returns nothing:
with open(f"{data_dir}/selected_rep_ids.txt") as f:
    ids = [f"AF-{line.strip()}-F1-model_v4.cif.gz" for line in f if line.strip()]
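One way to avoid guessing the exact ID string is to open the database without ids, print the first few entry names to see how they are stored, and match entries by the accession embedded in each name. The sketch below assumes the names follow the AlphaFold pattern AF-<accession>-F<n>-model_v<n>, with or without a file extension. That pattern is an assumption to be checked against the real names, and accession_of and filter_entries are hypothetical helpers.

```python
import re

# Assumed AlphaFold-style entry name; any suffix such as ".cif.gz" is ignored.
AF_NAME = re.compile(r"AF-(?P<acc>[A-Za-z0-9]+)-F\d+-model_v\d+")


def accession_of(name):
    """Extract the accession from an entry name, or None if it does not match."""
    match = AF_NAME.search(name)
    return match.group("acc") if match else None


def filter_entries(entries, wanted):
    """Yield (name, payload) pairs whose accession is in `wanted`."""
    wanted = set(wanted)  # set lookup keeps the per-entry check cheap
    for name, payload in entries:
        if accession_of(name) in wanted:
            yield name, payload


# Made-up entries standing in for the (name, pdb) pairs from the database.
demo = [
    ("AF-A0A6A4IZ81-F1-model_v4", "..."),
    ("AF-P12345-F1-model_v4.cif.gz", "..."),
]
print([name for name, _ in filter_entries(demo, {"A0A6A4IZ81"})])
```

This scans every entry once, so it is slower than a direct ids lookup. Once the stored name format is confirmed, the ids list can be rebuilt in that exact format instead.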