
filtering with length and plddt

Open Atomu2014 opened this issue 2 months ago • 2 comments

Hi,

I want to follow the Proteina preprocessing to extract a subset of afdb_rep:

- minimum average pLDDT of 80, minimum length of 256, and maximum length of 512 or 768

I used Python to iterate over the database, but the get_data() interface does not expose a plddt key:

- dict_keys(['phi', 'psi', 'omega', 'torsion_angles', 'bond_angles', 'residues', 'b_factors', 'coordinates'])

Since afdb_rep is quite large (2.3M entries), could you provide example code showing how to filter efficiently by average pLDDT and length?
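To make the request concrete, here is a minimal sketch of the filter I have in mind. It assumes that the b_factors entry returned by get_data() carries the pLDDT values (AlphaFold models store pLDDT in the B-factor column) and that residues is the amino acid sequence. The names passes_filter and iter_selected are hypothetical helpers, not part of foldcomp.

```python
from statistics import fmean


def passes_filter(data, min_len=256, max_len=768, min_plddt=80.0):
    """Return True if a get_data()-style dict meets the length and pLDDT cutoffs.

    Assumption: data["b_factors"] holds pLDDT values and data["residues"]
    is the sequence, so its length is the chain length.
    """
    length = len(data["residues"])
    if not (min_len <= length <= max_len):
        return False
    b_factors = data["b_factors"]
    if len(b_factors) == 0:
        return False
    return fmean(b_factors) >= min_plddt


def iter_selected(db_path, **cutoffs):
    """Yield names of database entries that pass the filter (needs foldcomp)."""
    import foldcomp  # imported lazily so passes_filter can be used without it

    with foldcomp.open(db_path) as db:
        for name, pdb in db:
            # get_data() accepts the decompressed PDB text as well as FCZ bytes
            if passes_filter(foldcomp.get_data(pdb), **cutoffs):
                yield name
```

Calling list(iter_selected(f"{data_dir}/afdb_rep_v4", max_len=512)) would scan every entry, which is why a cheaper route for 2.3M structures would be welcome.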

Thanks

Atomu2014, Oct 22 '25 23:10

Hi,

I downloaded the metadata from https://afdb-cluster.steineggerlab.workers.dev/?utm_source=chatgpt.com

I then filtered the metadata with the following command:

awk -F'\t' '$4 >= 256 && $4 <= 768 && $6 >= 80 {print $1}' \
2-repId_isDark_nMem_repLen_avgLen_repPlddt_avgPlddt_LCAtaxId.tsv \
> selected_rep_ids.txt

The awk filter yields a file of repIds such as A0A6A4IZ81. I tried to read the database with these IDs, but no data is returned:

import foldcomp
from tqdm import tqdm

with foldcomp.open(f"{data_dir}/afdb_rep_v4", ids=ids[:10]) as db:  # For the whole database, use foldcomp.open("afdb_rep_v4")
    for name, pdb in tqdm(db):
        ...

0it [00:00, ?it/s]

Can you help verify that the code above is correct?
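To cross-check the awk filter, the same selection can be sketched in Python. The column positions are an assumption read off the file name (repId in column 1, repLen in column 4, repPlddt in column 6), the cutoffs are treated as inclusive, and select_rep_ids and select_from_file are hypothetical helper names.

```python
import csv


def select_rep_ids(rows, min_len=256, max_len=768, min_plddt=80.0):
    """Return repIds whose repLen and repPlddt fall within inclusive cutoffs.

    Assumed column order: repId, isDark, nMem, repLen, avgLen,
    repPlddt, avgPlddt, LCAtaxId.
    """
    selected = []
    for row in rows:
        if len(row) < 6:
            continue  # skip short or empty lines
        try:
            rep_len = float(row[3])
            rep_plddt = float(row[5])
        except ValueError:
            continue  # skip a header row or malformed values
        if min_len <= rep_len <= max_len and rep_plddt >= min_plddt:
            selected.append(row[0])
    return selected


def select_from_file(tsv_path, **cutoffs):
    """Apply select_rep_ids to a tab-separated metadata file."""
    with open(tsv_path, newline="") as handle:
        return select_rep_ids(csv.reader(handle, delimiter="\t"), **cutoffs)
```

If the list this returns matches selected_rep_ids.txt, the metadata filter is probably fine and the problem is more likely in how the IDs are passed to foldcomp.open.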

Thanks

Atomu2014, Oct 23 '25 00:10

I also tried IDs in this format, but it still returns nothing:

with open(f"{data_dir}/selected_rep_ids.txt") as f:
    ids = [f"AF-{line.strip()}-F1-model_v4.cif.gz" for line in f if line.strip()]
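Since I am guessing at the naming scheme, one way to narrow this down is to open the database without ids, look at the first few names it reports, and compare them with some candidate formats. This is only a diagnostic sketch: first_names, candidate_ids, and peek_database are hypothetical helpers, and the candidate formats are guesses rather than documented foldcomp conventions.

```python
from itertools import islice


def first_names(entries, n=5):
    """Return the names from the first n (name, payload) pairs of an iterable."""
    return [name for name, _ in islice(entries, n)]


def candidate_ids(accession):
    """Build guessed naming variants for one UniProt accession.

    These formats are assumptions to compare against the names the
    database actually reports; none of them is confirmed.
    """
    stem = f"AF-{accession}-F1-model_v4"
    return [accession, stem, f"{stem}.pdb", f"{stem}.cif", f"{stem}.cif.gz"]


def peek_database(db_path, n=5):
    """Return the first n entry names stored in a foldcomp database."""
    import foldcomp  # imported lazily so the helpers above work without it

    with foldcomp.open(db_path) as db:
        return first_names(db, n)
```

Printing peek_database(f"{data_dir}/afdb_rep_v4") next to candidate_ids("A0A6A4IZ81") should show which format, if any, the ids argument expects.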

Atomu2014, Oct 23 '25 00:10