Subsetting databases

Open patrickbryant1 opened this issue 1 year ago • 7 comments

Hi,

Thank you for the great resource!

I am having trouble subsetting databases and decompressing subsets of the databases you provide here: https://foldcomp.steineggerlab.workers.dev

According to the instructions, I should be able to decompress a subset of a database given an "id_list.txt".

This is how I do it for e.g. A. thaliana:

head -n 1 data/a_thaliana.lookup
0 AF-A0A178UFC4-F1-model_v4.pdb 0

As I understand it, the ID here is "AF-A0A178UFC4-F1-model_v4".

Now, I write this into a file called id_list.txt, then I run the command:

foldcomp decompress --id-list id_list.txt data/a_thaliana

with the response:

Decompressing files in data/a_thaliana using 1 threads
Output directory: data/a_thaliana_pdb/
[Warning] AF-A0A178UFC4-F1-model_v4 not found in database.
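For reference, the ID list described above could be generated from the lookup file rather than typed by hand. The following is only a minimal Python sketch: it assumes the .lookup file is whitespace-separated with the entry name in the second column (as the `head` output above suggests), and the helper names `read_lookup_names` and `write_id_list` are hypothetical, not part of Foldcomp. It does not address the warning itself; it only makes it easy to try the name with and without the ".pdb" suffix.

```python
from pathlib import Path


def read_lookup_names(lookup_path):
    """Return the second column (entry name) of every line in a .lookup file.

    Assumption: columns are whitespace-separated as "key name file_number".
    """
    names = []
    for line in Path(lookup_path).read_text().splitlines():
        fields = line.split()
        if len(fields) >= 2:
            names.append(fields[1])
    return names


def write_id_list(names, out_path, strip_suffix=""):
    """Write one ID per line; optionally drop a trailing suffix such as '.pdb'."""
    ids = []
    for name in names:
        if strip_suffix and name.endswith(strip_suffix):
            name = name[: -len(strip_suffix)]
        ids.append(name)
    Path(out_path).write_text("\n".join(ids) + "\n")
    return ids


# Example usage (paths taken from the commands in this post):
# names = read_lookup_names("data/a_thaliana.lookup")
# write_id_list(names[:1], "id_list.txt")                       # name as stored
# write_id_list(names[:1], "id_list.txt", strip_suffix=".pdb")  # name without suffix
```

Either variant of id_list.txt can then be passed to the `foldcomp decompress --id-list` command shown above.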

I have tried many different ways of naming the IDs based on what is in a_thaliana.lookup, but nothing seems to work. The same happens when using mmseqs to subset the database:

"""
createsubdb --subdb-mode 0 --id-mode 1 id_list.txt a_thaliana test_sel/output_foldcomp_db

MMseqs Version: ad6dfc66d7bbc4fd626fc19adf10ba587bc137c4
Subdb mode 0
Database ID mode 1
Verbosity 3

Could not find name AF-A0A178UFC4-F1-model_v4 in lookup
Time for merging to output_foldcomp_db: 0h 0m 0s 1ms
Time for processing: 0h 0m 0s 34ms
"""

Can you please explain what I am doing wrong and how to properly specify the IDs?

Best,

Patrick

patrickbryant1, Aug 09 '23 07:08

I noticed that this seems to work with afdb_rep_v4. Perhaps something is missing from the reference genomes?

patrickbryant1, Aug 09 '23 08:08

I'm sorry, there was a bug in how the mode for database reading was assigned. Thank you for reporting this; please check whether it is solved in the latest version.

khb7840, Aug 09 '23 11:08

Hi, great, thanks. What do you mean by the latest version:

  1. Of the database from https://foldcomp.steineggerlab.workers.dev
  2. Of Foldcomp
  3. Something else(?)

patrickbryant1, Aug 09 '23 12:08

The latest version of Foldcomp. Subsetting 'a_thaliana' should work with Foldcomp built from the latest commit.

khb7840, Aug 09 '23 13:08

Ok, great. Does this include the binaries you distribute, or only the pip installation/git clone? Do you know why mmseqs2 seems to fail on the same files? Is there something missing in the subsetting instructions there as well?

patrickbryant1, Aug 09 '23 14:08

Please use git clone to get the latest update. The Python distribution has not yet been updated with the latest commit. As for the mmseqs2 part, I'm not sure what happened. I'll check this with the mmseqs2 developers.

khb7840, Aug 09 '23 14:08

Ok, thanks for the help!

patrickbryant1, Aug 09 '23 14:08