Adding the contam_ prefix doesn't work with an existing database. It only works when downloading the database using Philosopher
This bug was reported by email. I am posting it here so that we don't forget it, and so that potentially affected users know about it: human contaminant proteins are never marked with the contam_ prefix, even when the original FASTA file is not from the human proteome.
Steps to reproduce this bug:
- Generate a test.fasta file with the following entry:
>sp|OABCD|Test_A Keratin, type I cytoskeletal 15 OS=Ovis aries GN=KRT15 PE=2 SV=1
TESTABCD
- Run the following Philosopher commands:
philosopher.exe workspace --clean
philosopher.exe workspace --init
philosopher.exe database --custom test.fasta --contam --contamprefix --nodecoys
philosopher.exe workspace --clean
The test.fasta file and the output from Philosopher are attached for reference:
dev.zip
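To check an output database for this bug without reading it by eye, a small script along these lines could help. It is only a sketch and not part of Philosopher: the function name `unmarked_human_headers` is made up for illustration, and it assumes UniProt-style headers containing `OS=Homo sapiens`, the default `rev_` decoy prefix, and a non-human input database, so that every human entry must have come from the contaminant set.

```python
from typing import Iterable, List

DECOY_PREFIX = "rev_"       # Philosopher's default decoy prefix
CONTAM_PREFIX = "contam_"   # prefix expected on contaminant entries


def unmarked_human_headers(lines: Iterable[str]) -> List[str]:
    """Return FASTA headers of human entries that lack the contam_ prefix.

    Hypothetical helper. Assumption: the input database was non-human,
    so any 'OS=Homo sapiens' entry must be a contaminant.
    """
    missing = []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        header = line[1:].strip()
        # Look past an optional decoy prefix before checking for contam_.
        if header.startswith(DECOY_PREFIX):
            name = header[len(DECOY_PREFIX):]
        else:
            name = header
        if "OS=Homo sapiens" in name and not name.startswith(CONTAM_PREFIX):
            missing.append(header)
    return missing
```

Passing an open file handle of the database written by Philosopher to `unmarked_human_headers` returns the offending headers; an empty list means no unmarked human entries were found.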
Best,
Fengchao
Just an update. Philosopher-v4.5.1-RC21 still has this issue.
Best,
Fengchao
Can you give me an example? I prepared a human database just today, and no human protein is marked as contaminant.
When you prepare a non-human database, the human contaminants are NOT marked as contaminants, although they should be.
See my first message in this thread to reproduce it.
Best,
Fengchao
I just tested with E. coli:
>rev_contam_sp|Q02958|KRA61_SHEEP Keratin-associated protein 6-1 OS=Ovis aries OX=9940 GN=KRTAP6-1 PE=1 SV=2
>rev_contam_sp|Q06830|PRDX1_HUMAN Peroxiredoxin-1 OS=Homo sapiens OX=9606 GN=PRDX1 PE=1 SV=1
>rev_contam_sp|Q10735|PEPB_PIG Pepsin B (Fragment) OS=Sus scrofa OX=9823 GN=PGB PE=1 SV=1
>rev_contam_sp|Q14525|KT33B_HUMAN Keratin, type I cuticular Ha3-II OS=Homo sapiens OX=9606 GN=KRT33B PE=1 SV=3
>rev_contam_sp|Q14532|K1H2_HUMAN Keratin, type I cuticular Ha2 OS=Homo sapiens OX=9606 GN=KRT32 PE=1 SV=3
>rev_contam_sp|Q14533|KRT81_HUMAN Keratin, type II cuticular Hb1 OS=Homo sapiens OX=9606 GN=KRT81 PE=1 SV=3
>rev_contam_sp|Q15323|K1H1_HUMAN Keratin, type I cuticular Ha1 OS=Homo sapiens OX=9606 GN=KRT31 PE=1 SV=3
>rev_contam_sp|Q15843|NEDD8_HUMAN NEDD8 OS=Homo sapiens OX=9606 GN=NEDD8 PE=1 SV=1
>rev_contam_sp|Q29463|TRY2_BOVIN Anionic trypsin OS=Bos taurus OX=9913 PE=2 SV=1
>rev_contam_sp|Q7M135|LYSC_LYSEN Lysyl endopeptidase OS=Lysobacter enzymogenes OX=69 PE=1 SV=1
>rev_contam_sp|Q92764|KRT35_HUMAN Keratin, type I cuticular Ha5 OS=Homo sapiens OX=9606 GN=KRT35 PE=1 SV=5
>rev_contam_sp|Q9NSB2|KRT84_HUMAN Keratin, type II cuticular Hb4 OS=Homo sapiens OX=9606 GN=KRT84 PE=2 SV=2
>rev_contam_sp|Q9NSB4|KRT82_HUMAN Keratin, type II cuticular Hb2 OS=Homo sapiens OX=9606 GN=KRT82 PE=1 SV=3
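For reference, the header layout shown above can be taken apart roughly as follows. This is a sketch under assumptions: `parse_header` and its regular expressions are illustrative and are not Philosopher code. They assume the `rev_` decoy prefix, the `contam_` prefix, and UniProt `sp|` or `tr|` headers in which the organism sits between `OS=` and `OX=`.

```python
import re
from typing import Dict, Optional

# Optional decoy prefix, optional contaminant prefix, then db|accession|entry.
HEADER_RE = re.compile(
    r"^>?(?P<decoy>rev_)?(?P<contam>contam_)?"
    r"(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry>\S+)"
)
# Organism name as written between OS= and OX= in recent UniProt headers.
ORGANISM_RE = re.compile(r"OS=(?P<organism>.+?)\s+OX=")


def parse_header(header: str) -> Optional[Dict[str, object]]:
    """Split a FASTA header into its prefix flags and UniProt fields.

    Hypothetical helper; returns None for headers that are not UniProt-style.
    """
    match = HEADER_RE.match(header.strip())
    if match is None:
        return None
    organism = ORGANISM_RE.search(header)
    return {
        "decoy": match.group("decoy") is not None,
        "contaminant": match.group("contam") is not None,
        "accession": match.group("accession"),
        "entry": match.group("entry"),
        "organism": organism.group("organism") if organism else None,
    }
```

Applied to the headers above, every record would come back with both `decoy` and `contaminant` set to true, which is the expected result for decoy copies of contaminant entries.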
Can you try this fasta file: test.zip
Organism-based prefix tagging only works with UniProt records; UniProt-like records won't work.
Fine. This real E. coli protein from UniProt also triggers the bug: real_uniprot_ecoli_protein.zip
Did you download it with Philosopher? This dynamic tagging only works when you use the --id flag.
I used my own E. coli FASTA file and added decoys and contaminants using Philosopher. The human contaminants were then not marked as contaminants.
this dynamic tagging only works when you use the --id flag
Do you mean that it only works when downloading the FASTA file using Philosopher, and that it won't work when adding decoys and contaminants to an existing FASTA file? If so, you should clarify this in the Philosopher documentation and tutorial. As it stands, it is misleading.
Best,
Fengchao
It will only work when you download from UniProt. Can you point me to the documentation that says otherwise? As far as I know, the documentation hasn't been updated yet because we have not released this feature.
In the released version 4.4.0, the --contamprefix line in the help text below is not accurate. Philosopher cannot mark the contaminant sequences in an existing FASTA file; it can only mark them when it downloads the FASTA file itself.
Target-Decoy database formatting
Usage:
  philosopher database [flags]
Flags:
      --add string       add custom sequences (UniProt FASTA format only)
      --annotate string  process a ready-to-use database
      --contam           add common contaminants
      --contamprefix     mark the contaminant sequences with a prefix tag
      --custom string    use a pre-formatted custom database
      --enzyme string    enzyme for digestion (trypsin, lys_c, lys_n, glu_c, chymotrypsin) (default "trypsin")
  -h, --help             help for database
      --id string        UniProt proteome ID
      --isoform          add isoform sequences
      --nodecoys         don't add decoys to the database
      --prefix string    define a decoy prefix (default "rev_")
      --reviewed         use only reviwed sequences from Swiss-Prot
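Until the tagging works for existing FASTA files, the prefix could be added by hand in a post-processing step. The following is only a rough sketch of that idea, not something Philosopher provides: `tag_contaminants` is a hypothetical helper, the set of contaminant accessions has to be supplied by the user, and the code assumes `db|accession|entry` headers and the default `rev_` decoy prefix.

```python
from typing import Iterable, Iterator, Set

CONTAM_PREFIX = "contam_"


def tag_contaminants(
    lines: Iterable[str],
    contaminant_accessions: Set[str],
    decoy_prefix: str = "rev_",
) -> Iterator[str]:
    """Yield FASTA lines with contam_ added to the listed accessions.

    Hypothetical workaround. Assumption: headers look like
    '>db|accession|entry ...', optionally preceded by the decoy prefix.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            header = line[1:]
            # Keep a leading decoy prefix in front of the contaminant prefix.
            lead = ""
            if header.startswith(decoy_prefix):
                lead = decoy_prefix
                header = header[len(decoy_prefix):]
            fields = header.split("|")
            accession = fields[1] if len(fields) >= 3 else ""
            already_tagged = header.startswith(CONTAM_PREFIX)
            if accession in contaminant_accessions and not already_tagged:
                header = CONTAM_PREFIX + header
            line = ">" + lead + header
        yield line
```

Sequence lines pass through unchanged, and headers that already carry the prefix are left alone, so running the function twice should not double-tag an entry.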
Best,
Fengchao
Yes, only for downloaded sequences. I can add some more details to the description. Can you point me to the tutorial please?
The "tutorial" I mentioned is the wiki page here https://github.com/Nesvilab/philosopher/wiki/Database
Best,
Fengchao
Thanks. As I mentioned, the flag and the dynamic tagging will be fully explained when we release the new version. I can add the prefix flag for now, but the rest will only come later.
The bug is still there: https://github.com/Nesvilab/philosopher/issues/528, https://github.com/Nesvilab/philosopher/issues/529, https://github.com/Nesvilab/philosopher/issues/516, https://github.com/Nesvilab/philosopher/issues/510#issuecomment-2483454242, and https://github.com/Nesvilab/philosopher/issues/498