
Add option to generate a nucl database from the Uniprot core proteins

Open alexhbnr opened this issue 2 years ago • 5 comments

Instead of downloading the protein sequences directly from Uniprot, this PR adds the option to retrieve the corresponding nucleotide sequences from ENA, using metadata that Uniprot provides in XML format.

It iterates over the same input files that are needed to retrieve amino acid sequences from Uniprot. However, instead of directly downloading the FastA file, it downloads the XML file from the Uniprot server. The XML file is parsed using an XML schema provided on the Uniprot website, then the ENA accession IDs for the nucleotide sequences are extracted and the FastA sequences are downloaded.
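As a rough illustration of the parsing step described above, here is a minimal sketch. It is not the code in this PR: it uses the standard-library `xml.etree.ElementTree` instead of `xmlschema`, the example entry is made up, the function names are hypothetical, and both the choice of the `protein sequence ID` property as the ENA accession and the ENA FastA URL pattern are assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace used by Uniprot XML entries.
UNIPROT_NS = {"up": "http://uniprot.org/uniprot"}

# Assumed ENA endpoint for FastA downloads; check the ENA documentation before relying on it.
ENA_FASTA_URL = "https://www.ebi.ac.uk/ena/browser/api/fasta/{accession}"


def extract_ena_accessions(xml_text):
    """Return the ENA/EMBL protein sequence IDs referenced in a Uniprot XML document."""
    root = ET.fromstring(xml_text)
    accessions = []
    # Cross-references to ENA are stored as <dbReference type="EMBL"> elements.
    for ref in root.iterfind(".//up:dbReference[@type='EMBL']", UNIPROT_NS):
        for prop in ref.iterfind("up:property[@type='protein sequence ID']", UNIPROT_NS):
            value = prop.get("value")
            # Entries without a coding sequence carry "-" as a placeholder.
            if value and value != "-":
                accessions.append(value)
    return accessions


def ena_fasta_url(accession):
    """Build the (assumed) download URL for one ENA accession."""
    return ENA_FASTA_URL.format(accession=accession)


# Made-up entry, for illustration only.
EXAMPLE_XML = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>P00000</accession>
    <dbReference type="EMBL" id="AB000001">
      <property type="protein sequence ID" value="BAA00001.1"/>
      <property type="molecule type" value="Genomic_DNA"/>
    </dbReference>
    <dbReference type="PDB" id="1ABC"/>
  </entry>
</uniprot>"""

accessions = extract_ena_accessions(EXAMPLE_XML)
urls = [ena_fasta_url(a) for a in accessions]
```

The actual download (for example with `urllib.request`) is left out so that the sketch runs without network access.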

alexhbnr avatar May 12 '22 15:05 alexhbnr

Thanks Alex for this PR. I tried running the new version of phylophlan_setup_database.py adding the xmlschema package (version 1.10.0 from conda-forge) to my conda env. However, I'm getting the following error:

Traceback (most recent call last):
  File "./phylophlan_setup_database.py", line 25, in <module>
    import xmlschema
  File "/shares/CIBIO-Storage/CM/cmstore/tools/anaconda3/envs/phylophlan-3.0/lib/python3.6/site-packages/xmlschema/__init__.py", line 14, in <module>
    from .resources import normalize_url, normalize_locations, fetch_resource, \
  File "/shares/CIBIO-Storage/CM/cmstore/tools/anaconda3/envs/phylophlan-3.0/lib/python3.6/site-packages/xmlschema/resources.py", line 23, in <module>
    from elementpath import iter_select, XPathContext, XPath2Parser
  File "/shares/CIBIO-Storage/CM/cmstore/tools/anaconda3/envs/phylophlan-3.0/lib/python3.6/site-packages/elementpath/__init__.py", line 18, in <module>
    from .exceptions import ElementPathError, MissingContextError, \
  File "/shares/CIBIO-Storage/CM/cmstore/tools/anaconda3/envs/phylophlan-3.0/lib/python3.6/site-packages/elementpath/exceptions.py", line 12, in <module>
    from .tdop import Token
  File "/shares/CIBIO-Storage/CM/cmstore/tools/anaconda3/envs/phylophlan-3.0/lib/python3.6/site-packages/elementpath/tdop.py", line 405, in <module>
    class Parser(Generic[TK_co], metaclass=ParserMeta):
TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases

and I'm not 100% sure how to fix it. Do you have any idea?
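For reference, Python raises this `TypeError` whenever a class is defined with a metaclass that is unrelated to the metaclass of one of its bases. The sketch below reproduces the message with hypothetical class names; it only illustrates the general mechanism and does not claim to show what happens inside elementpath.

```python
class MetaA(type):
    pass


class MetaB(type):
    pass


class Base(metaclass=MetaA):
    pass


def define_conflicting_class():
    # Base is built by MetaA, but MetaB is neither a subclass nor a
    # superclass of MetaA, so Python cannot pick a single metaclass.
    class Derived(Base, metaclass=MetaB):
        pass

    return Derived


try:
    define_conflicting_class()
    conflict_message = None
except TypeError as exc:
    conflict_message = str(exc)


# One general way to resolve such a conflict: a metaclass that derives from both.
class MetaAB(MetaA, MetaB):
    pass


class Fixed(Base, metaclass=MetaAB):
    pass
```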

fasnicar avatar May 18 '22 07:05 fasnicar

Which exact version of Python are you using on your system, Francesco? I get different results for different versions of Python 3.6, but of course not the same one as you.

alexhbnr avatar May 18 '22 08:05 alexhbnr

I have the 3.6.15 from conda-forge (hb7a2778_0_cpython).

fasnicar avatar May 18 '22 10:05 fasnicar

OK, when I create a fresh Python 3.6.15 conda environment and install xmlschema, I can import it without any issues. I only get an error with Python 3.6.0 itself. I will dig a bit further over the next few days to find out what's going on there.

alexhbnr avatar May 18 '22 15:05 alexhbnr

Hi @fasnicar,

I am very sorry for the long hiatus. This got lost in my long list of to-dos.

I pulled all the recent changes that you added to v3.0.3 into this PR. I installed the latest version of PhyloPhlAn v3.0.3 via conda/mamba into a new environment using the following command: mamba create -n phylophlan_uniprot_test -c bioconda phylophlan=3.0.3

Afterwards, I installed the changes of this PR using pip3: pip3 install -U git+https://github.com/alexhbnr/phylophlan@uniprot_nuclseq

The pip command installed the Python packages xmlschema v2.2.2 and elementpath v4.0.1. When I ran phylophlan_setup_database -h, I didn't get any error message. However, conda/mamba automatically pulled Python version 3.11, and not v3.6, for which you saw the error.

Would you have time to check this PR once more on your system?

alexhbnr avatar Mar 06 '23 13:03 alexhbnr