nosql-biosets
naive index and query scripts for free reference bioinformatics datasets
Project aim and summary
The NoSQL-biosets project includes scripts for indexing and querying selected free bioinformatics datasets.
Elasticsearch and MongoDB are the two databases supported for most datasets in the project. Neo4j and PostgreSQL support was implemented as a third database option for a few datasets, namely IntEnz, PubTator and HGNC.
Datasets supported
Datasets that have received more attention and have more stable support:
-
UniProtKB datasets in XML format:
./nosqlbiosets/uniprot
-
IntEnz dataset in XML format:
./nosqlbiosets/intenz
-
ModelSEEDDatabase compounds and reactions data files in tsv format:
./nosqlbiosets/modelseed/index.py
-
MetaNetX compounds and reactions:
./nosqlbiosets/metanetx
-
HMDB proteins, metabolites datasets:
./hmdb#index-hmdb
-
DrugBank drugs and drug targets dataset:
./hmdb#index-drugbank
-
HGNC genenames.org, data files in json format, from EMBL-EBI:
./geneinfo/hgnc_geneinfo.py
(tests made with complete HGNC dataset)
-
PubMed and PMC articles:
./nosqlbiosets/pubmed
Datasets that have been added recently:
-
ClinVar, aggregated information about genomic variation and its relationship to human health, https://www.ncbi.nlm.nih.gov/clinvar/
./nosqlbiosets/variation/
-
FAERS, FDA adverse event reports archive, https://open.fda.gov/data/faers/
./nosqlbiosets/fda/
-
InterPro, protein families, http://www.ebi.ac.uk/interpro/
./nosqlbiosets/uniprot/interpro.py
Datasets that have received less attention since initial support was added to the project:
-
Metabolic network files in SBML format or PSAMM project's yaml format:
./nosqlbiosets/pathways/index_metabolic_networks.py
(tests made with BiGG and PSAMM collections)
-
PubChem BioAssay json files:
./nosqlbiosets/pubchem
-
WikiPathways gpml files:
./nosqlbiosets/pathways/index_wikipathways.py
-
Ensembl regulatory build GFF files:
./geneinfo/ensembl_regbuild.py
(at an early stage of development)
-
PubTator gene2pub and disease2pub mappings:
./nosqlbiosets/pubtator
-
RNAcentral identifier mappings:
./geneinfo/rnacentral_idmappings.py
-
KEGG pathway kgml/xml files:
./nosqlbiosets/kegg/index.py
(at an early stage of development; the KEGG data distribution policy makes us think twice before spending time on KEGG data)
The project aims to connect the above datasets by implementing query APIs for common query patterns over individual and multiple indexes. It also includes initial work on returning query results of the IntEnz, DrugBank, HMDB, ModelSEEDdb, and MetaNetX datasets as graphs.
A sister project, HSPsDB, aims to develop index scripts for sequence similarity search results, either in NCBI-BLAST JSON format or in the BLAST tabular format, which is also used by other search programs such as LAMBDA and DIAMOND. HSPsDB also aims to link the indexed search results to the datasets indexed with this project, nosqlbiosets.
Installation
Download the nosqlbiosets project source code and install the required libraries:
git clone https://bitbucket.org/hspsdb/nosql-biosets.git
cd nosql-biosets
pip install -r requirements.txt --user
The project can be installed using the setup.py develop and --user options, which should allow running the index scripts from the project source folders:
python setup.py develop --user
Default values for the hostnames and port numbers of the Elasticsearch and MongoDB servers are read from the ./conf/dbservers.json file.
Save your settings in this file to avoid entering the --host and --port parameters on the command line.
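As a rough illustration of how file defaults and command-line overrides can be combined, the sketch below reads a JSON settings file and lets explicit values take precedence. The key names (es_host, es_port) and the helper resolve_server are hypothetical; the actual layout of ./conf/dbservers.json is not described here and may well differ.

```python
import json
import os
import tempfile


def resolve_server(conf_path, host=None, port=None,
                   host_key="es_host", port_key="es_port"):
    """Return (host, port), preferring explicit arguments over file defaults.

    The key names are hypothetical placeholders, not the project's schema.
    """
    with open(conf_path) as f:
        conf = json.load(f)
    return (host if host is not None else conf[host_key],
            port if port is not None else conf[port_key])


# Self-contained demo using a temporary settings file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dbservers.json")
    with open(path, "w") as f:
        json.dump({"es_host": "localhost", "es_port": 9200}, f)
    defaults = resolve_server(path)                        # values from the file
    overridden = resolve_server(path, host="es.example.org")  # --host given
```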
Usage
Example command lines for downloading the UniProt Knowledgebase Swiss-Prot dataset (~690M) and for indexing it:
$ wget ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/\
knowledgebase/complete/uniprot_sprot.xml.gz
Make sure your Elasticsearch server is running on localhost.
If you are new to Elasticsearch and you are using Linux, the easiest way is to download Elasticsearch with the TAR option (~32M).
After extracting the tar file, cd to your Elasticsearch folder and run the ./bin/elasticsearch command.
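One way to confirm the server is up before starting a long indexing run is to request its root URL, which returns a small JSON document that includes the version number. The sketch below is only an illustration and is not part of the project; the function name elasticsearch_version is made up, and it assumes an unsecured server on the default port 9200 (a server with authentication enabled would be reported as unreachable).

```python
import http.client
import json
import urllib.request


def elasticsearch_version(base_url="http://localhost:9200", timeout=2.0):
    """Return the server's version string, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            info = json.load(resp)
        return info["version"]["number"]
    except (OSError, ValueError, KeyError, TypeError,
            http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or an unexpected reply
        return None


if __name__ == "__main__":
    version = elasticsearch_version()
    if version is None:
        print("Elasticsearch is not reachable on localhost:9200")
    else:
        print("Elasticsearch is running, version", version)
```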
The downloaded UniProt XML file can be indexed by running the following command from the nosqlbiosets project root folder. Indexing typically requires 2 to 8 hours with Elasticsearch, and between 1 and 5 hours with MongoDB:
./nosqlbiosets/uniprot/index.py ./uniprot_sprot.xml.gz \
 --host localhost --db Elasticsearch --index uniprot
Example query: list most mentioned gene names
curl -XGET "http://localhost:9200/uniprot/_search?pretty=true" \
 -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "genes": {
      "terms": {
        "field": "gene.name.#text.keyword",
        "size": 5
      },
      "aggs": {
        "tids": {
          "terms": {
            "field": "gene.name.type.keyword",
            "size": 5
          }
        }
      }
    }
  }
}'
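The same aggregation can also be sent from Python. The following is a minimal sketch using only the standard library, not the project's own query API; the function names are made up for illustration, and it assumes the index is named uniprot and served from localhost:9200 as in the curl example. It uses POST rather than GET, which the Elasticsearch search endpoint accepts for requests with a body.

```python
import json
import urllib.request


def gene_name_aggregation(top_n=5):
    """Build the same request body as the curl example."""
    return {
        "size": 0,  # no hits, aggregation results only
        "aggs": {
            "genes": {
                "terms": {"field": "gene.name.#text.keyword", "size": top_n},
                "aggs": {
                    "tids": {
                        "terms": {"field": "gene.name.type.keyword",
                                  "size": top_n}
                    }
                },
            }
        },
    }


def search(body, url="http://localhost:9200/uniprot/_search", timeout=30):
    """POST a search body to Elasticsearch and return the decoded response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def top_genes(response):
    """Flatten the nested buckets into (gene name, count, name types)."""
    return [
        (b["key"], b["doc_count"], [t["key"] for t in b["tids"]["buckets"]])
        for b in response["aggregations"]["genes"]["buckets"]
    ]


if __name__ == "__main__":
    for name, count, types in top_genes(search(gene_name_aggregation())):
        print(name, count, types)
```

Keeping the body builder separate from the HTTP call makes it easy to inspect or adjust the query without a running server.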
Check ./tests/test_uniprot_queries.py and ./nosqlbiosets/uniprot/query.py for example queries with Elasticsearch and MongoDB.
Similar Work
-
https://github.com/daler/gffutils: "GFF and GTF files are loaded into SQLite3 databases, allowing much more complex manipulation of hierarchical features (e.g., genes, transcripts, and exons) than is possible with plain-text methods alone"
We are inspired by the gffutils project. Needless to say, the nosql-biosets project does not yet have a level of maturity comparable to the gffutils library.
-
https://github.com/quinlan-lab/vcf2db (SQLite, MySQL, PostgreSQL)
Copyright
The NoSQL-biosets project has been developed at King Abdullah University of Science and Technology, http://www.kaust.edu.sa
The NoSQL-biosets project is licensed under the MIT license.
This project has not reached a good level of maturity and has stalled.
Acknowledgements
- Computers and systems used in developing this work have been maintained by John Hanks, Arnaud Hungler, and Mohammed Saif