mygene.info
Allow for exact matches only?
Querying something like "BRCA1", I get a lot of seemingly unrelated matches such as "BRAT1".
This is obviously a symptom of the nature of ElasticSearch. In analytical use cases, personally, I think fuzzy matches are dangerous.
Could we add a query parameter to require an exact match? Or maybe it exists and I'm not seeing the docs?
@buchanae A general query like q=BRCA1 matches multiple fields, such as symbol, name, and others, but fuzzy matching is not used. The "BRAT1" gene matches because "BRCA1" is mentioned in its gene name.
You can get exactly what you need by using the fielded query:
or limited to human only:
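As a minimal sketch of what such fielded queries can look like, the snippet below builds the two URLs without sending any request. The base URL (the v3 query endpoint), the `symbol:` field prefix, and the `species` parameter are assumptions based on the public MyGene.info query syntax; `build_query_url` is a hypothetical helper, not part of any library.

```python
from urllib.parse import urlencode

# Assumption: the MyGene.info v3 query endpoint.
BASE_URL = "https://mygene.info/v3/query"


def build_query_url(symbol, species=None):
    """Build a fielded query URL that matches only the `symbol` field.

    Hypothetical helper for illustration; it does not send a request.
    """
    params = {"q": f"symbol:{symbol}"}
    if species is not None:
        # Assumption: `species` accepts common names such as "human".
        params["species"] = species
    return f"{BASE_URL}?{urlencode(params)}"


print(build_query_url("BRCA1"))
print(build_query_url("BRCA1", species="human"))
```

Note that `urlencode` percent-encodes the colon, so the query string contains `symbol%3ABRCA1`.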
Ah, ok, thanks!
I actually can't even reproduce the results I mentioned now. Wish I had posted the query.
These are the queries I tried this morning: https://gist.github.com/buchanae/5cba60894e190c35da1ac3e1c7e5e511
Here's an example I don't understand:
import mygene
mg = mygene.MyGeneInfo()
mg.querymany(["CBLB"], species='human', fields="symbol,alias,ensembl.gene", scopes="symbol,alias")
querying 1-1...done.
Finished.
1 input query terms found dup hits:
[('CBLB', 2)]
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
[{'query': 'CBLB',
'_id': '868',
'_score': 89.78527,
'alias': ['Cbl-b', 'Nbla00127', 'RNF56'],
'ensembl': {'gene': 'ENSG00000114423'},
'symbol': 'CBLB'},
{'query': 'CBLB',
'_id': '326625',
'_score': 9.830278,
'alias': ['ATR', 'CFAP23', 'cblB', 'cob'],
'ensembl': {'gene': 'ENSG00000139428'},
'symbol': 'MMAB'}]
Since I'm not passing returnall=True, shouldn't this return only the best hit?
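If only one hit per query term is wanted, one option is to reduce the returned list on the client side. The sketch below keeps the highest-scoring hit per term, using the CBLB output above as input. `best_hits` is a hypothetical helper, and treating the top `_score` as "the best hit" is an assumption that may not hold for every query.

```python
def best_hits(hits):
    """Return a dict mapping each query term to its highest-scoring hit.

    Hypothetical helper: entries without a `_score` key are skipped.
    """
    best = {}
    for hit in hits:
        if "_score" not in hit:
            continue
        query = hit["query"]
        if query not in best or hit["_score"] > best[query]["_score"]:
            best[query] = hit
    return best


# The two hits returned for "CBLB" in the example above.
hits = [
    {"query": "CBLB", "_id": "868", "_score": 89.78527,
     "alias": ["Cbl-b", "Nbla00127", "RNF56"],
     "ensembl": {"gene": "ENSG00000114423"}, "symbol": "CBLB"},
    {"query": "CBLB", "_id": "326625", "_score": 9.830278,
     "alias": ["ATR", "CFAP23", "cblB", "cob"],
     "ensembl": {"gene": "ENSG00000139428"}, "symbol": "MMAB"},
]

top = best_hits(hits)
print(top["CBLB"]["symbol"])  # CBLB
```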
And another.
mg.querymany(["MCM3"], species='human', fields="symbol,alias,ensembl.gene", scopes="symbol,alias")
querying 1-1...done.
Finished.
1 input query terms found dup hits:
[('MCM3', 2)]
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
[{'query': 'MCM3',
'_id': '4172',
'_score': 84.13076,
'alias': ['HCC5', 'P1-MCM3', 'P1.h', 'RLFB'],
'ensembl': {'gene': 'ENSG00000112118'},
'symbol': 'MCM3'},
{'query': 'MCM3',
'_id': '4176',
'_score': 5.8433404,
'alias': ['CDC47',
'MCM2',
'P1.1-MCM3',
'P1CDC47',
'P85MCM',
'PNAS146',
'PPP1R104'],
'ensembl': {'gene': 'ENSG00000166508'},
'symbol': 'MCM7'}]
As far as I can tell, the second match is happening because of a partial match on the alias string P1.1-MCM3.
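Until the server can enforce exact matching, partial matches like this can be filtered out after the fact. The sketch below keeps a hit only when the query term equals the symbol or one of the aliases exactly. `exact_matches` is a hypothetical helper, and the case-sensitive comparison is a design choice of this sketch, not documented API behavior.

```python
def exact_matches(hits):
    """Keep hits whose symbol or one alias equals the query term exactly.

    Hypothetical helper; the comparison is case-sensitive by choice.
    """
    kept = []
    for hit in hits:
        aliases = hit.get("alias", [])
        if isinstance(aliases, str):
            # Guard in case a single alias arrives as a plain string.
            aliases = [aliases]
        if hit["query"] == hit.get("symbol") or hit["query"] in aliases:
            kept.append(hit)
    return kept


# The two hits returned for "MCM3" in the example above.
hits = [
    {"query": "MCM3", "_id": "4172", "_score": 84.13076,
     "alias": ["HCC5", "P1-MCM3", "P1.h", "RLFB"],
     "ensembl": {"gene": "ENSG00000112118"}, "symbol": "MCM3"},
    {"query": "MCM3", "_id": "4176", "_score": 5.8433404,
     "alias": ["CDC47", "MCM2", "P1.1-MCM3", "P1CDC47",
               "P85MCM", "PNAS146", "PPP1R104"],
     "ensembl": {"gene": "ENSG00000166508"}, "symbol": "MCM7"},
]

print([hit["symbol"] for hit in exact_matches(hits)])  # ['MCM3']
```

A case-sensitive comparison would also drop the MMAB hit in the CBLB example, since its alias is spelled 'cblB'. Whether that is desirable depends on how much the alias casing can be trusted.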
@buchanae The "alias" field was indexed as free text because we observed that its values can sometimes contain whitespace. We can inspect the alias field further and optimize the indexing a bit (e.g. by not treating "-" as a word separator).
The "alias" field comes from the entrez_gene collection, which currently contains 21M documents:
- 14642 documents have an alias containing a space (e.g. gene 814677, "SEC12P-like 2 protein")
- 117911 documents have an alias containing a "-" (e.g. gene 35543593, "xcc-b100_0084")
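To illustrate why the word separator matters, here is a deliberately simplified tokenizer. It is a stand-in for free-text indexing in general, not the actual Elasticsearch analyzer used by MyGene.info; `tokenize` and `alias_matches` are hypothetical names.

```python
import re


def tokenize(value, split_on_hyphen=True):
    """Split a value into lowercase tokens.

    Simplified stand-in for a free-text analyzer: splits on whitespace,
    and optionally on "-" as well.
    """
    pattern = r"[\s\-]+" if split_on_hyphen else r"\s+"
    return [token.lower() for token in re.split(pattern, value) if token]


def alias_matches(query, alias, split_on_hyphen=True):
    """Return True if the query equals any token of the alias."""
    return query.lower() in tokenize(alias, split_on_hyphen)


# With "-" as a separator, "P1.1-MCM3" yields the token "mcm3".
print(alias_matches("MCM3", "P1.1-MCM3"))                         # True
# Without it, the alias stays one token and no longer matches.
print(alias_matches("MCM3", "P1.1-MCM3", split_on_hyphen=False))  # False
```

Under this simplification, keeping "-" inside tokens removes the MCM7 hit for "MCM3", while aliases that contain spaces are still split into separate words.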