mygene.info

Allow for exact matches only?

Open buchanae opened this issue 5 years ago • 6 comments

Querying something like "BRCA1", I get a lot of seemingly unrelated matches such as "BRAT1".

This is obviously a symptom of the nature of ElasticSearch. In analytical use cases, personally, I think fuzzy matches are dangerous.

Could we add a query parameter to require an exact match? Or maybe it exists and I'm not seeing the docs?

buchanae avatar Jul 18 '18 21:07 buchanae

@buchanae A general query like q=BRCA1 matches multiple fields, such as symbol and name, but fuzzy matching is not used. The "BRAT1" gene matches because "BRCA1" is mentioned in its gene name.

You can get exactly what you need by using the fielded query:

q=symbol:BRCA1

or limited to human only:

q=symbol:BRCA1&species=human

newgene avatar Jul 18 '18 22:07 newgene
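
As a minimal sketch of the fielded query above, the snippet below only builds the request URL and makes no network call. The base URL https://mygene.info/v3/query and the helper name fielded_query_url are assumptions for illustration and are not taken from this thread, so check the current API documentation before relying on them.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed endpoint; not stated in this thread.
BASE_URL = "https://mygene.info/v3/query"


def fielded_query_url(field: str, value: str, species: Optional[str] = None) -> str:
    """Build a query URL restricted to one field, e.g. q=symbol:BRCA1 (hypothetical helper)."""
    params = {"q": f"{field}:{value}"}
    if species is not None:
        params["species"] = species
    # safe=":" keeps the field separator readable instead of encoding it as %3A.
    return f"{BASE_URL}?{urlencode(params, safe=':')}"


print(fielded_query_url("symbol", "BRCA1"))
print(fielded_query_url("symbol", "BRCA1", species="human"))
```

Restricting the query to the symbol field means a gene that only mentions "BRCA1" in its name no longer matches.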

Ah, ok, thanks!

I actually can't even reproduce the results I mentioned now. Wish I had posted the query.

These are the queries I tried this morning: https://gist.github.com/buchanae/5cba60894e190c35da1ac3e1c7e5e511

buchanae avatar Jul 19 '18 16:07 buchanae

Here's an example I don't understand:

import mygene
mg = mygene.MyGeneInfo()
mg.querymany(["CBLB"], species='human', fields="symbol,alias,ensembl.gene", scopes="symbol,alias")
querying 1-1...done.
Finished.
1 input query terms found dup hits:
	[('CBLB', 2)]
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
[{'query': 'CBLB',
  '_id': '868',
  '_score': 89.78527,
  'alias': ['Cbl-b', 'Nbla00127', 'RNF56'],
  'ensembl': {'gene': 'ENSG00000114423'},
  'symbol': 'CBLB'},
 {'query': 'CBLB',
  '_id': '326625',
  '_score': 9.830278,
  'alias': ['ATR', 'CFAP23', 'cblB', 'cob'],
  'ensembl': {'gene': 'ENSG00000139428'},
  'symbol': 'MMAB'}]

Since I'm not passing returnall=True, shouldn't this return only the best hit?

buchanae avatar Jul 19 '18 17:07 buchanae
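
If only one hit per query term is wanted, one possible client-side workaround is to keep the hit with the highest _score. This is a sketch, not a documented feature of the mygene client: best_hit_per_query is a hypothetical helper, the "notfound" key is assumed to mark terms without a match, and the sample hits are copied from the output above.

```python
from typing import Dict, List


def best_hit_per_query(hits: List[dict]) -> Dict[str, dict]:
    """Map each query term to its highest-scoring hit (hypothetical helper)."""
    best: Dict[str, dict] = {}
    for hit in hits:
        # Assumption: entries for unmatched terms carry a truthy "notfound" key.
        if hit.get("notfound"):
            continue
        term = hit["query"]
        if term not in best or hit.get("_score", 0.0) > best[term].get("_score", 0.0):
            best[term] = hit
    return best


# Hits copied from the querymany output shown in this thread.
hits = [
    {"query": "CBLB", "_id": "868", "_score": 89.78527,
     "alias": ["Cbl-b", "Nbla00127", "RNF56"],
     "ensembl": {"gene": "ENSG00000114423"}, "symbol": "CBLB"},
    {"query": "CBLB", "_id": "326625", "_score": 9.830278,
     "alias": ["ATR", "CFAP23", "cblB", "cob"],
     "ensembl": {"gene": "ENSG00000139428"}, "symbol": "MMAB"},
]

top = best_hit_per_query(hits)
print(top["CBLB"]["symbol"], top["CBLB"]["_id"])
```

Picking by score is a heuristic: it keeps the symbol match here, but it silently drops the alias match, which may or may not be what an analysis needs.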

And another.

mg.querymany(["MCM3"], species='human', fields="symbol,alias,ensembl.gene", scopes="symbol,alias")
querying 1-1...done.
Finished.
1 input query terms found dup hits:
	[('MCM3', 2)]
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
[{'query': 'MCM3',
  '_id': '4172',
  '_score': 84.13076,
  'alias': ['HCC5', 'P1-MCM3', 'P1.h', 'RLFB'],
  'ensembl': {'gene': 'ENSG00000112118'},
  'symbol': 'MCM3'},
 {'query': 'MCM3',
  '_id': '4176',
  '_score': 5.8433404,
  'alias': ['CDC47',
   'MCM2',
   'P1.1-MCM3',
   'P1CDC47',
   'P85MCM',
   'PNAS146',
   'PPP1R104'],
  'ensembl': {'gene': 'ENSG00000166508'},
  'symbol': 'MCM7'}]

As far as I can tell, the second match is happening because of a partial match on the alias string "P1.1-MCM3".

buchanae avatar Jul 19 '18 17:07 buchanae

@buchanae The "alias" field was indexed as free text, because we observed that its values sometimes contain whitespace. We can inspect the alias field further and optimize the indexing a bit (e.g. not treating "-" as a word separator).

newgene avatar Jul 24 '18 17:07 newgene
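
To illustrate why free-text indexing produces the extra hits above, here is a simplified sketch. simple_tokens is a stand-in that lowercases and splits on whitespace and "-"; it is an assumption for illustration and not the analyzer actually configured on the server. matches_exactly is a hypothetical client-side filter that accepts only whole-value, case-sensitive matches.

```python
import re
from typing import List


def simple_tokens(text: str) -> List[str]:
    """Simplified free-text tokenizer: lowercase, split on whitespace and '-'."""
    return [tok for tok in re.split(r"[\s\-]+", text.lower()) if tok]


def _values(hit: dict) -> List[str]:
    """Collect the symbol and aliases of a hit; alias may be a string or a list."""
    aliases = hit.get("alias", [])
    if isinstance(aliases, str):
        aliases = [aliases]
    return [hit.get("symbol", "")] + list(aliases)


def matches_as_free_text(term: str, hit: dict) -> bool:
    """True if any token of the symbol or an alias equals the lowercased term."""
    return any(term.lower() in simple_tokens(value) for value in _values(hit))


def matches_exactly(term: str, hit: dict) -> bool:
    """True only if the term equals the symbol or a whole alias, case-sensitively."""
    return term in _values(hit)


# Symbol and alias values copied from the outputs in this thread.
mcm7 = {"symbol": "MCM7", "alias": ["CDC47", "MCM2", "P1.1-MCM3", "P1CDC47",
                                    "P85MCM", "PNAS146", "PPP1R104"]}
mmab = {"symbol": "MMAB", "alias": ["ATR", "CFAP23", "cblB", "cob"]}

print(simple_tokens("P1.1-MCM3"))
print(matches_as_free_text("MCM3", mcm7), matches_exactly("MCM3", mcm7))
print(matches_as_free_text("CBLB", mmab), matches_exactly("CBLB", mmab))
```

Under these assumptions, MCM7 matches "MCM3" only through a token of "P1.1-MCM3", and MMAB matches "CBLB" only because the alias "cblB" is compared case-insensitively. Whether a case-insensitive alias match should count as exact is a design choice left to the caller.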

The "alias" field comes from the entrez_gene collection, which currently contains 21M documents:

  • 14642 documents have an alias containing a space (e.g. gene 814677, "SEC12P-like 2 protein")
  • 117911 documents have an alias containing a "-" (e.g. gene 35543593, "xcc-b100_0084")

sirloon avatar Aug 06 '18 15:08 sirloon