idr-notebooks Search engine test

Add a notebook comparing the search engine call and the mapr call. cc @pwalczysko

Sep 06 '22 10:09 jburel

:point_left: Launch a binder notebook on branch search_engine_test

Sep 06 '22 10:09 github-actions[bot]

The notebook works as expected with a list of

"pax1", "pep", "pax", "pax2", "pax3", "pax4", "pax5", "pax6", "pax7", "ciz1", "spen"

When the list of genes is widened to

"pax1", "pep", "pax", "pax2", "pax3", "pax4", "pax5", "pax6", "pax7", "ciz1", "spen", "p", "pa", "pb", "pc", "pk", "pn", "pr", "pu", "px", "p11", "p30", "p42", "p53", "p76", "pa1", "pad", "pag", "pah", "pak", "pal", "pan", "pav", "pb1", "pbk", "pbl", "pc4", "pcd", "pck", "pcl", "pcm", "pcp", "pcs", "pcx", "pdc", "pdf", "pdh", "pdi", "pdk", "pdp", "pea", "peb", "pek", "pen", "per", "pes", "pez", "pf4", "pfk", "pgc", "pgf", "pgi", "pgk", "pgm", "pgp", "pgr", "php"

then I get an error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File <timed exec>:1, in <module>

Input In [51], in load_using_search_api()
      7 url = KEY_VALUE_SEARCH.format(**qs1)  
      8 json = session.get(url).json()
----> 9 images = json['results']['results']
     10 for image in images:
     11     if image['id'] not in ids:

TypeError: list indices must be integers or slices, not str

The error is in the cell "Search using search engine"

Sep 06 '22 12:09 pwalczysko

Thanks. This is probably because no results are found for some genes. I will adjust that.
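A minimal sketch of one way to guard against this, assuming (not confirmed in this thread) that the search engine returns a nested `results` dict on a hit and an empty list on a miss, which would explain the `TypeError`. The helper name `extract_images` is hypothetical and not part of the notebook.

```python
def extract_images(payload):
    """Return the list of image records, or [] when nothing was found."""
    results = payload.get("results") if isinstance(payload, dict) else None
    # Assumption: on a hit, "results" is a dict holding a nested "results" list;
    # on a miss it is an empty list, so indexing it with a string raises TypeError.
    if isinstance(results, dict):
        return results.get("results") or []
    return []
```

With this, `images = extract_images(json)` could replace `images = json['results']['results']`, and the loop over `images` simply does nothing for genes without results.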

Sep 06 '22 12:09 jburel

@pwalczysko fixed

Sep 06 '22 12:09 jburel

Thanks, that works.

But further, when a non-existing gene is searched for, the test fails (should it? One can argue that it should pass).

The test fails with the output below. Note that I added print statements, which show that the list contained a non-existing gene called blah.

I think it would be good to either

- make the test not fail (as both search approaches should deliver an empty list?), or
- warn the user that one search (or both) was completely empty.

print (added)
print (len(added))
print (removed)
print (len(removed))
print (modified)
print (len(modified))
assert len(added) == 0
assert len(removed) == 0
assert len(modified) == 0
assert len(same) == len(ITEMS)


{'blah'}
1
set()
0
{}
0

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [44], in <cell line: 7>()
      5 print (modified)
      6 print (len(modified))
----> 7 assert len(added) == 0
      8 assert len(removed) == 0
      9 assert len(modified) == 0

AssertionError:
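
The second suggestion above (warning about empty searches) could be sketched roughly as follows. This assumes that `results` and `results_mapr` are dicts keyed by the searched value, which the thread implies but does not show; the function name `report_empty` is hypothetical.

```python
import warnings


def report_empty(results, results_mapr):
    """Warn about values for which one or both searches returned nothing."""
    sources = (("search_engine", results), ("mapr", results_mapr))
    empty = []
    for value in sorted(set(results) | set(results_mapr)):
        # A value counts as empty for a source if it is absent or maps to nothing.
        missing_in = [name for name, found in sources if not found.get(value)]
        if missing_in:
            warnings.warn(f"no results for {value!r} from: {', '.join(missing_in)}")
            empty.append((value, missing_in))
    return empty
```

Calling this before the assertions would make a gene like blah visible as a warning rather than only as a failed `assert`.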

Sep 06 '22 14:09 pwalczysko

I will sort that out

Sep 06 '22 14:09 jburel

@pwalczysko fixed

Sep 07 '22 09:09 jburel

Thanks @jburel, the fix works fine when a small number of genes is passed in the list.

Now with a list such as "pax1", "pep", "pax", "pax2", "pax3", "pax4", "pax5", "pax6", "pax7", "ciz1", "spen", "p", "pa", "pb", "pc", "pk", "pn", "pr", "pu", "px", "p11", "p30", "p47", "p53", "p76", "pa1", "pad", "pag", "pah", "pak", "pal", "pan", "pav", "pb1", "pbk", "pbl", "pc4", "pcd", "pck", "pcl", "pcm", "pcp", "pcs", "pcx", "pdc", "pdf", "pdh", "pdi", "pdk", "pdp", "pea", "peb", "pek", "pen", "per", "pes", "pez", "pf4", "pfk", "pgc", "pgf", "pgi", "pgk", "pgm", "pgp", "pgr", "php3", "neco"

I persistently get

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

Sep 07 '22 18:09 pwalczysko

The problem with the data rate exceeded is due to the line

print(results_mapr)

Probably too much to print out. When I comment out this line, all works fine.
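
Instead of removing the output entirely, a short summary could be printed. A minimal sketch, assuming `results_mapr` is a dict; the helper name `summarize` and the preview size are hypothetical choices.

```python
def summarize(results, limit=5):
    """Return a one-line summary of a dict instead of its full contents."""
    keys = sorted(results)
    preview = ", ".join(keys[:limit])
    suffix = ", ..." if len(keys) > limit else ""
    return f"{len(keys)} keys: {preview}{suffix}"
```

`print(summarize(results_mapr))` then emits a single line regardless of how many genes were searched, which should stay well below the IOPub data rate limit.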

Sep 07 '22 18:09 pwalczysko

@pwalczysko I have added the ability to load all the possible values for a given key. The values are sorted. Searching for all the values afterwards is not recommended, so I have added the ability to search by interval, e.g. 0-10, 20-30.
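
As a rough illustration of searching by interval, the sketch below slices the sorted values. It assumes a half-open interval (start included, stop excluded); whether the notebook treats the upper bound as inclusive is not stated here. The name `select_interval` is hypothetical.

```python
def select_interval(values, start, stop):
    """Return the values at positions start..stop-1 of the sorted list."""
    ordered = sorted(values)
    # Slicing never raises for out-of-range bounds; it just returns fewer items.
    return ordered[start:stop]
```

For example, `select_interval(values, 0, 500)` would give the first 500 values in sorted order, and `select_interval(values, 500, 1000)` the next 500.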

Sep 08 '22 12:09 jburel

Thanks, works fine.

For a search between 0 and 500, assert len(added) == 0 fails.
When I print (added) I get 28 items:

{'acp7', 'ac012476.1', 'ac073896.1', 'ac171558.1', 'abraxas1', 'ac008695.1', 'acod1', 'ac008687.1', 'ac004754.3', 'ac240274.1', 'ac004556.1', 'ac011462.1', 'ac022414.1', 'ac171558.2', 'ac145212.1', 'ac138969.4', 'ac104534.3', 'ac136352.1', 'abhd18', 'ac023055.1', 'ac092718.8', 'ac009163.4', 'ac010531.1', 'ac092718.3', 'ac126283.2', 'abraxas2', 'ac091959.3', 'ac006538.4'}
28

Does that mean that search_engine is returning 28 more search results than mapr?

Edit: For a search between 501 and 1000, the added test also fails; print (added) gives

{'agap5', 'akain1', 'agap6', 'adgre1', 'afg1l', 'af165138.7', 'agap9'}
7

Note that for these long searches, mapr takes some 42 minutes against 55 seconds for search_engine.

Sep 08 '22 14:09 pwalczysko

I will have to investigate

Sep 08 '22 14:09 jburel

Yes, I think that

def dict_compare(d1, d2):
...
added = d1_keys - d2_keys
...
dict_compare(results, results_mapr)
...
added, removed, modified, same = dict_compare(results, results_mapr)

means that there are more search_engine keys than mapr keys. I wonder how that could be possible?
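
The quoted snippet elides most of `dict_compare`. A minimal sketch of what such a function commonly looks like, assuming (this is not confirmed by the thread) that it is a plain key-set comparison whose `added` line matches the one quoted above:

```python
def dict_compare(d1, d2):
    """Compare two dicts by key and value; a sketch, not the notebook's code."""
    d1_keys = set(d1)
    d2_keys = set(d2)
    shared = d1_keys & d2_keys
    added = d1_keys - d2_keys      # keys only in d1
    removed = d2_keys - d1_keys    # keys only in d2
    modified = {k: (d1[k], d2[k]) for k in shared if d1[k] != d2[k]}
    same = {k for k in shared if d1[k] == d2[k]}
    return added, removed, modified, same
```

Under that assumption, with `dict_compare(results, results_mapr)`, a key lands in `added` when it is present in the search_engine results but absent from the mapr results. That says nothing about how many images each side returned for the keys they share; those differences would show up in `modified`.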

Edit: I have also confirmed that the result is repeatable: the list of added keys does not vary between runs of the notebook with the same params.

Sep 08 '22 17:09 pwalczysko

It seems to be more a problem with the logic. A direct comparison of mapr vs search_engine, for example with agap5, gives me the same result via the UI.

Sep 08 '22 19:09 jburel

Tested genes between 0 and 1500. No mismatches, all looks good with the new commit (took something like 5 + 20 + 17 minutes on the mapr step).

Sep 14 '22 18:09 pwalczysko

Tested further, 1500-3000, in three batches of 500. The test passes in full, but mapr can take as long as 40 minutes for a 500-gene search. I suppose that this is because there are more results for those genes.

This means we now have 0-3000 tested.

Sep 15 '22 12:09 pwalczysko

only 47000 to go :-)

Sep 15 '22 13:09 jburel

13000 (13 thousand) done as of today ;)

Sep 16 '22 15:09 pwalczysko

Between 18501 and 19000 I got an error on the mapr cell execution (the search engine one returned fine):

---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
File ~/opt/anaconda3/envs/idr_env/lib/python3.9/site-packages/requests/models.py:971, in Response.json(self, **kwargs)
    970 try:
--> 971     return complexjson.loads(self.text, **kwargs)
    972 except JSONDecodeError as e:
    973     # Catch JSON-related errors and raise as requests.JSONDecodeError
    974     # This aliases json.JSONDecodeError and simplejson.JSONDecodeError

File ~/opt/anaconda3/envs/idr_env/lib/python3.9/json/__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    343 if (cls is None and object_hook is None and
    344         parse_int is None and parse_float is None and
    345         parse_constant is None and object_pairs_hook is None and not kw):
--> 346     return _default_decoder.decode(s)
    347 if cls is None:

File ~/opt/anaconda3/envs/idr_env/lib/python3.9/json/decoder.py:337, in JSONDecoder.decode(self, s, _w)
    333 """Return the Python representation of ``s`` (a ``str`` instance
    334 containing a JSON document).
    335 
    336 """
--> 337 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338 end = _w(s, end).end()

File ~/opt/anaconda3/envs/idr_env/lib/python3.9/json/decoder.py:355, in JSONDecoder.raw_decode(self, s, idx)
    354 except StopIteration as err:
--> 355     raise JSONDecodeError("Expecting value", s, err.value) from None
    356 return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

JSONDecodeError                           Traceback (most recent call last)
File <timed exec>:1, in <module>

Input In [11], in load_using_mapr(values)
     43 qs1 = {'key': KEY_MAPR, 'value': item}
     44 url1 = MAPR_URL.format(**qs1)
---> 45 json = session.get(url1).json()
     46 for m in json['maps']:
     47     qs2 = {'key': KEY_MAPR, 'value': item}

File ~/opt/anaconda3/envs/idr_env/lib/python3.9/site-packages/requests/models.py:975, in Response.json(self, **kwargs)
    971     return complexjson.loads(self.text, **kwargs)
    972 except JSONDecodeError as e:
    973     # Catch JSON-related errors and raise as requests.JSONDecodeError
    974     # This aliases json.JSONDecodeError and simplejson.JSONDecodeError
--> 975     raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Edit: This was an intermittent error, did not repeat on second run.
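
For intermittent failures like this one, a small retry wrapper is one option. The sketch below is an assumption-laden illustration, not the notebook's code: `get_json_with_retry`, the attempt count, and the delay are hypothetical. It relies on the fact that the JSON decode errors raised by `Response.json()` are subclasses of `ValueError`.

```python
import time


def get_json_with_retry(session, url, attempts=3, delay=1.0):
    """Fetch url and decode JSON, retrying when the body is not valid JSON."""
    last_error = None
    for attempt in range(1, attempts + 1):
        response = session.get(url)
        try:
            return response.json()
        except ValueError as err:  # JSON decode errors subclass ValueError
            last_error = err
            if attempt < attempts:
                time.sleep(delay * attempt)  # simple linear back-off
    # All attempts failed: surface the last decode error to the caller.
    raise last_error
```

In `load_using_mapr`, `json = get_json_with_retry(session, url1)` would then stand in for `json = session.get(url1).json()`.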

Sep 21 '22 17:09 pwalczysko

@jburel now I am consistently getting the following error on the cell

values = load_values_for_given_key()

---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
File ~/opt/anaconda3/envs/idr_env/lib/python3.9/site-packages/requests/models.py:971, in Response.json(self, **kwargs)
    970 try:
--> 971     return complexjson.loads(self.text, **kwargs)
    972 except JSONDecodeError as e:
    973     # Catch JSON-related errors and raise as requests.JSONDecodeError
    974     # This aliases json.JSONDecodeError and simplejson.JSONDecodeError

File ~/opt/anaconda3/envs/idr_env/lib/python3.9/json/__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    343 if (cls is None and object_hook is None and
    344         parse_int is None and parse_float is None and
    345         parse_constant is None and object_pairs_hook is None and not kw):
--> 346     return _default_decoder.decode(s)
    347 if cls is None:

File ~/opt/anaconda3/envs/idr_env/lib/python3.9/json/decoder.py:340, in JSONDecoder.decode(self, s, _w)
    339 if end != len(s):
--> 340     raise JSONDecodeError("Extra data", s, end)
    341 return obj

JSONDecodeError: Extra data: line 1 column 5 (char 4)

During handling of the above exception, another exception occurred:

JSONDecodeError                           Traceback (most recent call last)
Input In [13], in <cell line: 1>()
----> 1 values = load_values_for_given_key()

Input In [6], in load_values_for_given_key()
      4 qs1 = {'type': 'image', 'key': KEY}
      5 url = KEYS_SEARCH.format(**qs1)  
----> 6 json = session.get(url).json()
      7 for d in json['data']:
      8     if d['Value']:

File ~/opt/anaconda3/envs/idr_env/lib/python3.9/site-packages/requests/models.py:975, in Response.json(self, **kwargs)
    971     return complexjson.loads(self.text, **kwargs)
    972 except JSONDecodeError as e:
    973     # Catch JSON-related errors and raise as requests.JSONDecodeError
    974     # This aliases json.JSONDecodeError and simplejson.JSONDecodeError
--> 975     raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)

JSONDecodeError: Extra data: line 1 column 5 (char 4)
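
One way to narrow this down is to look at the raw body rather than only the decode error. "Extra data ... (char 4)" means the parser read a complete JSON value and then found more text at offset 4. A plain-text body that begins with a three-digit number followed by a space would produce exactly that message, though this is only one possibility. The sketch below uses the standard `json` module; the sample body `"502 Bad Gateway"` is invented for illustration and is not taken from the server's actual response.

```python
import json


def describe_decode_failure(text):
    """Return (message, position, leading text) if text is not valid JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return err.msg, err.pos, text[:40]
    return None


# Invented sample body: the parser accepts the number 502, then hits extra text.
print(describe_decode_failure("502 Bad Gateway"))
```

Applying the same idea to `session.get(url).text`, together with the response's `status_code`, would show what the server is actually sending back.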

Sep 22 '22 18:09 pwalczysko