Search engine test
Add notebook comparing search engine call and mapr call cc @pwalczysko
The notebook works as expected with a list of
"pax1", "pep", "pax", "pax2", "pax3", "pax4", "pax5", "pax6", "pax7", "ciz1", "spen"
When the list of genes is widened to
"pax1", "pep", "pax", "pax2", "pax3", "pax4", "pax5", "pax6", "pax7", "ciz1", "spen", "p", "pa", "pb", "pc", "pk", "pn", "pr", "pu", "px", "p11", "p30", "p42", "p53", "p76", "pa1", "pad", "pag", "pah", "pak", "pal", "pan", "pav", "pb1", "pbk", "pbl", "pc4", "pcd", "pck", "pcl", "pcm", "pcp", "pcs", "pcx", "pdc", "pdf", "pdh", "pdi", "pdk", "pdp", "pea", "peb", "pek", "pen", "per", "pes", "pez", "pf4", "pfk", "pgc", "pgf", "pgi", "pgk", "pgm", "pgp", "pgr", "php"
then I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
File <timed exec>:1, in <module>
Input In [51], in load_using_search_api()
7 url = KEY_VALUE_SEARCH.format(**qs1)
8 json = session.get(url).json()
----> 9 images = json['results']['results']
10 for image in images:
11 if image['id'] not in ids:
TypeError: list indices must be integers or slices, not str
The error is in the cell "Search using search engine"
Thanks. This is probably because no results are found for some genes. I will adjust that.
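For reference, the failure mode can be guarded against with a small helper along these lines (a sketch with assumed names, not the actual fix): when a gene has no hits, the `results` field may be a list rather than the nested dict, so indexing it with a string raises the `TypeError`.

```python
def extract_images(payload):
    """Return the list of image hits from a search-engine response.

    When a gene has no hits, the response may carry a list (or an
    empty container) under "results" instead of the nested dict,
    which is what triggered the TypeError above.
    """
    results = payload.get("results") if isinstance(payload, dict) else None
    if isinstance(results, dict):
        return results.get("results", [])
    return []  # no hits: treat as an empty result set
```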
@pwalczysko fixed
Thanks, that works.
But further, for some reason, when a non-existent gene is searched for, the test fails (should it? One could argue that it should pass).
The test fails as shown below. Note that I added print statements, which show that the list contained a non-existent gene called blah.
I think it would be good to either
- make the test pass (as both search approaches should deliver an empty list?)
- or warn the user that one search (or both) was completely empty.
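The second option could look something like this (a hypothetical helper, not part of the notebook):

```python
import warnings

def warn_if_empty(results, results_mapr):
    """Emit a warning instead of silently passing when either search
    came back completely empty, so empty-vs-empty comparisons are
    flagged to the user."""
    if not results:
        warnings.warn("search engine returned no results")
    if not results_mapr:
        warnings.warn("mapr returned no results")
```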
print(added)
print(len(added))
print(removed)
print(len(removed))
print(modified)
print(len(modified))
assert len(added) == 0
assert len(removed) == 0
assert len(modified) == 0
assert len(same) == len(ITEMS)
{'blah'}
1
set()
0
{}
0
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Input In [44], in <cell line: 7>()
5 print (modified)
6 print (len(modified))
----> 7 assert len(added) == 0
8 assert len(removed) == 0
9 assert len(modified) == 0
AssertionError:
I will sort that out
@pwalczysko fixed
Thanks @jburel, the fix works fine when a small number of genes is passed in the list.
Now with a list such as
"pax1", "pep", "pax", "pax2", "pax3", "pax4", "pax5", "pax6", "pax7", "ciz1", "spen", "p", "pa", "pb", "pc", "pk", "pn", "pr", "pu", "px", "p11", "p30", "p47", "p53", "p76", "pa1", "pad", "pag", "pah", "pak", "pal", "pan", "pav", "pb1", "pbk", "pbl", "pc4", "pcd", "pck", "pcl", "pcm", "pcp", "pcs", "pcx", "pdc", "pdf", "pdh", "pdi", "pdk", "pdp", "pea", "peb", "pek", "pen", "per", "pes", "pez", "pf4", "pfk", "pgc", "pgf", "pgi", "pgk", "pgm", "pgp", "pgr", "php3", "neco"
I persistently get:
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.
Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
The data-rate error is caused by the line
print(results_mapr)
It probably prints too much output. When I comment out this line, everything works fine.
@pwalczysko I have added the ability to load all the possible values for a given key. The values are sorted.
Searching for all the values afterwards is not recommended, so I have added the ability to search by interval, e.g. 0-10 20-30
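If the intervals index into the sorted value list, expanding a spec such as `0-10 20-30` might look like this (an assumption about the semantics, with inclusive bounds):

```python
def select_intervals(values, spec):
    """Expand an interval spec such as "0-10 20-30" into the
    corresponding slices of a sorted value list (bounds inclusive)."""
    selected = []
    for part in spec.split():
        start, end = (int(x) for x in part.split("-"))
        selected.extend(values[start:end + 1])
    return selected
```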
Thanks, works fine.
For a search between 0 and 500 I get
- a failure of assert len(added) == 0
- when I print(added) I get 28 items:
{'acp7', 'ac012476.1', 'ac073896.1', 'ac171558.1', 'abraxas1', 'ac008695.1', 'acod1', 'ac008687.1', 'ac004754.3', 'ac240274.1', 'ac004556.1', 'ac011462.1', 'ac022414.1', 'ac171558.2', 'ac145212.1', 'ac138969.4', 'ac104534.3', 'ac136352.1', 'abhd18', 'ac023055.1', 'ac092718.8', 'ac009163.4', 'ac010531.1', 'ac092718.3', 'ac126283.2', 'abraxas2', 'ac091959.3', 'ac006538.4'}
28
Does that mean that search_engine is returning 28 more search results than mapr?
Edit: for a search between 501 and 1000, the added test also fails; print(added) gives:
{'agap5', 'akain1', 'agap6', 'adgre1', 'afg1l', 'af165138.7', 'agap9'}
7
Note that for these long searches, mapr takes some 42 minutes against 55 seconds for search_engine.
I will have to investigate
Yes, I think that
def dict_compare(d1, d2):
...
added = d1_keys - d2_keys
...
dict_compare(results, results_mapr)
...
added, removed, modified, same = dict_compare(results, results_mapr)
means that there are more search_engine keys than mapr keys. I wonder how that could be possible?
Edit: I have also confirmed that the result is repeatable; the list of added keys does not vary between runs of the notebook with the same params.
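For context, the comparison helper quoted above is commonly written in full as follows (a reconstruction consistent with the snippet and the printed output, not necessarily the notebook's exact code):

```python
def dict_compare(d1, d2):
    """Compare two result dicts keyed by gene name.

    added    : keys only in d1 (search_engine hits missing from mapr)
    removed  : keys only in d2 (mapr hits missing from search_engine)
    modified : shared keys whose values differ
    same     : shared keys whose values are equal
    """
    d1_keys, d2_keys = set(d1), set(d2)
    shared = d1_keys & d2_keys
    added = d1_keys - d2_keys
    removed = d2_keys - d1_keys
    modified = {k: (d1[k], d2[k]) for k in shared if d1[k] != d2[k]}
    same = {k for k in shared if d1[k] == d2[k]}
    return added, removed, modified, same
```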
It seems to be more a problem with the logic.
A direct mapr vs search_engine comparison with, for example, agap5 gives me the same result via the UI.
Tested genes between 0 and 1500. No mismatches; all looks good with the new commit (took something like 5 + 20 + 17 minutes on the mapr step).
Further tested 1500-3000, in three batches of 500. The test passes in full, but mapr can take as much as 40 minutes for a 500-gene search. I suppose this is because there are more results for those genes.
This means we have now 0 - 3000 tested.
only 47000 to go :-)
13000 (13 thousand) done as of today ;)
Between 18501 and 19000 I got an error executing the mapr cell (the search engine one returned fine):
---------------------------------------------------------------------------
JSONDecodeError Traceback (most recent call last)
File ~/opt/anaconda3/envs/idr_env/lib/python3.9/site-packages/requests/models.py:971, in Response.json(self, **kwargs)
970 try:
--> 971 return complexjson.loads(self.text, **kwargs)
972 except JSONDecodeError as e:
973 # Catch JSON-related errors and raise as requests.JSONDecodeError
974 # This aliases json.JSONDecodeError and simplejson.JSONDecodeError
File ~/opt/anaconda3/envs/idr_env/lib/python3.9/json/__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
343 if (cls is None and object_hook is None and
344 parse_int is None and parse_float is None and
345 parse_constant is None and object_pairs_hook is None and not kw):
--> 346 return _default_decoder.decode(s)
347 if cls is None:
File ~/opt/anaconda3/envs/idr_env/lib/python3.9/json/decoder.py:337, in JSONDecoder.decode(self, s, _w)
333 """Return the Python representation of ``s`` (a ``str`` instance
334 containing a JSON document).
335
336 """
--> 337 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
338 end = _w(s, end).end()
File ~/opt/anaconda3/envs/idr_env/lib/python3.9/json/decoder.py:355, in JSONDecoder.raw_decode(self, s, idx)
354 except StopIteration as err:
--> 355 raise JSONDecodeError("Expecting value", s, err.value) from None
356 return obj, end
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
During handling of the above exception, another exception occurred:
JSONDecodeError Traceback (most recent call last)
File <timed exec>:1, in <module>
Input In [11], in load_using_mapr(values)
43 qs1 = {'key': KEY_MAPR, 'value': item}
44 url1 = MAPR_URL.format(**qs1)
---> 45 json = session.get(url1).json()
46 for m in json['maps']:
47 qs2 = {'key': KEY_MAPR, 'value': item}
File ~/opt/anaconda3/envs/idr_env/lib/python3.9/site-packages/requests/models.py:975, in Response.json(self, **kwargs)
971 return complexjson.loads(self.text, **kwargs)
972 except JSONDecodeError as e:
973 # Catch JSON-related errors and raise as requests.JSONDecodeError
974 # This aliases json.JSONDecodeError and simplejson.JSONDecodeError
--> 975 raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Edit: this was an intermittent error; it did not repeat on a second run.
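Since the failure was transient, a simple retry around the request would make the long mapr runs more robust (a sketch; `session` and the URL come from the traceback, the retry parameters are assumptions):

```python
import time

def get_json_with_retry(session, url, retries=3, backoff=2.0):
    """Fetch a URL and decode JSON, retrying on transient decode
    failures such as an empty body from an overloaded server."""
    for attempt in range(retries):
        try:
            return session.get(url).json()
        except ValueError:  # requests' JSONDecodeError subclasses ValueError
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))  # simple linear backoff
```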
@jburel now I am consistently getting the following error on the cell
values = load_values_for_given_key()
---------------------------------------------------------------------------
JSONDecodeError Traceback (most recent call last)
File ~/opt/anaconda3/envs/idr_env/lib/python3.9/site-packages/requests/models.py:971, in Response.json(self, **kwargs)
970 try:
--> 971 return complexjson.loads(self.text, **kwargs)
972 except JSONDecodeError as e:
973 # Catch JSON-related errors and raise as requests.JSONDecodeError
974 # This aliases json.JSONDecodeError and simplejson.JSONDecodeError
File ~/opt/anaconda3/envs/idr_env/lib/python3.9/json/__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
343 if (cls is None and object_hook is None and
344 parse_int is None and parse_float is None and
345 parse_constant is None and object_pairs_hook is None and not kw):
--> 346 return _default_decoder.decode(s)
347 if cls is None:
File ~/opt/anaconda3/envs/idr_env/lib/python3.9/json/decoder.py:340, in JSONDecoder.decode(self, s, _w)
339 if end != len(s):
--> 340 raise JSONDecodeError("Extra data", s, end)
341 return obj
JSONDecodeError: Extra data: line 1 column 5 (char 4)
During handling of the above exception, another exception occurred:
JSONDecodeError Traceback (most recent call last)
Input In [13], in <cell line: 1>()
----> 1 values = load_values_for_given_key()
Input In [6], in load_values_for_given_key()
4 qs1 = {'type': 'image', 'key': KEY}
5 url = KEYS_SEARCH.format(**qs1)
----> 6 json = session.get(url).json()
7 for d in json['data']:
8 if d['Value']:
File ~/opt/anaconda3/envs/idr_env/lib/python3.9/site-packages/requests/models.py:975, in Response.json(self, **kwargs)
971 return complexjson.loads(self.text, **kwargs)
972 except JSONDecodeError as e:
973 # Catch JSON-related errors and raise as requests.JSONDecodeError
974 # This aliases json.JSONDecodeError and simplejson.JSONDecodeError
--> 975 raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
JSONDecodeError: Extra data: line 1 column 5 (char 4)