search_inside API does not work for some items with spaces in the document name
Searching the text contents of some items is failing with a message that simply states "Sorry, there was an error with your search. Please try again."
It appears to be tied to the fulltext/inside.php endpoint, and I suspect it is an issue of query params not being encoded somewhere in the API backend.
Evidence / Screenshot (if possible)
Here is the error thrown by ia-sentry.min.js in the developer console after executing a search:
Search Inside Response Error Whoops!
Traceback (most recent call last):
  File "./inside.py", line 158, in <module>
    reply = urllib.request.urlopen(es_url).read()
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 1397, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/usr/lib/python3.8/urllib/request.py", line 1354, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/usr/lib/python3.8/http/client.py", line 1256, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1267, in _send_request
    self.putrequest(method, url, **skips)
  File "/usr/lib/python3.8/http/client.py", line 1101, in putrequest
    self._validate_path(url)
  File "/usr/lib/python3.8/http/client.py", line 1201, in _validate_path
    raise InvalidURL(f"URL can't contain control characters. {url!r} "
http.client.InvalidURL: URL can't contain control characters. '/api/v1/searchinside?exists=true&ia_id=nintendo-magazine-system-uk-43-april-1996&filename=Nintendo Magazine System (UK) 43 April 1996_abbyy.gz' (found at least ' ')
It looks like the backend passes the doc name to the filename query param without encoding it, so the raw spaces end up in the request path.
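For anyone who wants to confirm the failure mode locally, here is a minimal sketch that triggers the same exception without touching the network. The host `example.invalid` and the helper name `path_error` are made up for illustration, and the sketch assumes a Python version whose `http.client` validates the request path the way the 3.8 in the traceback does. It is not the actual inside.py code, which I can't see.

```python
import http.client
from typing import Optional


def path_error(path: str) -> Optional[str]:
    """Return the InvalidURL message for `path`, or None if the path is accepted."""
    # Hypothetical host: putrequest() only buffers the request line,
    # so no connection is opened here.
    conn = http.client.HTTPConnection("example.invalid")
    try:
        conn.putrequest("GET", path)
    except http.client.InvalidURL as exc:
        return str(exc)
    finally:
        conn.close()
    return None


# Same shape as the path in the traceback, with raw spaces in the filename.
raw = "/api/v1/searchinside?exists=true&filename=Nintendo Magazine System (UK) 43 April 1996_abbyy.gz"
encoded = raw.replace(" ", "%20")

print(path_error(raw))      # prints the InvalidURL message
print(path_error(encoded))  # prints None
```

The raw path is rejected purely because of the spaces; the same path with `%20` in their place passes validation.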
The corresponding network request does appear to encode the doc name from the given payload:
item_id: nintendo-magazine-system-uk-43-april-1996
doc: Nintendo Magazine System (UK) 43 April 1996
path: /27/items/nintendo-magazine-system-uk-43-april-1996
q: "mario rpg"
pre_tag: {{{
post_tag: }}}
callback: jQuery36107855882184027965_1697993173593
URL: https://ia601906.us.archive.org/fulltext/inside.php?item_id=nintendo-magazine-system-uk-43-april-1996&doc=Nintendo%20Magazine%20System%20(UK)%2043%20April%201996&path=/27/items/nintendo-magazine-system-uk-43-april-1996&q=%22mario%20rpg%22&pre_tag=%7B%7B%7B&post_tag=%7D%7D%7D&callback=jQuery36107855882184027965_1697993173593
Relevant url?
Example of an item experiencing this error: https://archive.org/details/nintendo-magazine-system-uk-43-april-1996/
(This isn't explicitly an openlibrary.org url, so apologies if this is the wrong repo to file this issue under, but this endpoint is covered under the openlibrary API documentation, and I don't know which of the 200+ repos owned by internetarchive might contain "inside.py" or "inside.php".)
Steps to Reproduce
- I was searching in archive.org for items containing the text "mario rpg" (in quotes): https://archive.org/search?query=%22mario+rpg%22&sin=TXT
- The above linked item appears as a search result. Navigating into the item immediately shows the above screenshotted error.
- Loading the item URL in a new tab and searching for the quoted text displays the same error.
- Actual: The developer console displays the above ia-sentry.min.js JavaScript error, while the network console shows a successful 200 response for the above /inside.php request.
- Expected: The left side pane would show search results successfully.
Details
- Logged in (Y/N)? Y
- Browser type/version? Chrome 118.0.5993.70 ARM64
- Operating system? macOS Ventura 13.5.2
- Environment (prod/dev/local)? prod
Proposal & Constraints
The URL constructed in inside.py should be fully encoded, including the params. (probably simple enough to not need an example, but in the interest of meeting the requirements: https://stackoverflow.com/a/69811079/5306408)
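As a rough sketch of what that could look like, assuming the backend builds the path from an item id and a doc name: the function name `build_searchinside_path` and the way the filename is assembled are guesses based on the URL in the traceback, not the real inside.py. Only `urllib.parse.urlencode` and `urllib.parse.quote` are actual standard-library calls.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit


def build_searchinside_path(ia_id: str, doc: str) -> str:
    """Build the searchinside path with every query value percent-encoded."""
    # Parameter names are copied from the URL in the traceback; how the
    # filename is derived from the doc name is an assumption.
    params = {
        "exists": "true",
        "ia_id": ia_id,
        "filename": f"{doc}_abbyy.gz",
    }
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return "/api/v1/searchinside?" + urlencode(params, quote_via=quote)


path = build_searchinside_path(
    "nintendo-magazine-system-uk-43-april-1996",
    "Nintendo Magazine System (UK) 43 April 1996",
)
print(path)

# Decoding the query recovers the original filename, spaces included.
print(parse_qs(urlsplit(path).query)["filename"][0])
```

Using `urlencode` on a dict, rather than string concatenation, means any other reserved characters in a doc name (such as `&` or `#`) get encoded too, not just spaces.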
Related files
I can't actually find them. There's an inside.php and inside.py and one or the other is likely to be the culprit, but they don't appear to be in this repository. I opened the issue here because that endpoint is covered by the openlibrary API docs.
Stakeholders
Transferring this issue to BookReader repo :)
@mekarpeles Thank you! Although, I don't see an inside.php or inside.py in this repo either.
Hello everyone, can anyone please help with the project deployment process? I am facing some issues.