quarchive
Some downloaded pages aren't indexed properly
Description
Some crawled pages fail to index, possibly because the response had no Content-Type header. In the traceback below, the header value passed to cgi.parse_header is None.
Steps to reproduce
- Bookmark some pages
- Get them crawled
- Try to index them
Expected result
Indexed properly
Actual result
Errors out, not indexed
Additional details
[2020-03-09 11:21:12,420: ERROR/ForkPoolWorker-4] Task quarchive.ensure_fulltext[39d7d85f-0086-4203-ab90-8215894e2ba1] raised unexpected: TypeError('can only concatenate str (not "NoneType") to str')
Traceback (most recent call last):
  File "/home/quarchive/quarchive-venv/lib/python3.7/site-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/quarchive/quarchive-venv/lib/python3.7/site-packages/celery/app/trace.py", line 650, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/quarchive/quarchive-venv/lib/python3.7/site-packages/quarchive/__init__.py", line 1029, in ensure_fulltext
    content_type, _ = cgi.parse_header(content_type_header)
  File "/usr/lib/python3.7/cgi.py", line 243, in parse_header
    parts = _parseparam(';' + line)
TypeError: can only concatenate str (not "NoneType") to str
The first step is to get the mimetype right, since it is needed before any parsing can happen:
- [x] If no mimetype, infer
- [x] If invalid mimetype, infer
- [ ] If valid but wrong mimetype, try parsing, then infer and try one more time
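A rough sketch of how the three mimetype cases above could fit together. This is not the code in quarchive; `parse_mimetype`, `infer_mimetype`, `index_with_retry` and the `index` callable are hypothetical names, and the sketch assumes the indexer signals a parse failure by raising `ValueError`. The inference here is deliberately crude (a few magic-byte checks, then the URL extension).

```python
import mimetypes
import re
from typing import Callable, Optional

# Loose check for "type/subtype"; deliberately not a full RFC grammar.
_MIMETYPE_RE = re.compile(r"[a-z0-9][a-z0-9!#$&^_.+-]*/[a-z0-9][a-z0-9!#$&^_.+-]*")


def parse_mimetype(header: Optional[str]) -> Optional[str]:
    """Return the bare mimetype from a Content-Type header.

    Returns None when the header is missing or does not look like a mimetype,
    instead of raising as parse_header does on None.
    """
    if not header:
        return None
    mimetype = header.split(";", 1)[0].strip().lower()
    return mimetype if _MIMETYPE_RE.fullmatch(mimetype) else None


def infer_mimetype(url: str, body: bytes) -> str:
    """Guess a mimetype from the first bytes of the body, then from the URL."""
    head = body[:512].lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "text/html"
    if head.startswith(b"%pdf"):
        return "application/pdf"
    guessed, _ = mimetypes.guess_type(url)
    return guessed or "application/octet-stream"


def index_with_retry(
    index: Callable[[bytes, str], str],  # hypothetical indexer: (body, mimetype) -> text
    header: Optional[str],
    url: str,
    body: bytes,
) -> str:
    """Try the declared mimetype first; on failure retry once with the inferred one."""
    inferred = infer_mimetype(url, body)
    first = parse_mimetype(header) or inferred
    try:
        return index(body, first)
    except ValueError:  # assumed failure signal from the indexer
        if first == inferred:
            raise  # nothing different left to try
        return index(body, inferred)
```

With this shape, a missing or invalid header falls straight through to inference, and a valid but wrong header costs one failed parse before the inferred type is tried.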
The next step is to get the charset right:
- [ ] If no charset and mimetype is html-family, try parsing, then infer and try one more time
- [ ] If no charset and mimetype otherwise, infer
- [ ] If valid but wrong charset, try parsing, then infer and try one more time
- [ ] If invalid charset, infer
- [ ] If valid, pass charset to indexer
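The charset cases could be handled along similar lines. Again a sketch only, with hypothetical names (`parse_charset`, `infer_charset`, `decode_body`). It assumes "valid" means Python has a codec for the declared name, and it stands in a very simple inference (try UTF-8, fall back to Latin-1, which can decode any byte sequence) for whatever detection the project ends up using. The html-family case, where the parser may find a charset in the document itself, is not covered here.

```python
import codecs
import re
from typing import Optional, Tuple

_CHARSET_RE = re.compile(r'charset\s*=\s*"?([^";\s]+)', re.IGNORECASE)


def parse_charset(header: Optional[str]) -> Optional[str]:
    """Return the declared charset if present and known to Python, else None."""
    if not header:
        return None
    match = _CHARSET_RE.search(header)
    if match is None:
        return None  # no charset parameter at all
    try:
        return codecs.lookup(match.group(1)).name
    except LookupError:
        return None  # invalid charset name: caller should infer


def infer_charset(body: bytes) -> str:
    """Crude inference: UTF-8 if the body decodes cleanly, otherwise Latin-1."""
    try:
        body.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"


def decode_body(header: Optional[str], body: bytes) -> Tuple[str, str]:
    """Decode with the declared charset; if absent or wrong, infer and retry once."""
    declared = parse_charset(header)
    if declared is not None:
        try:
            return body.decode(declared), declared
        except UnicodeDecodeError:
            pass  # valid name but wrong for this body: fall through to inference
    inferred = infer_charset(body)
    return body.decode(inferred), inferred
```

The returned charset is whichever one actually decoded the body, so that is the value that would be passed on to the indexer.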