
Some downloaded pages aren't indexed properly

Open calpaterson opened this issue 4 years ago • 1 comments

Description

Some pages don't index properly, possibly because they have no Content-Type header

Steps to reproduce

  1. Bookmark some pages
  2. Get them crawled
  3. Try to index them

Expected result

Indexed properly

Actual result

Errors out, not indexed

Additional details

[2020-03-09 11:21:12,420: ERROR/ForkPoolWorker-4] Task quarchive.ensure_fulltext[39d7d85f-0086-4203-ab90-8215894e2ba1] raised unexpected: TypeError('can only concatenate str (not "NoneType") to str')
Traceback (most recent call last):
  File "/home/quarchive/quarchive-venv/lib/python3.7/site-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/quarchive/quarchive-venv/lib/python3.7/site-packages/celery/app/trace.py", line 650, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/quarchive/quarchive-venv/lib/python3.7/site-packages/quarchive/__init__.py", line 1029, in ensure_fulltext
    content_type, _ = cgi.parse_header(content_type_header)
  File "/usr/lib/python3.7/cgi.py", line 243, in parse_header
    parts = _parseparam(';' + line)
TypeError: can only concatenate str (not "NoneType") to str
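
The traceback suggests that `content_type_header` was `None` when it reached `cgi.parse_header`, which then tried `';' + line`. A minimal sketch of a guard follows, assuming the header value is either a string or `None`. The helper name `safe_content_type` is hypothetical and not part of quarchive, and the sketch splits the header by hand instead of calling `cgi`.

```python
from typing import Optional


def safe_content_type(content_type_header: Optional[str]) -> Optional[str]:
    """Return the lowercased media type, or None if the header is absent or empty.

    Hypothetical helper: it only guards against the missing-header case
    seen in the traceback and ignores any parameters such as charset.
    """
    if content_type_header is None:
        # This is the case that made ';' + line fail with a TypeError.
        return None
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type or None


# The same string concatenation that cgi.parse_header attempted:
try:
    ";" + None  # type: ignore[operator]
except TypeError as exc:
    error_message = str(exc)
```

With a guard like this, the caller can treat a `None` result as "no mimetype declared" and go on to infer one.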

calpaterson avatar Mar 09 '20 11:03 calpaterson

First step is to get the mimetype right as that is needed before parsing

  • [x] If no mimetype, infer
  • [x] If invalid mimetype, infer
  • [ ] If valid but wrong mimetype, try parsing, then infer and try one more time
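
The first two mimetype cases above could be sketched roughly as follows. This is an illustration under assumptions, not quarchive's actual code: the function names, the validity regex, the handful of sniffed signatures, and the `application/octet-stream` fallback are all hypothetical choices. The third case (a valid but wrong mimetype) is not covered, because it requires attempting a parse first.

```python
import mimetypes
import re
from typing import Optional

# Assumed notion of "valid": a loose type/subtype token check.
MIMETYPE_RE = re.compile(r"^[a-z0-9!#$&^_.+-]+/[a-z0-9!#$&^_.+-]+$")


def declared_mimetype(content_type_header: Optional[str]) -> Optional[str]:
    """Return the declared mimetype, or None if it is missing or invalid."""
    if not content_type_header:
        return None
    mimetype = content_type_header.split(";", 1)[0].strip().lower()
    return mimetype if MIMETYPE_RE.match(mimetype) else None


def infer_mimetype(url: str, body: bytes) -> str:
    """Guess a mimetype from the first bytes of the body, then from the URL."""
    head = body[:1024].lstrip().lower()
    if head.startswith(b"%pdf-"):
        return "application/pdf"
    if head.startswith((b"<!doctype html", b"<html")):
        return "text/html"
    guessed, _ = mimetypes.guess_type(url)
    return guessed or "application/octet-stream"


def resolve_mimetype(
    content_type_header: Optional[str], url: str, body: bytes
) -> str:
    """Use the declared mimetype when it is present and valid, else infer."""
    return declared_mimetype(content_type_header) or infer_mimetype(url, body)
```

Sniffing the body before consulting the URL is a design choice of this sketch: the body is what the indexer has to parse, so it is treated as the stronger signal.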

Next step is to get the charset right

  • [ ] If no charset and mimetype is html-family, try parsing, then infer and try one more time
  • [ ] If no charset and mimetype otherwise, infer
  • [ ] If valid but wrong charset, try parsing, then infer and try one more time
  • [ ] If invalid charset, infer
  • [ ] If valid, pass charset to indexer
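
A rough sketch of the simpler charset cases (no charset, invalid charset, valid charset) might look like this. All names are hypothetical, "valid" is assumed to mean "Python has a codec for it", and the inference is deliberately crude: a byte-order-mark check, then a trial UTF-8 decode, then a `latin-1` fallback. The "try parsing, then infer and try one more time" cases are left out, and UTF-32 byte order marks are not handled.

```python
import codecs
from typing import Optional


def declared_charset(content_type_header: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header, if any."""
    if not content_type_header:
        return None
    for param in content_type_header.split(";")[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset":
            return value.strip().strip('"') or None
    return None


def is_known_charset(charset: str) -> bool:
    """Assumed validity check: True if Python can find a codec by this name."""
    try:
        codecs.lookup(charset)
        return True
    except LookupError:
        return False


def infer_charset(body: bytes) -> str:
    """Crude inference: byte order mark, then trial UTF-8 decode, then latin-1."""
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if body.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        body.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"


def pick_charset(content_type_header: Optional[str], body: bytes) -> str:
    """Return the declared charset if it is valid, otherwise an inferred one."""
    charset = declared_charset(content_type_header)
    if charset is not None and is_known_charset(charset):
        return charset.lower()
    return infer_charset(body)
```

The result of `pick_charset` is what would be passed to the indexer in the last checklist item.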

calpaterson avatar Apr 26 '20 11:04 calpaterson