ArchiveBot
Rare crash triggered by an invalid ignore pattern: UnboundLocalError: local variable 'compiledPattern' referenced before assignment
Job 2mt13kxolzln2i6awfxyprnud crashed with this traceback:
Pattern ^https?://www\.pinterest.\com/.*\.js$ is invalid (error: bad escape \c at position 25). Ignored.
ERROR Fatal exception.
Traceback (most recent call last):
File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/application/app.py", line 157, in run
yield from pipeline.process()
File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/pipeline/pipeline.py", line 194, in process
yield from self._process_one_worker()
File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/pipeline/pipeline.py", line 215, in _process_one_worker
task.result()
File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/pipeline/pipeline.py", line 119, in process
item = yield from self.process_one(_worker_id=worker_id)
File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/pipeline/pipeline.py", line 103, in process_one
yield from task.process(item)
File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/application/tasks/download.py", line 492, in process
yield from session.app_session.factory['Processor'].process(session)
File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/processor/delegate.py", line 29, in process
return (yield from processor.process(item_session))
File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/processor/web.py", line 92, in process
return (yield from session.process())
File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/processor/web.py", line 174, in process
ok = yield from self._process_robots()
File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/processor/web.py", line 201, in _process_robots
request))
File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/processor/web.py", line 367, in _should_fetch_reason_with_robots
self._fetch_rule.check_initial_web_request(self._item_session, request)
File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/processor/rule.py", line 179, in check_initial_web_request
item_session, verdict, reason, test_info
File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/processor/rule.py", line 130, in consult_hook
PluginFunctions.accept_url, item_session, verdict, reasons,
File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/application/hook.py", line 81, in call
return self._callbacks[name](*args, **kwargs)
File "/home/archivebot/.pyenv/versions/3.6.15/envs/archivebot-20241016/lib/python3.6/site-packages/wpull/application/plugin.py", line 49, in wrapper
return func(*args, **kwargs)
File "archive_bot_plugin.py", line 227, in accept_url
pattern = self.settings.ignore_url(item_session.url_record)
File "/home/archivebot/ArchiveBot-c/pipeline/archivebot/wpull/settings.py", line 50, in ignore_url
return self.ignoracle.ignores(record_info)
File "/home/archivebot/ArchiveBot-c/pipeline/archivebot/wpull/ignoracle.py", line 110, in ignores
self._compiled.append((pattern, compiledPattern))
UnboundLocalError: local variable 'compiledPattern' referenced before assignment
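The crash only happens when the invalid ignore pattern is the first one yielded by the pattern set iterator, because compiledPattern has not been assigned yet at that point. In any other position, compiledPattern still holds the regex compiled in the previous iteration, so that previous pattern is appended a second time (which causes no harm apart from a very minor performance impact).
The fix is for the exception handler to continue to the next pattern after logging, so that the append is skipped for invalid patterns.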
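
The behaviour described above can be reproduced in isolation. The following is only a minimal sketch and not the actual code in ignoracle.py: the function names compile_patterns_buggy and compile_patterns_fixed, the use of print for the warning, and the example.com pattern are assumptions made for illustration. Only the shape of the loop (compile inside try, append after it) is inferred from the traceback.

```python
import re


def compile_patterns_buggy(patterns):
    """Hypothetical reduction of the bug: the handler logs but falls through."""
    compiled = []
    for pattern in patterns:
        try:
            compiled_pattern = re.compile(pattern)
        except re.error as error:
            print(f"Pattern {pattern} is invalid (error: {error}). Ignored.")
            # No `continue` here, so execution reaches the append below.
        # If the first pattern is invalid, compiled_pattern is unbound here.
        # Otherwise it still holds the regex from the previous iteration.
        compiled.append((pattern, compiled_pattern))
    return compiled


def compile_patterns_fixed(patterns):
    """Same loop, with the handler skipping the invalid pattern."""
    compiled = []
    for pattern in patterns:
        try:
            compiled_pattern = re.compile(pattern)
        except re.error as error:
            print(f"Pattern {pattern} is invalid (error: {error}). Ignored.")
            continue  # skip the append for this pattern
        compiled.append((pattern, compiled_pattern))
    return compiled


# The invalid pattern from the job log, plus a made-up valid one.
BAD = r"^https?://www\.pinterest.\com/.*\.js$"
GOOD = r"^https?://example\.com/"
```

With these definitions, compile_patterns_buggy([BAD]) raises UnboundLocalError, compile_patterns_buggy([GOOD, BAD]) returns two entries that share the same compiled regex, and compile_patterns_fixed([GOOD, BAD]) returns only the entry for the valid pattern.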