markdown-crawler
cannot access local variable 'main_content' where it is not associated with a value
I just ran into this issue when crawling https://docs.xendit.co/.
command:
markdown-crawler --max-depth 10 --num-threads 5 --base-dir ./xendit-docs --domain-match --base-path-match https://docs.xendit.co/
issue:
Exception in thread Thread-1 (worker):
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1052, in _bootstrap_inner
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 989, in run
    self._target(*self._args, **self._kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/markdown_crawler/__init__.py", line 255, in worker
    child_urls = crawl(
                 ^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/markdown_crawler/__init__.py", line 120, in crawl
    content = get_target_content(soup, target_content=target_content)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/markdown_crawler/__init__.py", line 184, in get_target_content
    content = str(main_content)
                  ^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'main_content' where it is not associated with a value
I got a similar error while scraping:
Exception in thread Thread-1 (worker):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/markdown_crawler/__init__.py", line 255, in worker
    child_urls = crawl(
  File "/usr/local/lib/python3.10/dist-packages/markdown_crawler/__init__.py", line 120, in crawl
    content = get_target_content(soup, target_content=target_content)
  File "/usr/local/lib/python3.10/dist-packages/markdown_crawler/__init__.py", line 184, in get_target_content
    content = str(main_content)
UnboundLocalError: local variable 'main_content' referenced before assignment
INFO:markdown_crawler:π All threads have finished
How to fix it
Changing get_target_content to this resolved this issue for me:
def get_target_content(
    soup: BeautifulSoup,
    target_content: Union[List[str], None] = None
) -> str:
    content = ''
    main_content = None  # Initialize main_content

    # -------------------------------------
    # Get target content by target selector
    # -------------------------------------
    if target_content:
        for target in target_content:
            for tag in soup.select(target):
                content += f'{str(tag)}'.replace('\n', '')

    # ---------------------------
    # Naive estimation of content
    # ---------------------------
    else:
        max_text_length = 0
        for tag in soup.find_all(DEFAULT_TARGET_CONTENT):
            text_length = len(tag.get_text())
            if text_length > max_text_length:
                max_text_length = text_length
                main_content = tag

        if main_content is not None:  # Only set content if main_content was found
            content = str(main_content)

    return content if len(content) > 0 else False
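For anyone wondering why the original raised in the first place, here is a minimal sketch of the same pattern with no BeautifulSoup involved. The function names longest_buggy and longest_fixed are made up for illustration and are not part of markdown_crawler. The point is that a variable assigned only inside a loop body never gets bound when the loop finds nothing, which is what seems to happen when the page has no usable tags.

```python
def longest_buggy(texts):
    # Same shape as the failing code: `best` is only bound inside the loop,
    # so an empty (or all-empty) input never assigns it.
    max_len = 0
    for text in texts:
        if len(text) > max_len:
            max_len = len(text)
            best = text
    return str(best)  # raises UnboundLocalError when nothing was selected


def longest_fixed(texts):
    # Same shape as the fix: bind the name up front and check it before use.
    best = None
    max_len = 0
    for text in texts:
        if len(text) > max_len:
            max_len = len(text)
            best = text
    return str(best) if best is not None else ''


if __name__ == '__main__':
    print(repr(longest_fixed([])))
    try:
        longest_buggy([])
    except UnboundLocalError as exc:
        print(f'buggy version failed: {exc}')
```

Both versions behave the same when there is something to pick; they only differ on input where no candidate is ever selected.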
When I debugged this, I found there was no content because the site was returning a 403. I had to modify the code to send a User-Agent header before the site started responding with 200.
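In case it helps, this is a rough sketch of the User-Agent idea using only the standard library, so it can be tried outside the crawler. It is not the actual patch to markdown_crawler: the helper names (build_request, fetch_status) and the header value are assumptions for illustration, and whether a given string gets past a particular site's 403 depends on that site.

```python
import urllib.error
import urllib.request

# Hypothetical value; the thread does not say which string was used.
DEFAULT_USER_AGENT = 'Mozilla/5.0 (compatible; markdown-crawler-example)'


def build_request(url: str, user_agent: str = DEFAULT_USER_AGENT) -> urllib.request.Request:
    # Attach an explicit User-Agent instead of the Python default one.
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def fetch_status(url: str, user_agent: str = DEFAULT_USER_AGENT, timeout: float = 10.0) -> int:
    # Return the HTTP status code, including error codes such as 403.
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


if __name__ == '__main__':
    # Requires network access; the result depends on the site.
    print(fetch_status('https://docs.xendit.co/'))
```

Checking the status code before parsing also makes it obvious when an empty page is really a blocked request rather than a page with no content.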