markdown-crawler cannot access local variable 'main_content' where it is not associated with a value

I just ran into this issue when crawling https://docs.xendit.co/.

command:

markdown-crawler --max-depth 10 --num-threads 5 --base-dir ./xendit-docs --domain-match --base-path-match https://docs.xendit.co/

issue:

Exception in thread Thread-1 (worker):
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1052, in _bootstrap_inner
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 989, in run
    self._target(*self._args, **self._kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/markdown_crawler/__init__.py", line 255, in worker
    child_urls = crawl(
                 ^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/markdown_crawler/__init__.py", line 120, in crawl
    content = get_target_content(soup, target_content=target_content)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/markdown_crawler/__init__.py", line 184, in get_target_content
    content = str(main_content)
                  ^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'main_content' where it is not associated with a value

Jun 16 '24 23:06 assumednormal

I got a similar error while scraping:

Exception in thread Thread-1 (worker):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/markdown_crawler/__init__.py", line 255, in worker
    child_urls = crawl(
  File "/usr/local/lib/python3.10/dist-packages/markdown_crawler/__init__.py", line 120, in crawl
    content = get_target_content(soup, target_content=target_content)
  File "/usr/local/lib/python3.10/dist-packages/markdown_crawler/__init__.py", line 184, in get_target_content
    content = str(main_content)
UnboundLocalError: local variable 'main_content' referenced before assignment
INFO:markdown_crawler:🏁 All threads have finished

How can I fix it?

Jan 07 '25 11:01 Aman-Singh-Kushwaha

Changing get_target_content to the following resolved the issue for me:

def get_target_content(
    soup: BeautifulSoup,
    target_content: Union[List[str], None] = None
) -> str:
    content = ''
    main_content = None  # Initialize main_content

    # -------------------------------------
    # Get target content by target selector
    # -------------------------------------
    if target_content:
        for target in target_content:
            for tag in soup.select(target):
                content += f'{str(tag)}'.replace('\n', '')

    # ---------------------------
    # Naive estimation of content
    # ---------------------------
    else:
        max_text_length = 0
        for tag in soup.find_all(DEFAULT_TARGET_CONTENT):
            text_length = len(tag.get_text())
            if text_length > max_text_length:
                max_text_length = text_length
                main_content = tag

        if main_content is not None:  # Only set content if main_content was found
            content = str(main_content)

    return content if len(content) > 0 else False
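
For anyone wondering why the error happens at all: as far as I can tell from the traceback, main_content is only assigned inside the loop, so when no tag has any text the name is never bound. Here is a minimal sketch of that pattern without BeautifulSoup. The function names and the list-of-strings input are made up for illustration and are not part of markdown-crawler.

```python
def pick_longest_unsafe(texts):
    """Mirrors the failing pattern: 'longest' is only bound inside the loop."""
    max_len = 0
    for text in texts:
        if len(text) > max_len:
            max_len = len(text)
            longest = text
    # If no text was longer than 0 characters, 'longest' was never assigned
    # and this line raises UnboundLocalError.
    return str(longest)


def pick_longest_safe(texts):
    """Same logic, but 'longest' is initialized so the empty case is handled."""
    max_len = 0
    longest = None  # bound up front, like main_content = None in the patch
    for text in texts:
        if len(text) > max_len:
            max_len = len(text)
            longest = text
    return str(longest) if longest is not None else ''


if __name__ == "__main__":
    try:
        pick_longest_unsafe([])
    except UnboundLocalError as exc:
        print(f"unsafe version failed: {exc}")
    print(repr(pick_longest_safe([])))
```

The safe version returns an empty string when nothing matches, which is the case the patched function turns into a False return value.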

Jan 07 '25 13:01 gabelul

When I debugged this, I found there was no content because the site was returning a 403. I had to modify the code to send a User-Agent header before the site started responding with 200.
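
I have not pasted my actual change, so treat this as a rough stand-alone sketch of the idea rather than a patch for markdown-crawler. It uses only the standard library. The User-Agent string and the helper names build_request and fetch_html are placeholders I made up, and whether a given site accepts a particular User-Agent is something you would need to check yourself.

```python
import urllib.request

# Hypothetical User-Agent value; use whatever the target site accepts.
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; docs-crawler-example/0.1)"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Build a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent=DEFAULT_USER_AGENT, timeout=10.0):
    """Fetch a page and return its body as text.

    urlopen raises urllib.error.HTTPError for responses such as 403,
    so a blocked request fails loudly instead of yielding empty content.
    """
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html on the URL you are crawling should either return the HTML or raise an HTTPError whose code attribute tells you the status, which makes a 403 easy to tell apart from a page that really has no content.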

Jan 11 '25 18:01 Tiberriver256