
Empty JSON Output for prompt using LLM

Open · SumanthMeenan opened this issue on Jan 15 '25 · 12 comments

I want to extract all providers listed at this URL: https://www.aetna.com/dsepublic/#/contentPage?page=providerResults&parameters=searchText%3D'Primary%20Care%20Physician%20(PCP)';isGuidedSearch%3Dtrue&site_id=asa&language=en

I could use Selenium, BeautifulSoup, etc., but I came across this tool and I'm getting an empty array as output. Here is the log:

    Found providers ['openai', 'azure_openai'] for model gpt-3.5-turbo-0125, using openai. If it was not intended please specify the model provider in the graph configuration
    --- Executing Fetch Node ---
    --- (Fetching HTML from: https://www.aetna.com/dsepublic/#/contentPage?page=providerResults&parameters=searchText%3D'Primary%20Care%20Physician%20(PCP)';isGuidedSearch%3Dtrue&site_id=asa&language=en) ---
    --- Executing ParseNode Node ---
    Error occurred: not enough values to unpack (expected 2, got 0)

SumanthMeenan · Jan 15 '25

Hello @SumanthMeenan! I'm here to help you with any bugs, questions, or contributions you may have. Let's work together to solve this issue.

The error "not enough values to unpack (expected 2, got 0)" typically occurs when the code is trying to unpack a sequence (like a tuple or list) into variables, but the sequence doesn't contain enough elements. In the context of Scrapegraph-ai, this might happen during the parsing stage if the expected data structure isn't being returned from the Fetch Node.

Here are a few steps you can take to troubleshoot and resolve this issue:

  1. Check the Fetch Node Output: Ensure that the Fetch Node is correctly retrieving the HTML content. Since the URL involves dynamic content, make sure that the Fetch Node is configured to handle JavaScript-rendered content, possibly using a headless browser like ChromiumLoader.

  2. Verify the ParseNode Configuration: Ensure that the ParseNode is correctly configured to handle the structure of the HTML content. If the HTML structure doesn't match the expected format, it could lead to unpacking errors.

  3. Model Provider Configuration: The message suggests specifying the model provider in the graph configuration. Ensure that the correct model provider is set in your configuration to avoid any issues related to model selection (a configuration sketch follows below).

  4. Debugging: Add logging or print statements to check the intermediate outputs between nodes. This can help identify where the data structure might not be as expected.

If these steps don't resolve the issue, you might need to inspect the specific implementation details of the Fetch and Parse nodes to ensure they align with the webpage's content structure.
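For points 1 and 3, a minimal configuration sketch might look like the following. The keys mirror common scrapegraphai examples, but treat the model name and flags as assumptions to adapt to your own setup:

    from scrapegraphai.graphs import SmartScraperGraph

    graph_config = {
        "llm": {
            "api_key": "YOUR_OPENAI_API_KEY",  # placeholder
            # Prefixing the provider explicitly avoids the "please specify
            # the model provider" warning shown in the log above.
            "model": "openai/gpt-3.5-turbo-0125",
        },
        "headless": True,   # render JavaScript-heavy pages in a headless browser
        "verbose": True,    # print intermediate node output for debugging
    }

    graph = SmartScraperGraph(
        prompt="List all providers shown on the page",
        source="https://www.aetna.com/dsepublic/#/contentPage?page=providerResults&parameters=searchText%3D'Primary%20Care%20Physician%20(PCP)';isGuidedSearch%3Dtrue&site_id=asa&language=en",
        config=graph_config,
    )
    print(graph.run())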


dosubot[bot] · Jan 15 '25

Can you try fetching the page content yourself and check what it returns?

VinciGit00 · Jan 15 '25

This issue happens often, I would say 50% of the time. I'm using SearchGraph with 20 results from SearXNG. The whole pipeline crashes, which is completely impractical.

A simple guard condition could exclude the offending document, or a retry mechanism could be implemented for empty page_content. Something like the sketch below would do it.
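For example (a sketch; docs stands for the list of fetched documents inside the pipeline, not an actual Scrapegraph-ai variable):

    # Drop documents whose page_content is empty or whitespace-only before
    # they reach the chunking step that crashes.
    docs = [
        doc for doc in docs
        if doc.page_content and not doc.page_content.isspace()
    ]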

clemlesne · Jan 17 '25

This happens often, and at random, for me as well. This should be a recoverable error.
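Until it is recoverable upstream, a caller-side retry works as a stopgap (a sketch; search_graph is assumed to be an already-constructed SearchGraph):

    # Retry the whole run a few times, since the empty-page failure is intermittent.
    result = None
    for attempt in range(3):
        try:
            result = search_graph.run()
            break
        except ValueError as exc:
            # Only swallow the specific unpacking error; re-raise anything else.
            if "not enough values to unpack" not in str(exc):
                raise
            print(f"Empty page content on attempt {attempt + 1}; retrying...")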

Kilowhisky · Jan 18 '25

Which model are you using?

VinciGit00 · Jan 18 '25

I'm on GPT-4o-mini with duckduckgo and scrape_do (but I've also seen it on local Chrome), and I've seen it with other combinations as well.

Here's the output of the error:

"errorMessage": "Traceback (most recent call last):
  File \"/var/task/index.py\", line 128, in scrape
    result = smart_scraper_graph.run()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/graphs/search_graph.py\", line 110, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/graphs/base_graph.py\", line 358, in execute
    return self._execute_standard(initial_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/graphs/base_graph.py\", line 303, in _execute_standard
    raise e
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/graphs/base_graph.py\", line 276, in _execute_standard
    result, node_exec_time, cb_data = self._execute_node(
                                      ^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/graphs/base_graph.py\", line 200, in _execute_node
    result = current_node.execute(state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/nodes/graph_iterator_node.py\", line 77, in execute
    state = asyncio.run(self._async_execute(state, batchsize))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/lib/python3.12/asyncio/runners.py\", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File \"/usr/lib/python3.12/asyncio/runners.py\", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/lib/python3.12/asyncio/base_events.py\", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/nodes/graph_iterator_node.py\", line 141, in _async_execute
    answers = await tqdm.gather(
              ^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/tqdm/asyncio.py\", line 79, in gather
    res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
           ^^^^^^^
  File \"/usr/lib/python3.12/asyncio/tasks.py\", line 631, in _wait_for_one
    return f.result()  # May raise f.exception().
           ^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/tqdm/asyncio.py\", line 76, in wrap_awaitable
    return i, await f
              ^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/nodes/graph_iterator_node.py\", line 131, in _async_run
    return await asyncio.to_thread(graph.run)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/lib/python3.12/asyncio/threads.py\", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/lib/python3.12/concurrent/futures/thread.py\", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/graphs/smart_scraper_graph.py\", line 292, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/graphs/base_graph.py\", line 358, in execute
    return self._execute_standard(initial_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/graphs/base_graph.py\", line 303, in _execute_standard
    raise e
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/graphs/base_graph.py\", line 276, in _execute_standard
    result, node_exec_time, cb_data = self._execute_node(
                                      ^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/graphs/base_graph.py\", line 200, in _execute_node
    result = current_node.execute(state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/nodes/parse_node.py\", line 96, in execute
    chunks = split_text_into_chunks(
             ^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/scrapegraphai/utils/split_text_into_chunks.py\", line 30, in split_text_into_chunks
    chunks = chunk(
             ^^^^^^
  File \"/usr/local/lib/python3.12/dist-packages/semchunk/semchunk.py\", line 186, in chunk
    chunks, offsets = zip(*[(chunk, offset) for chunk, offset in zip(chunks, offsets) if chunk and not chunk.isspace()]) if chunks else ([], []) # NOTE `if chunks else ([], [])` ensures that we don't unpack an empty list if there's no chunks (i.e., if the provided text was empty).
    ^^^^^^^^^^^^^^^
ValueError: not enough values to unpack (expected 2, got 0)
"

Kilowhisky · Jan 18 '25

I am using SearchGraph and coming across the same error when using 'gpt-4o-mini'. I do not get the error when using 'gpt-4o'; however, gpt-4o is far too expensive for most people.

Jdh235 · Jan 25 '25

Hey everyone, sincerest apologies about this. I've just fixed this in v3.0.4 of semchunk (please run pip install semchunk --upgrade to get the new version; you'll also pick up some minor performance enhancements from v3.0.3), and I've ensured there is now a unit test for this edge case.

The problem was that I was filtering out empty chunks (including chunks consisting entirely of whitespace) and then trying to unpack the filtered chunks, which doesn't work if there's nothing left to unpack.
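The failing pattern from the traceback above, in miniature (a sketch, not the actual semchunk source):

    # `chunks` is non-empty, so a trailing `if chunks` guard passes, yet every
    # chunk is whitespace and gets filtered out, leaving nothing to unpack.
    chunks = [" ", "\n"]
    offsets = [(0, 1), (1, 2)]
    pairs = [(c, o) for c, o in zip(chunks, offsets) if c and not c.isspace()]
    chunks, offsets = zip(*pairs)  # ValueError: not enough values to unpack (expected 2, got 0)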

PS: If you ever run into other issues with semchunk or other libraries, it is often better to raise them on the source library's tracker to make its maintainers aware. Thank you @Jdh235 for doing exactly that!

umarbutler · Feb 13 '25

I had the same error as mentioned. After the upgrade, the error does not appear anymore, but instead the returned content is hallucinated and has nothing to do with the web page I am trying to scrape.

dschuld · Feb 15 '25

> I had the same error as mentioned. After the upgrade, the error does not appear anymore, but instead the returned content is hallucinated and has nothing to do with the web page I am trying to scrape.

This sounds like a SearchGraph bug. Presumably, if no input is being provided to semchunk, then you're going to get no output, and if you get no output, then I guess it makes sense that the model might start to hallucinate.

umarbutler · Feb 16 '25

@umarbutler what do you suggest as a solution?

VinciGit00 · Feb 16 '25

> @umarbutler what do you suggest as a solution?

To be honest, I'm not familiar with Scrapegraph-ai myself, so I can't help out here.

This is a better question for the devs. @DPende @LorenzoPaleari

umarbutler · Feb 16 '25

Hi, @SumanthMeenan. I'm Dosu, and I'm helping the Scrapegraph-ai team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • The issue involves extracting provider information using LLM models, resulting in an empty JSON array and an unpacking error.
  • Troubleshooting steps were provided, including checking Fetch Node output and ParseNode configuration.
  • Similar issues were reported by other users, suggesting a frequent error and the need for a retry mechanism.
  • A fix was released in semchunk v3.0.4, resolving the unpacking error.
  • Post-upgrade, unrelated content was returned, indicating a potential SearchGraph bug.

Next Steps:

  • Please confirm if this issue is still relevant to the latest version of the Scrapegraph-ai repository by commenting here.
  • If no updates are provided, the issue will be automatically closed in 7 days.

Thank you for your understanding and contribution!

dosubot[bot] · May 18 '25