ScriptCreatorGraph example does not work

Open epage480 opened this issue 1 year ago • 9 comments

Description The ScriptCreatorGraph will fail to produce a useful script under any circumstances.

Steps to reproduce the behavior:

  1. Run examples/openai/script_generator_openai.py
  2. Examine and attempt to run the produced script

Expected behavior The produced script should at least attempt to pull information from existing web elements

Additional context At a glance you might assume that the LLM has hallucinated and failed to produce a useful script, but this isn't the case. The root of the problem is that the LLM is only shown the parsed text of the webpage, not the actual HTML, which makes it impossible to create a useful scraping script. ParseNode is the culprit here: by running the documents through Html2TextTransformer().transform_documents(), it removes all HTML syntax.
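To illustrate why text-only input is a dead end for script generation, here is a minimal sketch using only the standard library. It is a stand-in for the HTML-to-text step, not the actual Html2TextTransformer code; the `TextOnly` class, `to_text` helper, and the sample HTML are made up for this example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes only; tags and attributes are discarded."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep non-empty text; tag names and class attributes never reach here.
        if data.strip():
            self.chunks.append(data.strip())


def to_text(html: str) -> str:
    """Hypothetical helper: reduce an HTML string to its visible text."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up sample page fragment.
html = (
    '<div class="project">'
    '<h2 class="title">Rotary Pendulum</h2>'
    '<p class="desc">RL control</p>'
    "</div>"
)
text = to_text(html)
print(text)
```

The printed text still contains the page content, but every tag name and class attribute is gone, so a model that only sees this text has nothing to build a selector from.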

So far I've made adjustments to ParseNode to allow for the raw html to pass through it but this results in a snowball of other errors I'm still unraveling.

epage480 avatar May 08 '24 01:05 epage480

Agreed, this node seems quite experimental, came here after experiencing the exact same thing. What I also realized is that it was unnecessarily hard to inspect data through the steps for debugging.

Love the initiative of the project and please don't be discouraged!

caj-larsson avatar May 08 '24 09:05 caj-larsson

I got some success by

  1. Bypassing the parse node in the graph
  2. changing the template to also include the context
  3. removing the chunking in the rag node by simply setting the chunked_docs = doc instead, this effectively disables the rag node tho.
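A rough sketch of what step 2 could look like, assuming a plain string template. The template text, the `build_prompt` helper, and the placeholder names are hypothetical and are not taken from the library's actual prompt.

```python
# Hypothetical prompt template: the raw HTML is passed in as {context}
# so the model can see real tags, classes, and ids.
TEMPLATE = """Write a Python scraping script using {library}.
Base every selector on the HTML below; do not invent elements.

HTML:
{context}

Question: {question}
"""


def build_prompt(raw_html: str, question: str, library: str = "beautifulsoup") -> str:
    """Hypothetical helper: fill the template with the unparsed page source."""
    return TEMPLATE.format(library=library, context=raw_html, question=question)


prompt = build_prompt('<a class="link" href="/p/1">Project 1</a>', "List all projects")
print(prompt)
```

Passing the whole page this way only works while the HTML fits in the model's context window, which is the trade-off of skipping chunking in step 3.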

The result I got with llama 3 7B was not usable, however, as it gets the location of the link node wrong after making too deep a selection for the items.

caj-larsson avatar May 08 '24 11:05 caj-larsson

Man, please make a PR with the improvements! Update the graph, and if it works we will add it to our baseline

VinciGit00 avatar May 08 '24 12:05 VinciGit00

Your contribution could help grow the library!

VinciGit00 avatar May 08 '24 12:05 VinciGit00

Working on it

epage480 avatar May 08 '24 12:05 epage480

The thing is, this is a deeper problem than "just making it work": the approach here is more like document extraction. For this to work to a usable degree, I suspect we need to develop a loop that executes the script.
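A minimal sketch of such a loop, assuming the generator is any callable that takes the previous error (or None) and returns script source. The `refine_script` name and the `generate` callable are hypothetical; nothing here is existing library code.

```python
import os
import subprocess
import sys
import tempfile
from typing import Callable, Optional


def refine_script(
    generate: Callable[[Optional[str]], str],
    max_attempts: int = 3,
    timeout: int = 60,
) -> Optional[str]:
    """Generate a script, run it, and feed failures back until one succeeds."""
    feedback = None
    for _ in range(max_attempts):
        script = generate(feedback)  # e.g. an LLM call; feedback is None on the first try
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(script)
            path = f.name
        try:
            result = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            feedback = "script timed out"
            continue
        finally:
            os.unlink(path)  # always remove the temporary script file
        # Treat "exited cleanly and printed something" as success.
        if result.returncode == 0 and result.stdout.strip():
            return script
        feedback = result.stderr or "script produced no output"
    return None
```

The success check here is deliberately crude. Note that this runs generated code directly on the host, so a real version would want some form of sandboxing.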

Regardless of what we need to make it work, it's hard to determine if something is better if there is no evaluation score.

caj-larsson avatar May 08 '24 12:05 caj-larsson

look at this node https://github.com/VinciGit00/Scrapegraph-ai/blob/pre/beta/scrapegraphai/nodes/graph_iterator_node.py for iterating

VinciGit00 avatar May 08 '24 13:05 VinciGit00

Submitted a pull request, it's not perfect but it will at least get the example and other small websites working.

epage480 avatar May 11 '24 00:05 epage480

hey, please try the new beta

VinciGit00 avatar May 12 '24 16:05 VinciGit00