Scrapegraph-ai
Bedrock Malformed input request: #/texts/0: expected maxLength: 2048, actual: 19882, please reformat your input and try again
Describe the bug
I followed the Bedrock example at https://github.com/VinciGit00/Scrapegraph-ai/blob/main/examples/bedrock/smart_scraper_bedrock.py. It was working at first, but after I replaced the URL from source="https://perinim.github.io/projects/" to source="https://www.seek.com.au/jobs?page=1&sortmode=ListedDate", I got the following error:
```
Traceback (most recent call last):
  File "c:\GitHub\job-scraper-poc\Test_Code\ai-scraper_bedrock_example.py", line 46, in
```
It looks like the current version of the Bedrock integration can't handle a website whose HTML produces a long context?
Did you change the embedder from the default cohere.embed-multilingual-v3 used in the example? Cohere has a 512-token context window, and ScrapeGraph should chunk the request accordingly by default. None of the embedders currently supported by ScrapeGraph for Bedrock has a context window of 2048 tokens, so I can't figure out what's being used, and neither can ScrapeGraph. If you didn't change it, then there's either something wrong with ScrapeGraph, or a new breaking change in the Bedrock API.
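To make the size mismatch concrete, here is a rough back-of-the-envelope check (my own illustration, not ScrapeGraph code), using the common ~4-characters-per-token heuristic:

```python
import requests

# Rough illustration: estimate how many tokens the page's raw HTML would take,
# assuming ~4 characters per token.
EMBEDDER_CONTEXT_TOKENS = 512

html = requests.get("https://www.seek.com.au/jobs?page=1&sortmode=ListedDate").text
estimated_tokens = len(html) // 4

print(f"~{estimated_tokens} estimated tokens vs. a {EMBEDDER_CONTEXT_TOKENS}-token window")
# Anything this large has to be chunked before being sent to the embedder;
# sending it whole is what produces the "expected maxLength: 2048" error.
```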
Hi @f-aguzzi, thanks for getting back to me.
Here is my code:

```python
"""
Basic example of scraping pipeline using SmartScraper
"""
import os
from dotenv import load_dotenv
from langchain_aws import BedrockEmbeddings
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info
import boto3

load_dotenv()

# ************************************************
# Define the configuration for the graph
# ************************************************

graph_config = {
    "llm": {
        "model": "bedrock/anthropic.claude-3-haiku-20240307-v1:0",
        "temperature": 0.0,
    },
    "embeddings": {
        "model": "bedrock/cohere.embed-english-v3"
    }
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the job names from the page.",
    # also accepts a string with the already downloaded HTML code
    # source="https://www.seek.com.au/jobs?page=1&sortmode=ListedDate",
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```
I used bedrock/cohere.embed-english-v3, which should behave similarly to cohere.embed-multilingual-v3.
Please note that in my code, if I use
source="https://perinim.github.io/projects/",
it works fine, but if I change it to
source="https://www.seek.com.au/jobs?page=1&sortmode=ListedDate",
I get the error I mentioned in the ticket.
Please refer to https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-embed.html. Each text is limited to 512 tokens, and each token is roughly 4 characters, so the maximum is about 2048 characters. If the scraper produces a long text string, it should be cut into several small chunks, each under 512 tokens (2048 characters), before they are sent to Bedrock. Please refer to the code example here: https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-embed.html#api-inference-examples-cohere-embed
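For illustration, a minimal sketch of that kind of chunking could look like this (the naive character-based split and the use of langchain_aws's BedrockEmbeddings are my own assumptions here, not ScrapeGraph's actual implementation):

```python
from langchain_aws import BedrockEmbeddings

MAX_CHARS = 2048  # ~512 tokens at ~4 characters per token


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Naive character-based split keeping each piece under Bedrock's per-text limit."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


embedder = BedrockEmbeddings(model_id="cohere.embed-english-v3")

long_text = "..."  # e.g. the scraped page text that currently triggers the error
chunks = chunk_text(long_text)
vectors = embedder.embed_documents(chunks)  # one embedding vector per chunk
```

With a split like this, every request stays under the 2048-character per-text limit that the error message complains about.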
I'll add this embedding model to the tokens dictionary, and it will be included in the next release. In the meantime, I'll post a temporary solution to this problem here in the comments in a few hours.
Wait, it's already in the tokens dictionary. This will need some proper debugging. To narrow down the problem, do you know whether the other Bedrock embedders work properly or not?
Thanks, hopefully you can fix it soon. Both bedrock/cohere.embed-english-v3 and cohere.embed-multilingual-v3 have the same issue. I didn't try other embedding models.
Hi, I figured out the error. I will fix it in the next few days.
I have the same error. Which embedding model should I use?
```python
from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings
from scrapegraphai.graphs import SmartScraperGraph

embedder_model_instance = HuggingFaceInferenceAPIEmbeddings(
    api_key="#######",
    model_name="sentence-transformers/all-MiniLM-l6-v2"
)

graph_config = {
    "llm": {
        "api_key": "#######",
        "model": "claude-3-haiku-20240307",
        "max_tokens": 4000
    },
    "embeddings": {
        "model_instance": embedder_model_instance
    }
}

smart_scraper_graph = SmartScraperGraph(
    prompt="data needed from each page is: (Title, subtitle [if any], content [article or text])",
    source="https://www.fm.gov.om/policy-ar/foreign-policy-ar/?lang=ar",
    config=graph_config,
    # schema=schema
)

result = smart_scraper_graph.run()
print(result)
```
Hi, please update to the new version.
@VinciGit00 I am using the Node.js library for this and getting the same error. Could you explain what was causing the issue, or point to the fix commit?