
BedRock Malformed input request: #/texts/0: expected maxLength: 2048, actual: 19882, please reformat your input and try again

Open TomZhaoJobadder opened this issue 1 year ago • 6 comments

Describe the bug

I followed the Bedrock example at https://github.com/VinciGit00/Scrapegraph-ai/blob/main/examples/bedrock/smart_scraper_bedrock.py and it worked at first. Then, after I changed the URL from source="https://perinim.github.io/projects/" to source="https://www.seek.com.au/jobs?page=1&sortmode=ListedDate", I got the following error:

```
Traceback (most recent call last):
  File "c:\GitHub\job-scraper-poc\Test_Code\ai-scraper_bedrock_example.py", line 46, in <module>
    result = smart_scraper_graph.run()
  File "C:\GitHub\job-scraper-poc\.venv\Lib\site-packages\scrapegraphai\graphs\smart_scraper_graph.py", line 120, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
  File "C:\GitHub\job-scraper-poc\.venv\Lib\site-packages\scrapegraphai\graphs\base_graph.py", line 224, in execute
    return self._execute_standard(initial_state)
  File "C:\GitHub\job-scraper-poc\.venv\Lib\site-packages\scrapegraphai\graphs\base_graph.py", line 153, in _execute_standard
    raise e
  File "C:\GitHub\job-scraper-poc\.venv\Lib\site-packages\scrapegraphai\graphs\base_graph.py", line 140, in _execute_standard
    result = current_node.execute(state)
  File "C:\GitHub\job-scraper-poc\.venv\Lib\site-packages\scrapegraphai\nodes\rag_node.py", line 118, in execute
    index = FAISS.from_documents(chunked_docs, embeddings)
  File "C:\GitHub\job-scraper-poc\.venv\Lib\site-packages\langchain_core\vectorstores.py", line 550, in from_documents
    return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)
  File "C:\GitHub\job-scraper-poc\.venv\Lib\site-packages\langchain_community\vectorstores\faiss.py", line 930, in from_texts
    embeddings = embedding.embed_documents(texts)
  File "C:\GitHub\job-scraper-poc\.venv\Lib\site-packages\langchain_aws\embeddings\bedrock.py", line 169, in embed_documents
    response = self._embedding_func(text)
  File "C:\GitHub\job-scraper-poc\.venv\Lib\site-packages\langchain_aws\embeddings\bedrock.py", line 150, in _embedding_func
    raise ValueError(f"Error raised by inference endpoint: {e}")
ValueError: Error raised by inference endpoint: An error occurred (ValidationException) when calling the InvokeModel operation: Malformed input request: #/texts/0: expected maxLength: 2048, actual: 19882, please reformat your input and try again.
```


It looks like the current Bedrock integration can't handle a website with long HTML content?

TomZhaoJobadder avatar Jun 20 '24 11:06 TomZhaoJobadder

Did you change the embedder from the default cohere.embed-multilingual-v3 used in the example? Cohere has a 512-token context window, and ScrapeGraph should chunk the request accordingly by default. None of the embedders currently supported by ScrapeGraph for Bedrock has a context window of 2048 tokens, so I can't figure out what's being used, and neither can ScrapeGraph. If you didn't change it, then there's either something wrong with ScrapeGraph, or a new breaking change in the Bedrock API.
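The per-model lookup described above can be illustrated with a small sketch. This is not ScrapeGraph's actual internals — the dictionary and function names here are hypothetical, and only the Cohere 512-token figure comes from the discussion above; the Titan value is an assumption for illustration:

```python
# Hypothetical per-embedder context-window table, in the spirit of the
# "tokens dictionary" mentioned in this thread (names are illustrative,
# not ScrapeGraph's real internals).
EMBEDDER_CONTEXT_WINDOW = {
    "bedrock/cohere.embed-multilingual-v3": 512,
    "bedrock/cohere.embed-english-v3": 512,
    "bedrock/amazon.titan-embed-text-v1": 8192,  # assumed value
}

def max_chunk_tokens(model: str, default: int = 512) -> int:
    """Return the embedder's context window in tokens.

    Falls back to a conservative default when the model is missing from
    the table -- a missing or wrong entry is one way an oversized request
    like the 19882-character one above could slip through to Bedrock.
    """
    return EMBEDDER_CONTEXT_WINDOW.get(model, default)
```

If the lookup returns a window larger than the real model limit (or the request is never chunked at all), Bedrock rejects the call with exactly the `maxLength: 2048` validation error reported here.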

f-aguzzi avatar Jun 21 '24 12:06 f-aguzzi

Hi @f-aguzzi, thanks for getting back to me.

Here is my code:

```python
"""
Basic example of scraping pipeline using SmartScraper
"""
import os
from dotenv import load_dotenv
from langchain_aws import BedrockEmbeddings
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info
import boto3

load_dotenv()

# ************************************************
# Define the configuration for the graph
# ************************************************

graph_config = {
    "llm": {
        "model": "bedrock/anthropic.claude-3-haiku-20240307-v1:0",
        "temperature": 0.0,
    },
    "embeddings": {
        "model": "bedrock/cohere.embed-english-v3"
    }
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the job names from the page.",
    # also accepts a string with the already downloaded HTML code
    # source="https://www.seek.com.au/jobs?page=1&sortmode=ListedDate",
    source="https://perinim.github.io/projects/",
    config=graph_config
)
result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```

I used bedrock/cohere.embed-english-v3, which should behave like cohere.embed-multilingual-v3. Note that with source="https://perinim.github.io/projects/" my code works fine, but if I change it to source="https://www.seek.com.au/jobs?page=1&sortmode=ListedDate", I get the error described in this ticket.

Please refer to https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-embed.html. Each text is limited to 512 tokens, and a token is roughly 4 characters, so the maximum is 2048 characters per text. If the scraper produces a long text string, it should be cut into chunks of fewer than 512 tokens (2048 characters) each before being sent to Bedrock. See the code example at https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-embed.html#api-inference-examples-cohere-embed
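A minimal sketch of the pre-chunking described above, assuming the 2048-character per-text limit; `chunk_text` is an illustrative helper, not part of ScrapeGraph or the AWS SDK:

```python
def chunk_text(text: str, max_chars: int = 2048) -> list[str]:
    """Split `text` into chunks of at most `max_chars` characters,
    preferring to break at whitespace so words stay intact."""
    chunks = []
    while len(text) > max_chars:
        # Find the last space within the limit; hard-split if none exists.
        split_at = text.rfind(" ", 0, max_chars)
        if split_at <= 0:
            split_at = max_chars
        chunks.append(text[:split_at])
        text = text[split_at:].lstrip()
    if text:
        chunks.append(text)
    return chunks

# Stand-in for a long scraped page; each resulting chunk fits in one
# entry of the Cohere Embed "texts" array.
chunks = chunk_text("word " * 10_000)
```

Each chunk can then be passed as one element of the `texts` field in the Cohere Embed request body, which keeps every entry under the `maxLength: 2048` constraint that triggered the error above.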

TomZhaoJobadder avatar Jun 22 '24 12:06 TomZhaoJobadder

I'll add this embedding model to the tokens dictionary, and it will be included in the next release. In the meantime, I'll post a temporary workaround here in the comments in a few hours.

f-aguzzi avatar Jun 24 '24 17:06 f-aguzzi

Wait, it's already in the tokens dictionary. This will need some proper debugging. To narrow down the problem, do you know whether the other Bedrock embedders work properly or not?

f-aguzzi avatar Jun 24 '24 20:06 f-aguzzi

Thanks, hopefully you can fix it soon. Both bedrock/cohere.embed-english-v3 and cohere.embed-multilingual-v3 have the same issue. I didn't try other embedding models.

TomZhaoJobadder avatar Jun 27 '24 06:06 TomZhaoJobadder

Hi, I figured out the error; I will fix it in the next few days.

VinciGit00 avatar Jul 01 '24 20:07 VinciGit00

I have the same error. Which embedding model should I use?

```python
embedder_model_instance = HuggingFaceInferenceAPIEmbeddings(
    api_key="#######",
    model_name="sentence-transformers/all-MiniLM-l6-v2"
)

graph_config = {
    "llm": {
        "api_key": "#######",
        "model": "claude-3-haiku-20240307",
        "max_tokens": 4000
    },
    "embeddings": {
        "model_instance": embedder_model_instance
    }
}

smart_scraper_graph = SmartScraperGraph(
    prompt="data needed from each page is: (Title, subtitle [if any], content [article or text])",
    source="https://www.fm.gov.om/policy-ar/foreign-policy-ar/?lang=ar",
    config=graph_config,
    # schema=schema
)
result = smart_scraper_graph.run()
print(result)
```

mjid13 avatar Jul 03 '24 08:07 mjid13

Hi, please update to the new version.

VinciGit00 avatar Jul 16 '24 08:07 VinciGit00

@VinciGit00 I am using the Node.js library for this and am getting the same error. Could you explain what was causing the issue, or point to the fix commit?

gkirill avatar Dec 01 '24 13:12 gkirill