Scrapegraph-ai
Bedrock Malformed input request: #/texts/0: expected maxLength: 2048, actual: 19882, please reformat your input and try again
Describe the bug
I followed the Bedrock example at https://github.com/VinciGit00/Scrapegraph-ai/blob/main/examples/bedrock/smart_scraper_bedrock.py. It was working at first, but after I replaced the URL from source="https://perinim.github.io/projects/" to source="https://www.seek.com.au/jobs?page=1&sortmode=ListedDate", I got the following error:
```
Traceback (most recent call last):
  File "c:\GitHub\job-scraper-poc\Test_Code\ai-scraper_bedrock_example.py", line 46, in
```
It looks like the current version of the Bedrock integration can't handle a website whose HTML produces a long context?
Did you change the embedder from the default cohere.embed-multilingual-v3 used in the example? Cohere has a 512-token context window, and ScrapeGraph should chunk the request accordingly by default. None of the embedders currently supported by ScrapeGraph for Bedrock has a context window of 2048 tokens, so I can't figure out what's being used, and neither can ScrapeGraph. If you didn't change it, then there's either something wrong with ScrapeGraph, or a new breaking change in the Bedrock API.
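To make the size mismatch concrete, here is a rough back-of-the-envelope check (my own illustration, not ScrapeGraph code), using the common ~4-characters-per-token heuristic:

```python
import requests

# Rough illustration: estimate how many tokens the page's raw HTML would take,
# assuming ~4 characters per token.
EMBEDDER_CONTEXT_TOKENS = 512

html = requests.get("https://www.seek.com.au/jobs?page=1&sortmode=ListedDate").text
estimated_tokens = len(html) // 4

print(f"~{estimated_tokens} estimated tokens vs. a {EMBEDDER_CONTEXT_TOKENS}-token window")
# Anything this large has to be chunked before being sent to the embedder;
# sending it whole is what produces the "expected maxLength: 2048" error.
```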
Hi @f-aguzzi, thanks for getting back to me.
Here is my code:

```python
"""
Basic example of scraping pipeline using SmartScraper
"""
import os
from dotenv import load_dotenv
from langchain_aws import BedrockEmbeddings
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info
import boto3

load_dotenv()

# ************************************************
# Define the configuration for the graph
# ************************************************

graph_config = {
    "llm": {
        "model": "bedrock/anthropic.claude-3-haiku-20240307-v1:0",
        "temperature": 0.0,
    },
    "embeddings": {
        "model": "bedrock/cohere.embed-english-v3"
    }
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the job names from the page.",
    # also accepts a string with the already downloaded HTML code
    # source="https://www.seek.com.au/jobs?page=1&sortmode=ListedDate",
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```
I used bedrock/cohere.embed-english-v3, which should behave similarly to cohere.embed-multilingual-v3.
Please note that in my code, if I use
source="https://perinim.github.io/projects/",
it works fine, but if I change it to
source="https://www.seek.com.au/jobs?page=1&sortmode=ListedDate",
I get the error I mentioned in the ticket.
Please refer to https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-embed.html. Each text is limited to 512 tokens, and each token is roughly 4 characters, so the maximum is about 2048 characters. If the scraper produces a long text string, it should be cut into several small chunks, each under 512 tokens (2048 characters), before they are sent to Bedrock. Please refer to the code example here: https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-embed.html#api-inference-examples-cohere-embed
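For illustration, a minimal sketch of that kind of chunking could look like this (the naive character-based split and the use of langchain_aws's BedrockEmbeddings are my own assumptions here, not ScrapeGraph's actual implementation):

```python
from langchain_aws import BedrockEmbeddings

MAX_CHARS = 2048  # ~512 tokens at ~4 characters per token


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Naive character-based split keeping each piece under Bedrock's per-text limit."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


embedder = BedrockEmbeddings(model_id="cohere.embed-english-v3")

long_text = "..."  # e.g. the scraped page text that currently triggers the error
chunks = chunk_text(long_text)
vectors = embedder.embed_documents(chunks)  # one embedding vector per chunk
```

With a split like this, every request stays under the 2048-character per-text limit that the error message complains about.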
I'll add this embedding model to the tokens dictionary, and it will be included in the next release. In the meantime, I'll post a temporary solution to this problem here in the comments in a few hours.
Wait, it's already in the tokens dictionary. This will need some proper debugging. To narrow down the problem, do you know whether the other Bedrock embedders work properly or not?
Thanks, hopefully you can fix it soon. Both bedrock/cohere.embed-english-v3 and cohere.embed-multilingual-v3 have the same issue. I didn't try other embedding models.
Hi, I figured out the error. I will fix it in the next few days.
I have the same error. Which embedding model should I use?
```python
from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings
from scrapegraphai.graphs import SmartScraperGraph

embedder_model_instance = HuggingFaceInferenceAPIEmbeddings(
    api_key="#######",
    model_name="sentence-transformers/all-MiniLM-l6-v2"
)

graph_config = {
    "llm": {
        "api_key": "#######",
        "model": "claude-3-haiku-20240307",
        "max_tokens": 4000
    },
    "embeddings": {
        "model_instance": embedder_model_instance
    }
}

smart_scraper_graph = SmartScraperGraph(
    prompt="data needed from each page is: (Title, subtitle [if any], content [article or text])",
    source="https://www.fm.gov.om/policy-ar/foreign-policy-ar/?lang=ar",
    config=graph_config,
    # schema=schema
)

result = smart_scraper_graph.run()
print(result)
```
Hi, please update to the new version.
@VinciGit00 I am using the Node.js library for this and getting the same error. Could you explain what was causing the issue, or point to the fix commit?