
Python scraper based on AI

Results 147 Scrapegraph-ai issues

Look at these sources: [link](https://colab.research.google.com/github/mistralai/mistral-common/blob/main/examples/tokenizer.ipynb) and the blog post [link](https://docs.mistral.ai/guides/tokenization/). For Hugging Face models: `from transformers import AutoTokenizer` `tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")` `text = "Write your text here"` `tokens = tokenizer.tokenize(text)` num_tokens...

**Describe the bug** Hi, I am trying to scrape a webpage using SmartScraperGraph, but I constantly get the following error: `'SmartScraperGraph' object has no attribute 'model_token'` **To Reproduce** This is the...

bug

**Describe the bug** When using gpt4o as the LLM and scraping a webpage to return a list of links, the returned paths are sometimes: - relative paths (OR) -...
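
One possible workaround for mixed relative and absolute paths is to normalize the returned links against the page URL after scraping. The sketch below uses only the standard library; `absolutize_links` and the example URLs are hypothetical names for illustration and are not part of Scrapegraph-ai.

```python
from urllib.parse import urljoin, urlparse


def absolutize_links(base_url, links):
    """Resolve each link against base_url (hypothetical helper).

    Absolute http(s) links pass through unchanged; relative ones are
    joined to the base. Links with other schemes are dropped.
    """
    resolved = []
    for link in links:
        absolute = urljoin(base_url, link.strip())
        if urlparse(absolute).scheme in ("http", "https"):
            resolved.append(absolute)
    return resolved


links = absolutize_links(
    "https://example.com/blog/",
    ["/about", "post-1", "https://other.example/x"],
)
```

Post-processing like this keeps the result deterministic regardless of which form the model happens to return.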

I ran into this problem when training a model in PyCharm, and I don't know how to solve it. I use Windows 11, and it seems that libomp140.x86_64.dll is...

**Describe the bug** When running the OpenAI Deep Scraper example located at `examples/openai/deep_scraper_openai.py`, I get the error: ``` Traceback (most recent call last): File "/Users/ajt/Projects/scrapegraph_playground/openai/deep_scraper_openai.py", line 37, in deep_scraper_graph =...

**Describe the bug** When doing some crawls, I get the following error: `Error code: 400 - {'error': {'message': "This model's maximum context length is 128000 tokens. However, your messages resulted...
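
A common way to stay under a context limit like this is to split the page text into chunks that each fit a token budget before sending them to the model. The following is a minimal sketch, not how Scrapegraph-ai itself chunks content: `split_by_token_budget` is a hypothetical helper, and the default whitespace word count is only a stand-in for the model's real tokenizer, which should be passed in as `count_tokens`.

```python
def split_by_token_budget(text, max_tokens, count_tokens=None):
    """Greedily pack whitespace-separated words into chunks of at most
    max_tokens tokens each (hypothetical helper).

    count_tokens is an assumed callable mapping a string to a token count;
    the default counts whitespace-separated words, which only approximates
    a real tokenizer. A single word that exceeds the budget on its own is
    still emitted as its own chunk.
    """
    if count_tokens is None:
        count_tokens = lambda s: len(s.split())

    chunks, current = [], []
    for word in text.split():
        candidate = current + [word]
        if current and count_tokens(" ".join(candidate)) > max_tokens:
            # Adding this word would exceed the budget: close the chunk.
            chunks.append(" ".join(current))
            current = [word]
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks


chunks = split_by_token_budget("a b c d e", max_tokens=2)
```

In practice the budget should sit well below the model limit to leave room for the prompt and the response.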

**Is your feature request related to a problem? Please describe.** We can assign a url to the `source`. It would be nice if we could also pass in an headers...

It would be interesting if support were added for firecrawl.ai. They also allow you to [self-host](https://github.com/mendableai/firecrawl/blob/main/SELF_HOST.md) their service. Firecrawl allows for cleaner crawling and handles PDFs as well as dynamic websites.

**Describe the bug** I have this error: "No HTML body content found, please try setting the 'headless' flag to False in the graph configuration. HTML content: Error: Page.goto: Timeout 30000ms...

With concurrent requests to googlesearch, I receive the following: ``` 642 def http_error_default(self, req, fp, code, msg, hdrs): --> 643 raise HTTPError(req.full_url, code, msg, hdrs, fp) HTTPError: HTTP Error 429: Too...
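
HTTP 429 responses usually mean the caller is being rate limited, and a typical mitigation is to retry with exponential backoff and jitter. The sketch below assumes the search call raises `urllib.error.HTTPError`, as the traceback suggests; `call_with_backoff` and its parameters are hypothetical and not part of googlesearch or Scrapegraph-ai.

```python
import random
import time
from urllib.error import HTTPError


def call_with_backoff(func, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on HTTP 429 with exponential backoff plus jitter.

    Hypothetical helper: other HTTP errors, and a 429 on the final attempt,
    are re-raised to the caller.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except HTTPError as err:
            if err.code != 429 or attempt == max_retries:
                raise
            # Delay grows as base_delay * 2^attempt, plus random jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

Wrapping each search call this way, and lowering the number of concurrent workers, may reduce how often the limit is hit; it does not remove the limit itself.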