
[Bug]: Incomplete Extraction

Open Reasat opened this issue 11 months ago • 8 comments

crawl4ai version

0.4.247

Expected Behavior

Hello, thanks for this amazing software. I am trying to scrape some MCQ data from this page (https://web.livemcq.com/job-solution/bcs-question-bank/35th-bcs-question-solution-pdf/) using the LLMExtractionStrategy strategy. There are 200 questions on the page, but the response cuts out at around 152. I can't see any errors after execution. I am applying chunking. How should I go about debugging this? The code and output are attached below. Thanks in advance!

Note: For another page with 100 MCQs this code worked fine.
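For reference, a quick way to quantify the gap is to compare the number of question markers in the cleaned HTML against the number of items the LLM returned. This is only an illustrative sketch (count_gap is a hypothetical helper; result is the CrawlResult returned by arun):

import json

def count_gap(result):
    # "প্রশ্ন" is the question marker used on the page
    expected = result.cleaned_html.count("প্রশ্ন")
    # number of items the LLM actually returned
    extracted = len(json.loads(result.extracted_content))
    print(f"markers in HTML: {expected}, items extracted: {extracted}")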

Current Behavior

There are 200 questions on the page, but the response cuts out at around 150.

Is this reproducible?

Yes

Inputs Causing the Bug

https://web.livemcq.com/job-solution/bcs-question-bank/35th-bcs-question-solution-pdf/

Steps to Reproduce


Code snippets

import os
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
import json
from tqdm.auto import tqdm
# import litellm
# litellm.drop_params = True

class MySchema(BaseModel):
    question: str = Field(..., description="question")
    answer: str = Field(..., description="answer to the question")
    analytics: str = Field(..., description="analytics related to the question")

async def main(url):
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(
        word_count_threshold=1,
        extraction_strategy=LLMExtractionStrategy(
            # Here you can use any provider that Litellm library supports, for instance: ollama/qwen2
            provider="openai/gpt-4o-mini-2024-07-18",
            api_token=os.getenv('OPENAI_API_KEY'),
            schema=MySchema.model_json_schema(),
            extraction_type="schema",
            verbose=True,
            apply_chunking=True,
            # chunk_token_threshold=2048,
            # overlap_rate=0.1,
            input_format='cleaned_html',
            extra_args={"temperature": 0.0},
            instruction="""You are an assistant for extracting MCQ questions from web-crawled data.
            The questions can be in Bengali or in English.
            The crawled content has MCQ questions, each starting with a question denoted by প্রশ্ন and a correct answer denoted by সঠিক উত্তর, as well as analytics
            of what the exam participants scored, given by the "Live MCQ Analytics" field. From the crawled content,
            extract all mentioned questions (প্রশ্ন), answers (সঠিক উত্তর), and analytics (Live MCQ Analytics™).
            One example is given below.
            Given an input,
            <strong>প্রশ্ন ২০. ‘কবর’ নাটকটির লেখক-</strong><br> ক) জসীমউদ্‌দীন<br> খ) নজরুল ইসলাম<br> গ) মুনীর চৌধুরী<br> ঘ) দ্বিজেন্দ্রলাল রায়</p><p><strong>সঠিক উত্তর:</strong>&nbsp;গ) মুনীর চৌধুরী</p><p><strong>Live MCQ Analytics™:</strong>&nbsp;সঠিক উত্তরদাতা: 76%, ভুল উত্তরদাতা: 13%, উত্তর করেননি: 10%</p><p><strong>ব্যাখ্যা:</strong>&nbsp;<i>এই প্রশ্ন সহ কয়েক লাখ প্রশ্নের অথেনটিক ব্যাখ্যা দেখতে Live MCQ অ্যাপ ইনস্টল করুন।</i>
            the output should be,
            {"question": "প্রশ্ন ২০. ‘কবর’ নাটকটির লেখক-
            ক) জসীমউদ্‌দীন
            খ) নজরুল ইসলাম
            গ) মুনীর চৌধুরী
            ঘ) দ্বিজেন্দ্রলাল রায়",
            "answer": "গ) মুনীর চৌধুরী",
            "analytics": "সঠিক উত্তরদাতা: 76%, ভুল উত্তরদাতা: 13%, উত্তর করেননি: 10%"}."""
        ),            
        cache_mode=CacheMode.BYPASS,

    )
    
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            # url='https://openai.com/api/pricing/',
            url = url,
            config=run_config,
            scan_full_page=True,
            scroll_delay=0.2,
            remove_overlay_elements = True
        )
        # print(result.html)
        
        data = json.loads(result.extracted_content)
        flname = url.split('/')[-2]
        with open('{}.json'.format(flname), 'w', encoding='utf-8') as json_file:
            json.dump(data, json_file, ensure_ascii=False, indent=4)
        


if __name__ == "__main__":
    with open('target_urls.txt', 'r') as f:
        urls = [line.strip() for line in f.readlines()][:1]

    for url in tqdm(urls):
        asyncio.run(main(url))

OS

Linux

Python version

3.12.8

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

[INIT].... → Crawl4AI 0.4.247
[FETCH]... ↓ https://web.livemcq.com/job-solution/bcs-question-... | Status: True | Time: 2.56s
[SCRAPE].. ◆ Processed https://web.livemcq.com/job-solution/bcs-question-... | Time: 73ms
[LOG] Call LLM for https://web.livemcq.com/job-solution/bcs-question-bank/35th-bcs-question-solution-pdf/ - block index: 0
[LOG] Call LLM for https://web.livemcq.com/job-solution/bcs-question-bank/35th-bcs-question-solution-pdf/ - block index: 1
[LOG] Call LLM for https://web.livemcq.com/job-solution/bcs-question-bank/35th-bcs-question-solution-pdf/ - block index: 2
INFO:httpx:HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"
/home/reasat09/miniconda3/envs/crawl4ai/lib/python3.12/site-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:
* 'fields' has been removed
  warnings.warn(message, UserWarning)
18:37:01 - LiteLLM:INFO: utils.py:2820 - LiteLLM completion() model= gpt-4o-mini-2024-07-18; provider = openai
INFO:LiteLLM: LiteLLM completion() model= gpt-4o-mini-2024-07-18; provider = openai
18:37:01 - LiteLLM:INFO: utils.py:2820 - LiteLLM completion() model= gpt-4o-mini-2024-07-18; provider = openai
INFO:LiteLLM: LiteLLM completion() model= gpt-4o-mini-2024-07-18; provider = openai
18:37:01 - LiteLLM:INFO: utils.py:2820 - LiteLLM completion() model= gpt-4o-mini-2024-07-18; provider = openai
INFO:LiteLLM: LiteLLM completion() model= gpt-4o-mini-2024-07-18; provider = openai
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
18:37:04 - LiteLLM:INFO: utils.py:952 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[LOG] Extracted 1 blocks from URL: https://web.livemcq.com/job-solution/bcs-question-bank/35th-bcs-question-solution-pdf/ block index: 0
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
18:37:05 - LiteLLM:INFO: utils.py:952 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[LOG] Extracted 1 blocks from URL: https://web.livemcq.com/job-solution/bcs-question-bank/35th-bcs-question-solution-pdf/ block index: 2
^C
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
18:42:10 - LiteLLM:INFO: utils.py:952 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[LOG] Extracted 152 blocks from URL: https://web.livemcq.com/job-solution/bcs-question-bank/35th-bcs-question-solution-pdf/ block index: 1
[EXTRACT]. ■ Completed for https://web.livemcq.com/job-solution/bcs-question-... | Time: 310.29557266272604s
[COMPLETE] ● https://web.livemcq.com/job-solution/bcs-question-... | Status: True | Total: 312.93s

Reasat · Jan 21 '25 02:01

I came across this bug as well.

js_code = """
    (async function () {
        let scrollDuration = 60 * 1000;  // Total duration in milliseconds (e.g., 60 seconds)
        let scrollDelay = 4 * 1000;      // Delay between scrolls in milliseconds (e.g., 4 seconds)
        let scrollAmount = 500;          // Scroll by 500 pixels each time
        let startTime = Date.now();
        let lastScrollHeight = 0;

        while (Date.now() - startTime < scrollDuration) {
            // Scroll down by a fixed number of pixels
            window.scrollBy(0, scrollAmount);
            console.log('Scrolling down by', scrollAmount, 'pixels.');

            // Wait for lazy-loaded content to appear (delay time)
            await new Promise(resolve => setTimeout(resolve, scrollDelay));

            // Check if new content has loaded by comparing scroll height
            let currentScrollHeight = document.body.scrollHeight;

            // If the scroll height doesn't change, stop scrolling
            if (currentScrollHeight === lastScrollHeight) {
                console.log('No new content loaded; stopping.');
                break;
            }
            lastScrollHeight = currentScrollHeight;
        }

        console.log('Finished scrolling after ' + (Date.now() - startTime) / 1000 + ' seconds.');
    })();
    """

It still returned an incomplete result. Note: I'm using crawl4ai 0.3.76, since the new version with #457 hasn't been released yet.

nhaatj2804 · Jan 21 '25 10:01

@aravindkarnam This must be checked against the new version.

@Reasat Thanks for trying Crawl4ai; please upgrade and check this again.

unclecode · Jan 28 '25 15:01

Hello!

I'm encountering a similar issue, as shown in the following example:

Source link: https://www.medscape.com/viewarticle/food-advertised-nfl-games-loaded-salt-fat-calories-2025a10002cy

When evaluating the results, I noticed that parts of the content are missing. This seems to happen randomly.

LLM Strategy

return LLMExtractionStrategy(
    provider=provider_config["provider"],  # gpt-4o-mini
    api_token=provider_config["api_key"],
    schema=MainContentModel.model_json_schema(),
    extraction_type="schema",
    instruction=prompt,
    chunk_token_threshold=1400,
    overlap_rate=0.1,
    apply_chunking=True,
    input_format="html",
    extra_args={
        "temperature": provider_config["temperature"],
        "max_tokens": provider_config["max_tokens"]
    }
)

Model

class MainContentModel(BaseModel):
    page_type: str = Field(description="1 if webpage is about one specific item, 0 otherwise")
    title: str = Field(description="Title of the article")
    content: str = Field(description="Main article content in HTML format")
    published_at: str = Field(description="Published date in YYYY-MM-DD format")
    error: str = Field(description="1 if error occurred, 0 otherwise")

Prompt

prompt = """You are a web scraping assistant with extensive experience in extracting data from various types of webpages.
    Your expertise lies in accurately retrieving structured information while maintaining the integrity of the original content.
    Your task is to extract specific information from a given webpage. Please adhere to the following guidelines while performing the extraction:
    - page_type: Return '1' if the webpage is entirely about only one specific press release, event announcement, blog post or research article; otherwise stop the execution and return '0'.
    - title: Extract the title of the article exactly as it appears on the webpage. Ensure that it is free from any additional formatting or interpretation.
    - content: Extract the entire main article of the webpage in HTML format exactly as it appears. Maintain the original structure, including headings, paragraphs, lists, and any other HTML elements present. Format the data as HTML.
    - published_at: Extract the published date of the article in the format YYYY-MM-DD. Ensure that the date is accurately sourced from the webpage, and if multiple dates are present, select the most relevant one associated with the article.
    - error: Return '1' if there is any message or content that suggests an error occurred during the web scraping extraction, otherwise return '0'. Always use numbers 0 and 1.
"""

Sample of the missing data

[image]

matheus-rossi · Jan 30 '25 18:01

Upon further investigation, it seems there's an issue with the links in the content.

Example 2: [image]

Example 3: [image]

matheus-rossi · Jan 30 '25 19:01

@unclecode I am not sure the chunking code written here works as intended. If a document has a total token count greater than chunk_token_threshold, it does not chunk the document for me; the whole unchunked text goes to the model for processing. There should be an error or warning that tells the user this is happening, but instead the processing just runs (probably on a truncated text?).

I actually don't understand the logic behind the chunking code. Also, I don't understand why 'chunks' are named 'sections', or why the function is called 'merge' rather than 'chunk'.
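As a sketch of the missing warning (hypothetical code, not part of crawl4ai; warn_if_unchunked and word_token_rate are illustrative names), something like this could surface the problem:

import math
import warnings

def warn_if_unchunked(sections, text, chunk_token_threshold, word_token_rate=1.0):
    # Estimate the document's token count from its word count
    est_tokens = math.ceil(len(text.split()) * word_token_rate)
    # If the document exceeds the threshold but came back as a single
    # section, chunking silently did nothing
    if est_tokens > chunk_token_threshold and len(sections) <= 1:
        warnings.warn(
            f"apply_chunking is enabled and the document is ~{est_tokens} tokens "
            f"(threshold: {chunk_token_threshold}), but it was not split."
        )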

Here's my implementation of chunking, as a replacement for the existing function, which works for my target website.

import math

def _merge2(self, documents, chunk_token_threshold, overlap):
    """
    Merge documents into sections based on chunk_token_threshold and overlap.
    Each chunk should have a token count <= chunk_token_threshold, with overlap.
    """

    def generate_chunk_index(start, end, step, overlap):
        # Example: start=0, end=20, step=5, overlap=2 yields
        # [(0, 5), (3, 8), (6, 11), (9, 14), (12, 17), (15, 20)]
        ranges = []
        while start < end:
            next_start = min(start + step, end)
            ranges.append((start, next_start))
            start += step - overlap  # Move forward while maintaining overlap
        return ranges

    sections = []
    print('doc num', len(documents))
    for text in documents:
        words = text.split(' ')
        print('words num', len(words))
        # Estimate the token count from the word count
        tokens_num = math.ceil(len(words) * self.word_token_rate)
        print('tokens num', tokens_num)
        # Note: the ranges are computed in token space but applied as word
        # indices below, so chunk sizes are approximate.
        chunk_ranges = generate_chunk_index(0, tokens_num, chunk_token_threshold, overlap)
        print('chunks num', len(chunk_ranges))
        for i, j in chunk_ranges:
            words_chunk = words[i:j]
            text_chunk = ' '.join(words_chunk)
            sections.append(text_chunk)
        print('sections num', len(sections))
    return sections

Reasat · Feb 02 '25 19:02

@Reasat you are absolutely right, and just yesterday I noticed it. I have actually replaced it with another algorithm that is 12x faster; check the "next" branch, and this will be available in the next version.

unclecode · Feb 03 '25 00:02

@Reasat Regarding the naming and purpose of the function:

We start with a list of initial string chunks:

C = [C1, C2, ..., Cn]

Each chunk Ci contains a certain number of tokens. Our goal is to transform C into a new list where each chunk has approximately CTX tokens, ensuring efficient LLM processing.

Understanding CTX:

  • CTX is the target number of tokens per call to the language model.
  • It is user-defined and determines how input is batched.
  • Example: If the total token count is 1000 and CTX = 200, we will create 5 calls (1000 / 200).

Transformation Process:

  1. Initialize an empty chunk S and iterate over C.
  2. For each chunk Ci, check whether adding it keeps |S| <= CTX:

     if |S| + |Ci| <= CTX:
         S += Ci
     else:
         output S, start a new chunk with Ci

  3. Continue until all chunks are processed.

Result: A transformed list S = [S1, S2, ..., Sm] where each |Si| ≈ CTX, computed in O(n).
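In code, a minimal sketch of this greedy pass could look like the following (word counts stand in for real token counts; this is not the actual crawl4ai implementation):

def merge(chunks, ctx):
    sections, current, size = [], [], 0
    for chunk in chunks:
        n = len(chunk.split())  # stand-in for a proper token count
        # If adding this chunk would push the running section past CTX,
        # flush the section and start a new one
        if current and size + n > ctx:
            sections.append(' '.join(current))
            current, size = [], 0
        current.append(chunk)
        size += n
    if current:  # flush the final partial section
        sections.append(' '.join(current))
    return sections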

Why I chose "merge", tbh I don't know 🤷😄 perhaps late-night coding!

unclecode · Feb 03 '25 00:02

@unclecode thanks a lot for the description! I'll check out the implementation in the next branch.

Reasat · Feb 03 '25 02:02