[Bug]: Incomplete Extraction
crawl4ai version
0.4.247
Expected Behavior
Hello, thanks for this amazing software. I am trying to scrape some MCQ data from this website (https://web.livemcq.com/job-solution/bcs-question-bank/35th-bcs-question-solution-pdf/) using LLMExtractionStrategy. There are 200 questions on the page, and the response cuts out at around 152. I can't see any errors after execution. I am applying chunking. How should I go about debugging this? The code and output are attached below. Thanks in advance!
Note: For another page with 100 MCQs this code worked fine.
Current Behavior
There are 200 questions on the page, and the response cuts out at around 150.
Is this reproducible?
Yes
Inputs Causing the Bug
https://web.livemcq.com/job-solution/bcs-question-bank/35th-bcs-question-solution-pdf/
Steps to Reproduce
Code snippets
import os
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
import json
from tqdm.auto import tqdm
# import litellm
# litellm.drop_params = True

class MySchema(BaseModel):
    question: str = Field(..., description="question")
    answer: str = Field(..., description="answer to the question")
    analytics: str = Field(..., description="analytics related to the question")

async def main(url):
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(
        word_count_threshold=1,
        extraction_strategy=LLMExtractionStrategy(
            # Here you can use any provider that the LiteLLM library supports, for instance: ollama/qwen2
            provider="openai/gpt-4o-mini-2024-07-18",
            api_token=os.getenv('OPENAI_API_KEY'),
            schema=MySchema.model_json_schema(),
            extraction_type="schema",
            verbose=True,
            apply_chunking=True,
            # chunk_token_threshold = 2048,
            # overlap_rate = 0.1,
            input_format='cleaned_html',
            extra_args={"temperature": 0.0},
            instruction="""You are an assistant for extracting MCQ questions from web-crawled data.
The questions can be in Bengali or in English.
The crawled content has MCQ questions, each starting with a question denoted as প্রশ্ন and a correct answer denoted as সঠিক উত্তর, as well as analytics
of what the exam participants scored, given by the "Live MCQ Analytics" field. From the crawled content,
extract all mentioned questions (প্রশ্ন), answers (সঠিক উত্তর), and analytics (Live MCQ Analytics™).
One example is given below.
Given an input,
<strong>প্রশ্ন ২০. 'কবর' নাটকটির লেখক-</strong><br> ক) জসীমউদ্দীন<br> খ) নজরুল ইসলাম<br> গ) মুনীর চৌধুরী<br> ঘ) দ্বিজেন্দ্রলাল রায়</p><p><strong>সঠিক উত্তর:</strong> গ) মুনীর চৌধুরী</p><p><strong>Live MCQ Analytics™:</strong> সঠিক উত্তরদাতা: 76%, ভুল উত্তরদাতা: 13%, উত্তর করেননি: 10%</p><p><strong>ব্যাখ্যা:</strong> <i>এই প্রশ্ন সহ কয়েক লাখ প্রশ্নের অথেনটিক ব্যাখ্যা দেখতে Live MCQ অ্যাপ ইন্সটল করুন।</i>
the output should be,
{"question": "প্রশ্ন ২০. 'কবর' নাটকটির লেখক-
ক) জসীমউদ্দীন
খ) নজরুল ইসলাম
গ) মুনীর চৌধুরী
ঘ) দ্বিজেন্দ্রলাল রায়",
"answer": "গ) মুনীর চৌধুরী",
"analytics": "সঠিক উত্তরদাতা: 76%, ভুল উত্তরদাতা: 13%, উত্তর করেননি: 10%"}"""
        ),
        cache_mode=CacheMode.BYPASS,
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            # url='https://openai.com/api/pricing/',
            url=url,
            config=run_config,
            scan_full_page=True,
            scroll_delay=0.2,
            remove_overlay_elements=True
        )
        # print(result.html)
        data = json.loads(result.extracted_content)
        flname = url.split('/')[-2]
        with open('{}.json'.format(flname), 'w', encoding='utf-8') as json_file:
            json.dump(data, json_file, ensure_ascii=False, indent=4)

if __name__ == "__main__":
    with open('target_urls.txt', 'r') as f:
        urls = [line.strip() for line in f.readlines()][:1]
    for url in tqdm(urls):
        asyncio.run(main(url))
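One sanity check I can run after the crawl (a minimal sketch; it assumes every question block on the page contains the প্রশ্ন marker at least once, so the count is only approximate):

# Debugging sketch: compare the number of question markers in the page
# against the number of items the LLM actually returned.
# Assumption: each question block contains the "প্রশ্ন" marker.
expected = result.cleaned_html.count('প্রশ্ন')
extracted = len(json.loads(result.extracted_content))
print('~{} question markers on the page, {} items extracted'.format(expected, extracted))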
OS
Linux
Python version
3.12.8
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
[INIT].... Crawl4AI 0.4.247
[FETCH]... https://web.livemcq.com/job-solution/bcs-question-... | Status: True | Time: 2.56s
[SCRAPE].. Processed https://web.livemcq.com/job-solution/bcs-question-... | Time: 73ms
[LOG] Call LLM for https://web.livemcq.com/job-solution/bcs-question-bank/35th-bcs-question-solution-pdf/ - block index: 0
[LOG] Call LLM for https://web.livemcq.com/job-solution/bcs-question-bank/35th-bcs-question-solution-pdf/ - block index: 1
[LOG] Call LLM for https://web.livemcq.com/job-solution/bcs-question-bank/35th-bcs-question-solution-pdf/ - block index: 2
INFO:httpx:HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"
/home/reasat09/miniconda3/envs/crawl4ai/lib/python3.12/site-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:
- 'fields' has been removed
  warnings.warn(message, UserWarning)
18:37:01 - LiteLLM:INFO: utils.py:2820 - LiteLLM completion() model= gpt-4o-mini-2024-07-18; provider = openai
INFO:LiteLLM: LiteLLM completion() model= gpt-4o-mini-2024-07-18; provider = openai
18:37:01 - LiteLLM:INFO: utils.py:2820 - LiteLLM completion() model= gpt-4o-mini-2024-07-18; provider = openai
INFO:LiteLLM: LiteLLM completion() model= gpt-4o-mini-2024-07-18; provider = openai
18:37:01 - LiteLLM:INFO: utils.py:2820 - LiteLLM completion() model= gpt-4o-mini-2024-07-18; provider = openai
INFO:LiteLLM: LiteLLM completion() model= gpt-4o-mini-2024-07-18; provider = openai
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
18:37:04 - LiteLLM:INFO: utils.py:952 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[LOG] Extracted 1 blocks from URL: https://web.livemcq.com/job-solution/bcs-question-bank/35th-bcs-question-solution-pdf/ block index: 0
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
18:37:05 - LiteLLM:INFO: utils.py:952 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[LOG] Extracted 1 blocks from URL: https://web.livemcq.com/job-solution/bcs-question-bank/35th-bcs-question-solution-pdf/ block index: 2
^C
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
18:42:10 - LiteLLM:INFO: utils.py:952 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler
[LOG] Extracted 152 blocks from URL: https://web.livemcq.com/job-solution/bcs-question-bank/35th-bcs-question-solution-pdf/ block index: 1
[EXTRACT]. Completed for https://web.livemcq.com/job-solution/bcs-question-... | Time: 310.29557266272604s
[COMPLETE] https://web.livemcq.com/job-solution/bcs-question-... | Status: True | Total: 312.93s
I came across this bug as well.
js_code = """
(async function () {
    let scrollDuration = 60 * 1000; // Total duration in milliseconds (e.g., 60 seconds)
    let scrollDelay = 4 * 1000;     // Delay between scrolls in milliseconds (e.g., 4 seconds)
    let scrollAmount = 500;         // Scroll by 500 pixels each time
    let startTime = Date.now();
    let lastScrollHeight = 0;
    while (Date.now() - startTime < scrollDuration) {
        // Scroll down by a fixed number of pixels
        window.scrollBy(0, scrollAmount);
        console.log('Scrolling down by', scrollAmount, 'pixels.');
        // Wait for lazy-loaded content to appear (delay time)
        await new Promise(resolve => setTimeout(resolve, scrollDelay));
        // Check if new content has loaded by comparing scroll height
        let currentScrollHeight = document.body.scrollHeight;
        // If the scroll height didn't change, stop scrolling
        if (currentScrollHeight === lastScrollHeight) break;
        lastScrollHeight = currentScrollHeight;
    }
    console.log('Finished scrolling after ' + (Date.now() - startTime) / 1000 + ' seconds.');
})();
"""
It still returned an incomplete result. Note: I'm using crawl4ai 0.3.76, since the new version with #457 hasn't been released yet.
@aravindkarnam This must be checked against the new version.
@Reasat Thanks for trying Crawl4ai; please upgrade and check this again.
Hello!
I'm encountering a similar issue, as shown in the following example:
Source link: https://www.medscape.com/viewarticle/food-advertised-nfl-games-loaded-salt-fat-calories-2025a10002cy
When evaluating the results, I noticed that parts of the content are missing. This seems to happen randomly.
LLM Strategy
return LLMExtractionStrategy(
    provider=provider_config["provider"],  # gpt-4o-mini
    api_token=provider_config["api_key"],
    schema=MainContentModel.model_json_schema(),
    extraction_type="schema",
    instruction=prompt,
    chunk_token_threshold=1400,
    overlap_rate=0.1,
    apply_chunking=True,
    input_format="html",
    extra_args={
        "temperature": provider_config["temperature"],
        "max_tokens": provider_config["max_tokens"]
    }
)
Model
class MainContentModel(BaseModel):
    page_type: str = Field(description="1 if webpage is about one specific item, 0 otherwise")
    title: str = Field(description="Title of the article")
    content: str = Field(description="Main article content in HTML format")
    published_at: str = Field(description="Published date in YYYY-MM-DD format")
    error: str = Field(description="1 if error occurred, 0 otherwise")
Prompt
prompt = """You are a web scraping assistant with extensive experience in extracting data from various types of webpages.
Your expertise lies in accurately retrieving structured information while maintaining the integrity of the original content.
Your task is to extract specific information from a given webpage. Please adhere to the following guidelines while performing the extraction:
- page_type: Return '1' if the webpage is entirely about only one specific press release, event announcement, blog post or research article; otherwise stop the execution and return '0'.
- title: Extract the title of the article exactly as it appears on the webpage. Ensure that it is free from any additional formatting or interpretation.
- content: Extract the entire main article of the webpage in HTML format exactly as it appears. Maintain the original structure, including headings, paragraphs, lists, and any other HTML elements present. Format the data as HTML.
- published_at: Extract the published date of the article in the format YYYY-MM-DD. Ensure that the date is accurately sourced from the webpage, and if multiple dates are present, select the most relevant one associated with the article.
- error: Return '1' if there is any message or content that suggests an error occurred during the web scraping extraction, otherwise return '0'. Always use numbers 0 and 1.
"""
Sample of the missing data (screenshot)
Upon further investigation, it seems there's an issue with the links in the content.
Example 2 (screenshot)
Example 3 (screenshot)
@unclecode I am not sure the chunking code written here works as intended. If a document has a total token count greater than chunk_token_threshold, it does not chunk the document for me: the whole unchunked text goes to the model for processing. There should be an error or warning that tells the user this is happening, but the processing just runs (probably on a truncated text?); a guard like the one sketched below would at least surface it.
I also don't understand the logic behind the chunking code, why 'chunks' are named 'sections', or why the function is called 'merge' rather than something like 'chunk'.
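As a hypothetical illustration of the missing warning (my sketch, not existing crawl4ai code; WORD_TOKEN_RATE is an assumed stand-in for the word-to-token rate the strategy already stores):

import math
import warnings

WORD_TOKEN_RATE = 1.3  # assumed tokens-per-word estimate, not crawl4ai's actual value

def warn_if_unchunked(text, chunk_token_threshold):
    # Hypothetical guard: warn when a document exceeds the chunk threshold
    # but is about to be sent to the model as a single block.
    est_tokens = math.ceil(len(text.split(' ')) * WORD_TOKEN_RATE)
    if est_tokens > chunk_token_threshold:
        warnings.warn(
            'document is ~{} tokens but chunk_token_threshold is {}; '
            'it was not split and the output may be truncated'.format(
                est_tokens, chunk_token_threshold))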
Here's my implementation of chunking (a replacement for the existing function) that works for my target website.
def _merge2(self, documents, chunk_token_threshold, overlap):
    """
    Merge documents into sections based on chunk_token_threshold and overlap.
    Each chunk must have a token count <= chunk_token_threshold, with overlap.
    """
    # Note: requires `import math` at module level.
    def generate_chunk_index(start, end, step, overlap):
        ranges = []
        while start < end:
            next_start = min(start + step, end)
            ranges.append((start, next_start))
            start += step - overlap  # move forward while maintaining overlap
        return ranges

    # Example usage:
    #   generate_chunk_index(start=0, end=20, step=5, overlap=2)
    #   -> [(0, 5), (3, 8), (6, 11), (9, 14), (12, 17), (15, 20), (18, 20)]

    sections = []
    print('doc num', len(documents))
    for text in documents:
        words = text.split(' ')
        print('words num', len(words))
        tokens_num = math.ceil(len(words) * self.word_token_rate)
        print('tokens num', tokens_num)
        chunk_ranges = generate_chunk_index(0, tokens_num, chunk_token_threshold, overlap)
        print('chunks num', len(chunk_ranges))
        for i, j in chunk_ranges:
            # i, j are computed in token units but applied as word indices,
            # so the resulting chunk sizes are approximate.
            words_chunk = words[i:j]
            text_chunk = ' '.join(words_chunk)
            sections.append(text_chunk)
    print('sections num', len(sections))
    return sections
@Reasat You are absolutely right, and just yesterday I noticed it. Actually, I replaced it with another algorithm that is 12x faster; check the "next" branch, and this will be available in the next version.
@Reasat Regarding the naming and purpose of the function:
We start with a list of initial string chunks:
C = [C1, C2, ..., Cn]
Each chunk Ci contains a certain number of tokens. Our goal is to transform C into a new list where each chunk has approximately CTX tokens, ensuring efficient LLM processing.
Understanding CTX:
- CTX is the target number of tokens per call to the language model.
- It is user-defined and determines how input is batched.
- Example: If the total token count is 1000 and CTX = 200, we will create 5 calls (1000 / 200).
Transformation Process:
- Initialize an empty chunk S and iterate over C.
- For each chunk Ci, check if adding it keeps |S| <= CTX:
  if |S| + |Ci| <= CTX: S += Ci
  else: output S, start a new chunk with Ci
- Continue until all chunks are processed.
Result: A transformed list S = [S1, S2, ..., Sm] where each Si has |Si| ≈ CTX, produced in O(n).
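In code, the idea is roughly this (a sketch of the greedy pass; the toy token counter stands in for the real word-to-token estimate):

def merge(chunks, ctx):
    # Greedy one-pass merge: pack the initial chunks C1..Cn into
    # sections of roughly `ctx` tokens each.
    tokens = lambda s: len(s.split())  # toy token counter
    sections, current = [], ''
    for c in chunks:
        if current and tokens(current) + tokens(c) > ctx:
            sections.append(current)  # |S| + |Ci| would exceed CTX: emit S
            current = c               # start a new section with Ci
        else:
            current = (current + ' ' + c) if current else c
    if current:
        sections.append(current)      # flush the final section
    return sections

# e.g. merge(['a b', 'c d e', 'f'], ctx=4) -> ['a b', 'c d e f']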
Why did I choose "merge"? Tbh, I don't know 🤷, perhaps late-night coding!
@unclecode thanks a lot for the description! I'll check out the implementation in the next branch.