crawl4ai
Version 0.3.71 is more stable than 0.3.72
Hello,
I've found that version 0.3.71 is significantly more stable than 0.3.72. The fit_markdown function consistently returns empty results in the newer version.
Additionally, using magic=True limits crawling capabilities for many websites. For example, while it’s possible to crawl this site with magic=False, it becomes inaccessible when magic=True is enabled. Also, the remove_overlay_elements=True option doesn’t seem to work as expected.
I appreciate all your hard work on this library, and I understand these issues aren't easy to resolve. However, I suggest considering a temporary rollback to version 0.3.71 until 0.3.72 is stabilized.
Thank you,
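For anyone trying to reproduce the report above, here is a minimal sketch of the kind of check involved. It is an assumption-laden sketch, not part of the library: it assumes the crawl4ai 0.3.x `AsyncWebCrawler.arun()` API with `markdown` and `fit_markdown` fields on the result, and the URL in the usage comment is a placeholder.

```python
import asyncio

async def compare_markdown(url: str):
    # Deferred import so this sketch stays importable even without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, bypass_cache=True)
        # An empty fit_markdown next to a non-empty markdown would
        # reproduce the regression described above.
        return len(result.markdown or ""), len(result.fit_markdown or "")

# Usage (requires network and crawl4ai installed):
# asyncio.run(compare_markdown("https://example.com"))
```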
@unclecode I’ve revised certain parts of the AsyncCrawlerStrategy, and I believe this provides a solid foundation for you to build upon.
import asyncio
import base64
import time
from abc import ABC, abstractmethod
from typing import Callable, Dict, Any, List, Optional, Awaitable
import os
from playwright.async_api import async_playwright, Page, Browser, Error
from io import BytesIO
from PIL import Image, ImageDraw, ImageFont
from pathlib import Path
from playwright.async_api import ProxySettings
from pydantic import BaseModel
import hashlib
import json
import uuid
from playwright_stealth import StealthConfig, stealth_async
import random
stealth_config = StealthConfig(
    webdriver=True,
    chrome_app=True,
    chrome_csi=True,
    chrome_load_times=True,
    chrome_runtime=True,
    navigator_languages=True,
    navigator_plugins=True,
    navigator_permissions=True,
    webgl_vendor=True,
    outerdimensions=True,
    navigator_hardware_concurrency=True,
    media_codecs=True,
)
advanced_stealth_config = """
() => {
// Override property descriptors
const propertyDescriptors = {
// Navigator properties
hardwareConcurrency: { value: 4 + Math.floor(Math.random() * 4) },
deviceMemory: { value: 8 + Math.floor(Math.random() * 8) },
userAgent: { value: window.navigator.userAgent.replace(/\\(.*?\\)/, '(Windows NT 10.0; Win64; x64)') },
platform: { value: 'Win32' },
language: { value: 'en-US' },
languages: { value: ['en-US', 'en'] },
vendor: { value: 'Google Inc.' },
// Screen properties
width: { value: 1920 },
height: { value: 1080 },
colorDepth: { value: 24 },
pixelDepth: { value: 24 },
// Additional properties
doNotTrack: { value: null },
maxTouchPoints: { value: 0 },
webdriver: { value: undefined },
};
// Helper to define non-configurable properties
const defineProperty = (obj, prop, value) => {
Object.defineProperty(obj, prop, {
...value,
configurable: false,
enumerable: true,
writable: false
});
};
// Override navigator properties
for (const [key, descriptor] of Object.entries(propertyDescriptors)) {
try {
defineProperty(Object.getPrototypeOf(navigator), key, descriptor);
} catch (e) {}
}
// Override screen properties
for (const [key, descriptor] of Object.entries(propertyDescriptors)) {
try {
defineProperty(screen, key, descriptor);
} catch (e) {}
}
// Advanced WebGL fingerprint spoofing
const getParameter = WebGLRenderingContext.prototype.getParameter;
WebGLRenderingContext.prototype.getParameter = function(parameter) {
// Spoof renderer info
if (parameter === 37445) {
return 'Intel Open Source Technology Center';
}
// Spoof vendor info
if (parameter === 37446) {
return 'Mesa DRI Intel(R) HD Graphics 520 (Skylake GT2)';
}
return getParameter.apply(this, arguments);
};
// Add canvas noise
const originalGetContext = HTMLCanvasElement.prototype.getContext;
HTMLCanvasElement.prototype.getContext = function() {
const context = originalGetContext.apply(this, arguments);
if (context && arguments[0] === '2d') {
const originalFillRect = context.fillRect;
context.fillRect = function() {
originalFillRect.apply(this, arguments);
const pixels = context.getImageData(0, 0, this.canvas.width, this.canvas.height);
for (let i = 0; i < pixels.data.length; i += 4) {
pixels.data[i] += Math.floor(Math.random() * 2);
}
context.putImageData(pixels, 0, 0);
};
}
return context;
};
// Hide automation-related properties
const automationProperties = [
'webdriver',
'__webdriver_evaluate',
'__selenium_evaluate',
'__webdriver_script_function',
'__webdriver_script_func',
'__webdriver_script_fn',
'__fxdriver_evaluate',
'__driver_unwrapped',
'__webdriver_unwrapped',
'__driver_evaluate',
'__selenium_unwrapped',
'__fxdriver_unwrapped',
'_Selenium_IDE_Recorder',
'_selenium',
'calledSelenium',
'$cdc_asdjflasutopfhvcZLmcfl_',
'$chrome_asyncScriptInfo',
'__$webdriverAsyncExecutor',
'WebDriver'
];
// Remove automation-related properties
automationProperties.forEach(prop => {
Object.defineProperty(window, prop, {
get: () => undefined,
set: () => {},
configurable: false
});
});
// Spoof permissions API
const originalQuery = window.navigator.permissions.query;
window.navigator.permissions.query = function(parameters) {
return parameters.name === 'notifications'
? Promise.resolve({ state: Notification.permission })
: originalQuery.apply(this, arguments);
};
// Add basic browser functionality expected by bot detectors
window.chrome = {
app: {
InstallState: {
DISABLED: 'disabled',
INSTALLED: 'installed',
NOT_INSTALLED: 'not_installed'
},
RunningState: {
CANNOT_RUN: 'cannot_run',
READY_TO_RUN: 'ready_to_run',
RUNNING: 'running'
},
getDetails: function() {},
getIsInstalled: function() {},
installState: function() {},
isInstalled: false,
runningState: function() {}
},
runtime: {
OnInstalledReason: {
CHROME_UPDATE: 'chrome_update',
INSTALL: 'install',
SHARED_MODULE_UPDATE: 'shared_module_update',
UPDATE: 'update'
},
PlatformArch: {
ARM: 'arm',
ARM64: 'arm64',
MIPS: 'mips',
MIPS64: 'mips64',
X86_32: 'x86-32',
X86_64: 'x86-64'
},
PlatformNaclArch: {
ARM: 'arm',
MIPS: 'mips',
MIPS64: 'mips64',
X86_32: 'x86-32',
X86_64: 'x86-64'
},
PlatformOs: {
ANDROID: 'android',
CROS: 'cros',
LINUX: 'linux',
MAC: 'mac',
OPENBSD: 'openbsd',
WIN: 'win'
},
RequestUpdateCheckStatus: {
NO_UPDATE: 'no_update',
THROTTLED: 'throttled',
UPDATE_AVAILABLE: 'update_available'
}
}
};
}
"""
class AsyncCrawlResponse(BaseModel):
    html: str
    response_headers: Dict[str, str]
    status_code: int
    screenshot: Optional[str] = None
    get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None

    class Config:
        arbitrary_types_allowed = True
class AsyncCrawlerStrategy(ABC):
    @abstractmethod
    async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
        pass

    @abstractmethod
    async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
        pass

    @abstractmethod
    async def take_screenshot(self, **kwargs) -> str:
        pass

    @abstractmethod
    def update_user_agent(self, user_agent: str):
        pass

    @abstractmethod
    def set_hook(self, hook_type: str, hook: Callable):
        pass
class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
    def __init__(self, use_cached_html=False, js_code=None, **kwargs):
        self.use_cached_html = use_cached_html
        self.user_agent = kwargs.get(
            "user_agent",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
        )
        self.proxy = kwargs.get("proxy")
        self.proxy_config = kwargs.get("proxy_config")
        self.headless = kwargs.get("headless", True)
        self.browser_type = kwargs.get("browser_type", "chromium")
        self.headers = kwargs.get("headers", {})
        self.sessions = {}
        self.session_ttl = 1800
        self.js_code = js_code
        self.verbose = kwargs.get("verbose", False)
        self.playwright = None
        self.browser = None
        self.sleep_on_close = kwargs.get("sleep_on_close", False)
        self.hooks = {
            'on_browser_created': None,
            'on_user_agent_updated': None,
            'on_execution_started': None,
            'before_goto': None,
            'after_goto': None,
            'before_return_html': None,
            'before_retrieve_html': None
        }
    async def __aenter__(self):
        await self.start()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self.close()

    async def start(self):
        if self.playwright is None:
            self.playwright = await async_playwright().start()
        if self.browser is None:
            browser_args = {
                "headless": self.headless,
                "args": [
                    "--disable-gpu",
                    "--no-sandbox",
                    "--disable-dev-shm-usage",
                    "--disable-blink-features=AutomationControlled",
                    "--disable-infobars",
                    "--window-position=0,0",
                    "--ignore-certificate-errors",
                    "--ignore-certificate-errors-spki-list",
                    "--disable-features=IsolateOrigins,site-per-process",
                    "--disable-site-isolation-trials",
                    "--disable-web-security",
                    "--disable-features=site-per-process",
                    "--start-maximized"
                ]
            }
            if self.proxy:
                proxy_settings = ProxySettings(server=self.proxy)
                browser_args["proxy"] = proxy_settings
            elif self.proxy_config:
                proxy_settings = ProxySettings(
                    server=self.proxy_config.get("server"),
                    username=self.proxy_config.get("username"),
                    password=self.proxy_config.get("password")
                )
                browser_args["proxy"] = proxy_settings
            if self.browser_type == "firefox":
                self.browser = await self.playwright.firefox.launch(**browser_args)
            elif self.browser_type == "webkit":
                self.browser = await self.playwright.webkit.launch(**browser_args)
            else:
                self.browser = await self.playwright.chromium.launch(**browser_args)
            await self.execute_hook('on_browser_created', self.browser)

    async def close(self):
        if self.sleep_on_close:
            await asyncio.sleep(0.5)
        if self.browser:
            await self.browser.close()
            self.browser = None
        if self.playwright:
            await self.playwright.stop()
            self.playwright = None
    def __del__(self):
        # Best-effort cleanup; unreliable from __del__ if the loop is already running,
        # so prefer the async context manager or an explicit close().
        if self.browser or self.playwright:
            loop = asyncio.get_event_loop()
            if not loop.is_running():
                loop.run_until_complete(self.close())
    def set_hook(self, hook_type: str, hook: Callable):
        if hook_type in self.hooks:
            self.hooks[hook_type] = hook
        else:
            raise ValueError(f"Invalid hook type: {hook_type}")

    async def execute_hook(self, hook_type: str, *args):
        hook = self.hooks.get(hook_type)
        if hook:
            if asyncio.iscoroutinefunction(hook):
                return await hook(*args)
            else:
                return hook(*args)
        return args[0] if args else None

    def update_user_agent(self, user_agent: str):
        self.user_agent = user_agent

    def set_custom_headers(self, headers: Dict[str, str]):
        self.headers = headers

    async def kill_session(self, session_id: str):
        if session_id in self.sessions:
            context, page, _ = self.sessions[session_id]
            await page.close()
            await context.close()
            del self.sessions[session_id]

    def _cleanup_expired_sessions(self):
        current_time = time.time()
        expired_sessions = [
            sid for sid, (_, _, last_used) in self.sessions.items()
            if current_time - last_used > self.session_ttl
        ]
        for sid in expired_sessions:
            asyncio.create_task(self.kill_session(sid))
    async def smart_wait(self, page: Page, wait_for: str, timeout: float = 30000):
        wait_for = wait_for.strip()
        if wait_for.startswith('js:'):
            # Explicitly specified JavaScript
            js_code = wait_for[3:].strip()
            return await self.csp_compliant_wait(page, js_code, timeout)
        elif wait_for.startswith('css:'):
            # Explicitly specified CSS selector
            css_selector = wait_for[4:].strip()
            try:
                await page.wait_for_selector(css_selector, timeout=timeout)
            except Error as e:
                if 'Timeout' in str(e):
                    raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{css_selector}'")
                else:
                    raise ValueError(f"Invalid CSS selector: '{css_selector}'")
        else:
            # Auto-detect based on content
            if wait_for.startswith('()') or wait_for.startswith('function'):
                # It's likely a JavaScript function
                return await self.csp_compliant_wait(page, wait_for, timeout)
            else:
                # Assume it's a CSS selector first
                try:
                    await page.wait_for_selector(wait_for, timeout=timeout)
                except Error as e:
                    if 'Timeout' in str(e):
                        raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{wait_for}'")
                    else:
                        # If it's not a timeout error, it might be an invalid selector.
                        # Try to evaluate it as a JavaScript function as a fallback.
                        try:
                            return await self.csp_compliant_wait(page, f"() => {{{wait_for}}}", timeout)
                        except Error:
                            raise ValueError(f"Invalid wait_for parameter: '{wait_for}'. "
                                             "It should be either a valid CSS selector, a JavaScript function, "
                                             "or explicitly prefixed with 'js:' or 'css:'.")

    async def csp_compliant_wait(self, page: Page, user_wait_function: str, timeout: float = 30000):
        wrapper_js = f"""
        async () => {{
            const userFunction = {user_wait_function};
            const startTime = Date.now();
            while (true) {{
                if (await userFunction()) {{
                    return true;
                }}
                if (Date.now() - startTime > {timeout}) {{
                    throw new Error('Timeout waiting for condition');
                }}
                await new Promise(resolve => setTimeout(resolve, 100));
            }}
        }}
        """
        try:
            await page.evaluate(wrapper_js)
        except TimeoutError:
            raise TimeoutError(f"Timeout after {timeout}ms waiting for condition")
        except Exception as e:
            raise RuntimeError(f"Error in wait condition: {str(e)}")
    async def process_iframes(self, page):
        # Find all iframes
        iframes = await page.query_selector_all('iframe')
        for i, iframe in enumerate(iframes):
            try:
                # Add a unique identifier to the iframe
                await iframe.evaluate(f'(element) => element.id = "iframe-{i}"')
                # Get the frame associated with this iframe
                frame = await iframe.content_frame()
                if frame:
                    # Wait for the frame to load
                    await frame.wait_for_load_state('load', timeout=30000)  # 30 seconds timeout
                    # Extract the content of the iframe's body
                    iframe_content = await frame.evaluate('() => document.body.innerHTML')
                    # Generate a unique class name for this iframe
                    class_name = f'extracted-iframe-content-{i}'
                    # Replace the iframe with a div containing the extracted content
                    _iframe = iframe_content.replace('`', '\\`')
                    await page.evaluate(f"""
                        () => {{
                            const iframe = document.getElementById('iframe-{i}');
                            const div = document.createElement('div');
                            div.innerHTML = `{_iframe}`;
                            div.className = '{class_name}';
                            iframe.replaceWith(div);
                        }}
                    """)
                else:
                    print(f"Warning: Could not access content frame for iframe {i}")
            except Exception as e:
                print(f"Error processing iframe {i}: {str(e)}")
        # Return the page object
        return page
    async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
        response_headers = {}
        status_code = None
        page = None
        context = None
        try:
            self._cleanup_expired_sessions()
            session_id = kwargs.get("session_id")
            # Advanced stealth configuration: reuse the module-level
            # advanced_stealth_config defined at the top of this file
            # rather than redefining an abbreviated copy of it here.
            # Session handling
            if session_id:
                context, page, _ = self.sessions.get(session_id, (None, None, None))
                if not context:
                    context = await self.browser.new_context(
                        user_agent=self.user_agent,
                        viewport={"width": 1920, "height": 1080},
                        proxy={"server": self.proxy} if self.proxy else None,
                        accept_downloads=True,
                        java_script_enabled=True
                    )
                    await context.add_cookies([{"name": "cookiesEnabled", "value": "true", "url": url}])
                    await context.set_extra_http_headers(self.headers)
                    page = await context.new_page()
                    self.sessions[session_id] = (context, page, time.time())
            else:
                context = await self.browser.new_context(
                    user_agent=self.user_agent,
                    viewport={"width": 1920, "height": 1080},
                    proxy={"server": self.proxy} if self.proxy else None
                )
                await context.set_extra_http_headers(self.headers)
                if kwargs.get("override_navigator", False) or kwargs.get("simulate_user", False) or kwargs.get("magic", False):
                    print("[LOG] 🕵️‍♂️ Applying advanced stealth configuration...")
                    await context.add_init_script(advanced_stealth_config)
                page = await context.new_page()
            await self._remove_obstacles(page)
            # Add console logging if requested
            if kwargs.get("log_console", False):
                page.on("console", lambda msg: print(f"Console: {msg.text}"))
                page.on("pageerror", lambda exc: print(f"Page Error: {exc}"))
            # Navigation
            if self.verbose:
                print(f"[LOG] 🕸️ Crawling {url} using AsyncPlaywrightCrawlerStrategy...")
            if not kwargs.get("js_only", False):
                await self.execute_hook('before_goto', page)
                response = await page.goto(url, wait_until="domcontentloaded",
                                           timeout=kwargs.get("page_timeout", 60000))
                await self.execute_hook('after_goto', page)
                status_code = response.status
                response_headers = response.headers
            else:
                status_code = 200
                response_headers = {}
            # Human behavior simulation if magic mode is enabled
            if kwargs.get("simulate_user", False) or kwargs.get("magic", False):
                await page.evaluate("""
                    async () => {
                        // Smooth scrolling
                        const smoothScroll = async () => {
                            const height = Math.max(document.body.scrollHeight, document.documentElement.scrollHeight);
                            const scrollSteps = Math.floor(height / 100);
                            for (let i = 0; i < scrollSteps; i++) {
                                window.scrollTo({
                                    top: i * 100,
                                    behavior: 'smooth'
                                });
                                await new Promise(r => setTimeout(r, 100 + Math.random() * 400));
                            }
                        };
                        // Mouse movement simulation
                        const simulateMouseMovement = async () => {
                            const moves = 10 + Math.floor(Math.random() * 20);
                            for (let i = 0; i < moves; i++) {
                                const x = Math.random() * window.innerWidth;
                                const y = Math.random() * window.innerHeight;
                                const event = new MouseEvent('mousemove', {
                                    view: window,
                                    bubbles: true,
                                    cancelable: true,
                                    clientX: x,
                                    clientY: y
                                });
                                document.dispatchEvent(event);
                                await new Promise(r => setTimeout(r, 50 + Math.random() * 200));
                            }
                        };
                        await smoothScroll();
                        await simulateMouseMovement();
                    }
                """)
                # Add random delays
                await page.wait_for_timeout(1000 + int(2000 * random.random()))
            # Continue with the rest of your existing crawl method...
            # [Your existing code for wait_for, screenshots, etc.]
            # Get the final HTML
            html = await page.content()
            # Return the response
            return AsyncCrawlResponse(
                html=html,
                response_headers=response_headers,
                status_code=status_code
            )
        except Exception as e:
            raise Error(f"[ERROR] 🚫 crawl(): Failed to crawl {url}: {str(e)}")
        finally:
            if not session_id and page:
                await page.close()
            if not session_id and context:
                await context.close()
    async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
        semaphore_count = kwargs.get('semaphore_count', 5)  # Adjust as needed
        semaphore = asyncio.Semaphore(semaphore_count)

        async def crawl_with_semaphore(url):
            async with semaphore:
                return await self.crawl(url, **kwargs)

        tasks = [crawl_with_semaphore(url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return [result if not isinstance(result, Exception) else str(result) for result in results]
    async def _remove_obstacles(self, page: Page):
        """Remove overlays, cookie notices, and other obstacles"""
        await page.evaluate("""
            () => {
                const removeElements = () => {
                    const selectors = [
                        // Cookie and consent related
                        '#didomi-host',
                        '.didomi-popup-backdrop',
                        '.didomi-popup-notice',
                        '.fc-consent-root',
                        '.fc-dialog-overlay',
                        '[class*="cookie-banner"]',
                        '[class*="consent"]',
                        '[class*="gdpr"]',
                        // Modals and overlays
                        '.modal',
                        '.overlay',
                        '[class*="paywall"]',
                        '[class*="subscribe"]',
                        '[role="dialog"]',
                        // Other potential obstacles
                        '.interstitial',
                        '.spinner',
                        '.loading',
                        '#loading',
                    ];
                    selectors.forEach(selector => {
                        document.querySelectorAll(selector).forEach(elem => {
                            elem.remove();
                        });
                    });
                    // Fix page scrolling
                    document.body.style.overflow = 'auto';
                    document.documentElement.style.overflow = 'auto';
                    document.body.style.position = 'static';
                };
                // Initial cleanup
                removeElements();
                // Watch for dynamic additions
                const observer = new MutationObserver(removeElements);
                observer.observe(document.body, {
                    childList: true,
                    subtree: true
                });
                // Stop observing after 10 seconds
                setTimeout(() => observer.disconnect(), 10000);
            }
        """)
        # Small delay to ensure cleanup
        await asyncio.sleep(1)
    async def remove_overlay_elements(self, page: Page) -> None:
        """
        Removes popup overlays, modals, cookie notices, and other intrusive elements from the page.

        Args:
            page (Page): The Playwright page instance
        """
        remove_overlays_js = """
            async () => {
                // Function to check if element is visible
                const isVisible = (elem) => {
                    const style = window.getComputedStyle(elem);
                    return style.display !== 'none' &&
                           style.visibility !== 'hidden' &&
                           style.opacity !== '0';
                };
                // Common selectors for popups and overlays
                const commonSelectors = [
                    // Close buttons first
                    'button[class*="close" i]', 'button[class*="dismiss" i]',
                    'button[aria-label*="close" i]', 'button[title*="close" i]',
                    'a[class*="close" i]', 'span[class*="close" i]',
                    // Cookie notices
                    '[class*="cookie-banner" i]', '[id*="cookie-banner" i]',
                    '[class*="cookie-consent" i]', '[id*="cookie-consent" i]',
                    // Newsletter/subscription dialogs
                    '[class*="newsletter" i]', '[class*="subscribe" i]',
                    // Generic popups/modals
                    '[class*="popup" i]', '[class*="modal" i]',
                    '[class*="overlay" i]', '[class*="dialog" i]',
                    '[role="dialog"]', '[role="alertdialog"]'
                ];
                // Try to click close buttons first
                for (const selector of commonSelectors.slice(0, 6)) {
                    const closeButtons = document.querySelectorAll(selector);
                    for (const button of closeButtons) {
                        if (isVisible(button)) {
                            try {
                                button.click();
                                await new Promise(resolve => setTimeout(resolve, 100));
                            } catch (e) {
                                console.log('Error clicking button:', e);
                            }
                        }
                    }
                }
                // Remove remaining overlay elements
                const removeOverlays = () => {
                    // Find elements with high z-index
                    const allElements = document.querySelectorAll('*');
                    for (const elem of allElements) {
                        const style = window.getComputedStyle(elem);
                        const zIndex = parseInt(style.zIndex);
                        const position = style.position;
                        if (
                            isVisible(elem) &&
                            (zIndex > 999 || position === 'fixed' || position === 'absolute') &&
                            (
                                elem.offsetWidth > window.innerWidth * 0.5 ||
                                elem.offsetHeight > window.innerHeight * 0.5 ||
                                style.backgroundColor.includes('rgba') ||
                                parseFloat(style.opacity) < 1
                            )
                        ) {
                            elem.remove();
                        }
                    }
                    // Remove elements matching common selectors
                    for (const selector of commonSelectors) {
                        const elements = document.querySelectorAll(selector);
                        elements.forEach(elem => {
                            if (isVisible(elem)) {
                                elem.remove();
                            }
                        });
                    }
                };
                // Remove overlay elements
                removeOverlays();
                // Remove any fixed/sticky position elements at the top/bottom
                const removeFixedElements = () => {
                    const elements = document.querySelectorAll('*');
                    elements.forEach(elem => {
                        const style = window.getComputedStyle(elem);
                        if (
                            (style.position === 'fixed' || style.position === 'sticky') &&
                            isVisible(elem)
                        ) {
                            elem.remove();
                        }
                    });
                };
                removeFixedElements();
                // Remove empty block elements such as div, p, span, etc.
                const removeEmptyBlockElements = () => {
                    const blockElements = document.querySelectorAll('div, p, span, section, article, header, footer, aside, nav, main, ul, ol, li, dl, dt, dd, h1, h2, h3, h4, h5, h6');
                    blockElements.forEach(elem => {
                        if (elem.innerText.trim() === '') {
                            elem.remove();
                        }
                    });
                };
                removeEmptyBlockElements();  // was defined but never invoked
                // Remove margin-right and padding-right from body (often added by modal scripts)
                document.body.style.marginRight = '0px';
                document.body.style.paddingRight = '0px';
                document.body.style.overflow = 'auto';
                // Wait a bit for any animations to complete
                await new Promise(resolve => setTimeout(resolve, 100));
            }
        """
        try:
            await page.evaluate(remove_overlays_js)
            await page.wait_for_timeout(500)  # Wait for any animations to complete
        except Exception as e:
            if self.verbose:
                print(f"Warning: Failed to remove overlay elements: {str(e)}")
    async def take_screenshot(self, page: Page) -> str:
        try:
            # The page is already loaded, just take the screenshot
            screenshot = await page.screenshot(full_page=True)
            return base64.b64encode(screenshot).decode('utf-8')
        except Exception as e:
            error_message = f"Failed to take screenshot: {str(e)}"
            print(error_message)
            # Generate an error image
            img = Image.new('RGB', (800, 600), color='black')
            draw = ImageDraw.Draw(img)
            font = ImageFont.load_default()
            draw.text((10, 10), error_message, fill=(255, 255, 255), font=font)
            buffered = BytesIO()
            img.save(buffered, format="JPEG")
            return base64.b64encode(buffered.getvalue()).decode('utf-8')
        finally:
            await page.close()
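The crawl_many implementation above caps concurrency with an asyncio.Semaphore and keeps one failure from cancelling the rest via return_exceptions=True. That pattern can be exercised in isolation with a stub coroutine (the names below are stand-ins, not part of the library):

```python
import asyncio

async def bounded_gather(items, worker, limit=5):
    # Same shape as crawl_many: a semaphore caps in-flight tasks,
    # and return_exceptions=True turns failures into returned Exceptions.
    semaphore = asyncio.Semaphore(limit)

    async def run(item):
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(run(i) for i in items), return_exceptions=True)

async def fake_crawl(url):
    # Stand-in for self.crawl(): succeeds or raises based on the input.
    await asyncio.sleep(0.01)
    if "bad" in url:
        raise ValueError(f"failed: {url}")
    return f"html of {url}"

results = asyncio.run(bounded_gather(["a", "bad", "c"], fake_crawl, limit=2))
# results mixes successful values and Exception instances, in input order
```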
Hi, both 'magic' and 'fit_markdown' are experimental features, and they shouldn’t interfere with the library’s core functionality relative to version 0.3.71. The 'magic' flag is set to False by default. I’m focusing on refining this approach so it can extract the key parts of an article without relying on large language models, and it will take time to reach a stable solution.
I’d appreciate any feedback you have, especially if you encounter issues with standard crawling when the magic flag isn’t enabled. By default, the flag is false, and the fit_markdown shouldn’t affect the regular markdown output. Could you confirm this?
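A quick way to check the point above, sketched under the same assumptions as before (the 0.3.x `arun()` API, `magic` defaulting to False, and a `success` field on the result):

```python
import asyncio

async def check_defaults(url: str):
    # Deferred import; crawl4ai may not be installed in every environment.
    from crawl4ai import AsyncWebCrawler
    async with AsyncWebCrawler() as crawler:
        # No magic flag passed: this is the default path that should be unaffected.
        plain = await crawler.arun(url=url, bypass_cache=True)
        # Explicitly enabled: the experimental path under discussion.
        magical = await crawler.arun(url=url, magic=True, bypass_cache=True)
        return plain.success, magical.success

# Usage (requires network): asyncio.run(check_defaults("https://example.com"))
```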
Thanks again for sharing the code suggestion; I’ll review it soon and post an update here.
Hi, version 0.3.71 is less verbose than 0.3.72 when it comes to markdown, which is why I decided to roll back to 0.3.71. The library still functions, but the magic and fit_markdown features aren’t working in this version.
@YassKhazzan This is very interesting. Would you please help me by sharing the URL and the markdown output when using versions 71 and 72? I want to understand what changes made version 71 less verbose, specifically in the markdown itself, not in footnotes or magic mode. I appreciate your help, thank you.
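To make that comparison concrete, one option is to save the markdown produced by each version to a file and diff the two. A small stdlib-only helper (hypothetical, not part of crawl4ai) could look like:

```python
import difflib
from pathlib import Path

def markdown_diff(path_071: str, path_072: str) -> str:
    # Unified diff of the markdown produced by the two versions,
    # labelled so it is clear which side is which.
    old = Path(path_071).read_text().splitlines()
    new = Path(path_072).read_text().splitlines()
    return "\n".join(
        difflib.unified_diff(old, new, fromfile="0.3.71", tofile="0.3.72", lineterm="")
    )
```

Posting that diff (plus the URL) would show exactly where 0.3.72 becomes more verbose.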