crawl4ai

Version 0.3.71 is more stable than 0.3.72

Open YassKhazzan opened this issue 1 year ago • 4 comments

Hello,

I've found that version 0.3.71 is significantly more stable than 0.3.72. In the newer version, the fit_markdown function consistently returns empty results.

Additionally, using magic=True limits crawling capabilities for many websites. For example, while it’s possible to crawl this site with magic=False, it becomes inaccessible when magic=True is enabled. Also, the remove_overlay_elements=True option doesn’t seem to work as expected.

I appreciate all your hard work on this library, and I understand these issues aren't easy to resolve. However, I suggest considering a temporary rollback to version 0.3.71 until 0.3.72 is stabilized.

Thank you,

YassKhazzan · Oct 29 '24 15:10

@unclecode I’ve revised certain parts of the AsyncCrawlerStrategy, and I believe this provides a solid foundation for you to build upon.

import asyncio
import base64
import time
from abc import ABC, abstractmethod
from typing import Callable, Dict, Any, List, Optional, Awaitable
import os
from playwright.async_api import async_playwright, Page, Browser, Error
from io import BytesIO
from PIL import Image, ImageDraw, ImageFont
from pathlib import Path
from playwright.async_api import ProxySettings
from pydantic import BaseModel
import hashlib
import json
import uuid
from playwright_stealth import StealthConfig, stealth_async
import random

stealth_config = StealthConfig(
    webdriver=True,
    chrome_app=True,
    chrome_csi=True,
    chrome_load_times=True,
    chrome_runtime=True,
    navigator_languages=True,
    navigator_plugins=True,
    navigator_permissions=True,
    webgl_vendor=True,
    outerdimensions=True,
    navigator_hardware_concurrency=True,
    media_codecs=True,
)

advanced_stealth_config = """
() => {
    // Override property descriptors
    const propertyDescriptors = {
        // Navigator properties
        hardwareConcurrency: { value: 4 + Math.floor(Math.random() * 4) },
        deviceMemory: { value: 8 + Math.floor(Math.random() * 8) },
        userAgent: { value: window.navigator.userAgent.replace(/\\(.*?\\)/, '(Windows NT 10.0; Win64; x64)') },
        platform: { value: 'Win32' },
        language: { value: 'en-US' },
        languages: { value: ['en-US', 'en'] },
        vendor: { value: 'Google Inc.' },

        // Screen properties
        width: { value: 1920 },
        height: { value: 1080 },
        colorDepth: { value: 24 },
        pixelDepth: { value: 24 },

        // Additional properties
        doNotTrack: { value: null },
        maxTouchPoints: { value: 0 },
        webdriver: { value: undefined },
    };

    // Helper to define non-configurable properties
    const defineProperty = (obj, prop, value) => {
        Object.defineProperty(obj, prop, {
            ...value,
            configurable: false,
            enumerable: true,
            writable: false
        });
    };

    // Override navigator properties
    for (const [key, descriptor] of Object.entries(propertyDescriptors)) {
        try {
            defineProperty(Object.getPrototypeOf(navigator), key, descriptor);
        } catch (e) {}
    }

    // Override screen properties
    for (const [key, descriptor] of Object.entries(propertyDescriptors)) {
        try {
            defineProperty(screen, key, descriptor);
        } catch (e) {}
    }

    // Advanced WebGL fingerprint spoofing
    const getParameter = WebGLRenderingContext.prototype.getParameter;
    WebGLRenderingContext.prototype.getParameter = function(parameter) {
        // 37445 is UNMASKED_VENDOR_WEBGL: spoof the vendor string
        if (parameter === 37445) {
            return 'Intel Open Source Technology Center';
        }
        // 37446 is UNMASKED_RENDERER_WEBGL: spoof the renderer string
        if (parameter === 37446) {
            return 'Mesa DRI Intel(R) HD Graphics 520 (Skylake GT2)';
        }
        return getParameter.apply(this, arguments);
    };

    // Add canvas noise
    const originalGetContext = HTMLCanvasElement.prototype.getContext;
    HTMLCanvasElement.prototype.getContext = function() {
        const context = originalGetContext.apply(this, arguments);
        if (context && arguments[0] === '2d') {
            const originalFillRect = context.fillRect;
            context.fillRect = function() {
                originalFillRect.apply(this, arguments);
                const pixels = context.getImageData(0, 0, this.canvas.width, this.canvas.height);
                for (let i = 0; i < pixels.data.length; i += 4) {
                    pixels.data[i] += Math.floor(Math.random() * 2);
                }
                context.putImageData(pixels, 0, 0);
            };
        }
        return context;
    };

    // Hide automation-related properties
    const automationProperties = [
        'webdriver',
        '__webdriver_evaluate',
        '__selenium_evaluate',
        '__webdriver_script_function',
        '__webdriver_script_func',
        '__webdriver_script_fn',
        '__fxdriver_evaluate',
        '__driver_unwrapped',
        '__webdriver_unwrapped',
        '__driver_evaluate',
        '__selenium_unwrapped',
        '__fxdriver_unwrapped',
        '_Selenium_IDE_Recorder',
        '_selenium',
        'calledSelenium',
        '$cdc_asdjflasutopfhvcZLmcfl_',
        '$chrome_asyncScriptInfo',
        '__$webdriverAsyncExecutor',
        'WebDriver'
    ];

    // Remove automation-related properties
    automationProperties.forEach(prop => {
        Object.defineProperty(window, prop, {
            get: () => undefined,
            set: () => {},
            configurable: false
        });
    });

    // Spoof permissions API
    const originalQuery = window.navigator.permissions.query;
    window.navigator.permissions.query = function(parameters) {
        return parameters.name === 'notifications' 
            ? Promise.resolve({ state: Notification.permission })
            : originalQuery.apply(this, arguments);
    };

    // Add basic browser functionality expected by bot detectors
    window.chrome = {
        app: {
            InstallState: {
                DISABLED: 'disabled',
                INSTALLED: 'installed',
                NOT_INSTALLED: 'not_installed'
            },
            RunningState: {
                CANNOT_RUN: 'cannot_run',
                READY_TO_RUN: 'ready_to_run',
                RUNNING: 'running'
            },
            getDetails: function() {},
            getIsInstalled: function() {},
            installState: function() {},
            isInstalled: false,
            runningState: function() {}
        },
        runtime: {
            OnInstalledReason: {
                CHROME_UPDATE: 'chrome_update',
                INSTALL: 'install',
                SHARED_MODULE_UPDATE: 'shared_module_update',
                UPDATE: 'update'
            },
            PlatformArch: {
                ARM: 'arm',
                ARM64: 'arm64',
                MIPS: 'mips',
                MIPS64: 'mips64',
                X86_32: 'x86-32',
                X86_64: 'x86-64'
            },
            PlatformNaclArch: {
                ARM: 'arm',
                MIPS: 'mips',
                MIPS64: 'mips64',
                X86_32: 'x86-32',
                X86_64: 'x86-64'
            },
            PlatformOs: {
                ANDROID: 'android',
                CROS: 'cros',
                LINUX: 'linux',
                MAC: 'mac',
                OPENBSD: 'openbsd',
                WIN: 'win'
            },
            RequestUpdateCheckStatus: {
                NO_UPDATE: 'no_update',
                THROTTLED: 'throttled',
                UPDATE_AVAILABLE: 'update_available'
            }
        }
    };
}
"""


class AsyncCrawlResponse(BaseModel):
    html: str
    response_headers: Dict[str, str]
    status_code: int
    screenshot: Optional[str] = None
    get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None

    class Config:
        arbitrary_types_allowed = True


class AsyncCrawlerStrategy(ABC):
    @abstractmethod
    async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
        pass

    @abstractmethod
    async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
        pass

    @abstractmethod
    async def take_screenshot(self, **kwargs) -> str:
        pass

    @abstractmethod
    def update_user_agent(self, user_agent: str):
        pass

    @abstractmethod
    def set_hook(self, hook_type: str, hook: Callable):
        pass


class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
    def __init__(self, use_cached_html=False, js_code=None, **kwargs):
        self.use_cached_html = use_cached_html
        self.user_agent = kwargs.get(
            "user_agent",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
        )
        self.proxy = kwargs.get("proxy")
        self.proxy_config = kwargs.get("proxy_config")
        self.headless = kwargs.get("headless", True)
        self.browser_type = kwargs.get("browser_type", "chromium")
        self.headers = kwargs.get("headers", {})
        self.sessions = {}
        self.session_ttl = 1800
        self.js_code = js_code
        self.verbose = kwargs.get("verbose", False)
        self.playwright = None
        self.browser = None
        self.sleep_on_close = kwargs.get("sleep_on_close", False)
        self.hooks = {
            'on_browser_created': None,
            'on_user_agent_updated': None,
            'on_execution_started': None,
            'before_goto': None,
            'after_goto': None,
            'before_return_html': None,
            'before_retrieve_html': None
        }

    async def __aenter__(self):
        await self.start()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self.close()

    async def start(self):
        if self.playwright is None:
            self.playwright = await async_playwright().start()
        if self.browser is None:
            browser_args = {
                "headless": self.headless,
                "args": [
                    "--disable-gpu",
                    "--no-sandbox",
                    "--disable-dev-shm-usage",
                    "--disable-blink-features=AutomationControlled",
                    "--disable-infobars",
                    "--window-position=0,0",
                    "--ignore-certificate-errors",
                    "--ignore-certificate-errors-spki-list",
                    "--disable-features=IsolateOrigins,site-per-process",
                    "--disable-site-isolation-trials",
                    "--disable-web-security",
                    "--disable-features=site-per-process",
                    "--start-maximized"
                ]
            }

            if self.proxy:
                proxy_settings = ProxySettings(server=self.proxy)
                browser_args["proxy"] = proxy_settings
            elif self.proxy_config:
                proxy_settings = ProxySettings(
                    server=self.proxy_config.get("server"),
                    username=self.proxy_config.get("username"),
                    password=self.proxy_config.get("password")
                )
                browser_args["proxy"] = proxy_settings

            if self.browser_type == "firefox":
                self.browser = await self.playwright.firefox.launch(**browser_args)
            elif self.browser_type == "webkit":
                self.browser = await self.playwright.webkit.launch(**browser_args)
            else:
                self.browser = await self.playwright.chromium.launch(**browser_args)

            await self.execute_hook('on_browser_created', self.browser)

    async def close(self):
        if self.sleep_on_close:
            await asyncio.sleep(0.5)
        if self.browser:
            await self.browser.close()
            self.browser = None
        if self.playwright:
            await self.playwright.stop()
            self.playwright = None

    def __del__(self):
        if self.browser or self.playwright:
            asyncio.get_event_loop().run_until_complete(self.close())

    def set_hook(self, hook_type: str, hook: Callable):
        if hook_type in self.hooks:
            self.hooks[hook_type] = hook
        else:
            raise ValueError(f"Invalid hook type: {hook_type}")

    async def execute_hook(self, hook_type: str, *args):
        hook = self.hooks.get(hook_type)
        if hook:
            if asyncio.iscoroutinefunction(hook):
                return await hook(*args)
            else:
                return hook(*args)
        return args[0] if args else None

    def update_user_agent(self, user_agent: str):
        self.user_agent = user_agent

    def set_custom_headers(self, headers: Dict[str, str]):
        self.headers = headers

    async def kill_session(self, session_id: str):
        if session_id in self.sessions:
            context, page, _ = self.sessions[session_id]
            await page.close()
            await context.close()
            del self.sessions[session_id]

    def _cleanup_expired_sessions(self):
        current_time = time.time()
        expired_sessions = [
            sid for sid, (_, _, last_used) in self.sessions.items()
            if current_time - last_used > self.session_ttl
        ]
        for sid in expired_sessions:
            asyncio.create_task(self.kill_session(sid))

    async def smart_wait(self, page: Page, wait_for: str, timeout: float = 30000):
        wait_for = wait_for.strip()

        if wait_for.startswith('js:'):
            # Explicitly specified JavaScript
            js_code = wait_for[3:].strip()
            return await self.csp_compliant_wait(page, js_code, timeout)
        elif wait_for.startswith('css:'):
            # Explicitly specified CSS selector
            css_selector = wait_for[4:].strip()
            try:
                await page.wait_for_selector(css_selector, timeout=timeout)
            except Error as e:
                if 'Timeout' in str(e):
                    raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{css_selector}'")
                else:
                    raise ValueError(f"Invalid CSS selector: '{css_selector}'")
        else:
            # Auto-detect based on content
            if wait_for.startswith('()') or wait_for.startswith('function'):
                # It's likely a JavaScript function
                return await self.csp_compliant_wait(page, wait_for, timeout)
            else:
                # Assume it's a CSS selector first
                try:
                    await page.wait_for_selector(wait_for, timeout=timeout)
                except Error as e:
                    if 'Timeout' in str(e):
                        raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{wait_for}'")
                    else:
                        # If it's not a timeout error, it might be an invalid selector
                        # Let's try to evaluate it as a JavaScript function as a fallback
                        try:
                            return await self.csp_compliant_wait(page, f"() => {{{wait_for}}}", timeout)
                        except Error:
                            raise ValueError(f"Invalid wait_for parameter: '{wait_for}'. "
                                             "It should be either a valid CSS selector, a JavaScript function, "
                                             "or explicitly prefixed with 'js:' or 'css:'.")

    async def csp_compliant_wait(self, page: Page, user_wait_function: str, timeout: float = 30000):
        wrapper_js = f"""
        async () => {{
            const userFunction = {user_wait_function};
            const startTime = Date.now();
            while (true) {{
                if (await userFunction()) {{
                    return true;
                }}
                if (Date.now() - startTime > {timeout}) {{
                    throw new Error('Timeout waiting for condition');
                }}
                await new Promise(resolve => setTimeout(resolve, 100));
            }}
        }}
        """

        try:
            await page.evaluate(wrapper_js)
        except TimeoutError:
            raise TimeoutError(f"Timeout after {timeout}ms waiting for condition")
        except Exception as e:
            raise RuntimeError(f"Error in wait condition: {str(e)}")

    async def process_iframes(self, page):
        # Find all iframes
        iframes = await page.query_selector_all('iframe')

        for i, iframe in enumerate(iframes):
            try:
                # Add a unique identifier to the iframe
                await iframe.evaluate(f'(element) => element.id = "iframe-{i}"')

                # Get the frame associated with this iframe
                frame = await iframe.content_frame()

                if frame:
                    # Wait for the frame to load
                    await frame.wait_for_load_state('load', timeout=30000)  # 30 seconds timeout

                    # Extract the content of the iframe's body
                    iframe_content = await frame.evaluate('() => document.body.innerHTML')

                    # Generate a unique class name for this iframe
                    class_name = f'extracted-iframe-content-{i}'

                    # Replace the iframe with a div containing the extracted content
                    _iframe = iframe_content.replace('`', '\\`')
                    await page.evaluate(f"""
                        () => {{
                            const iframe = document.getElementById('iframe-{i}');
                            const div = document.createElement('div');
                            div.innerHTML = `{_iframe}`;
                            div.className = '{class_name}';
                            iframe.replaceWith(div);
                        }}
                    """)
                else:
                    print(f"Warning: Could not access content frame for iframe {i}")
            except Exception as e:
                print(f"Error processing iframe {i}: {str(e)}")

        # Return the page object
        return page

    async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
        response_headers = {}
        status_code = None
        page = None
        context = None
        # Read session_id before the try block so the finally clause can always reference it
        session_id = kwargs.get("session_id")

        try:
            self._cleanup_expired_sessions()

            # Advanced stealth configuration
            advanced_stealth_config = """
            () => {
                // Override property descriptors
                const propertyDescriptors = {
                    hardwareConcurrency: { value: 4 + Math.floor(Math.random() * 4) },
                    deviceMemory: { value: 8 + Math.floor(Math.random() * 8) },
                    userAgent: { value: window.navigator.userAgent.replace(/\\(.*?\\)/, '(Windows NT 10.0; Win64; x64)') },
                    platform: { value: 'Win32' },
                    language: { value: 'en-US' },
                    languages: { value: ['en-US', 'en'] },
                    vendor: { value: 'Google Inc.' },
                    width: { value: 1920 },
                    height: { value: 1080 },
                    colorDepth: { value: 24 },
                    pixelDepth: { value: 24 },
                    doNotTrack: { value: null },
                    maxTouchPoints: { value: 0 },
                    webdriver: { value: undefined },
                };

                // Define properties
                for (const [key, descriptor] of Object.entries(propertyDescriptors)) {
                    try {
                        Object.defineProperty(Object.getPrototypeOf(navigator), key, {
                            ...descriptor,
                            configurable: false,
                            enumerable: true,
                            writable: false
                        });
                    } catch (e) {}
                }

                // WebGL fingerprint spoofing
                const getParameter = WebGLRenderingContext.prototype.getParameter;
                WebGLRenderingContext.prototype.getParameter = function(parameter) {
                    if (parameter === 37445) return 'Intel Open Source Technology Center';
                    if (parameter === 37446) return 'Mesa DRI Intel(R) HD Graphics 520 (Skylake GT2)';
                    return getParameter.apply(this, arguments);
                };

                // Add chrome runtime
                window.chrome = {
                    app: {
                        InstallState: {
                            DISABLED: 'disabled',
                            INSTALLED: 'installed',
                            NOT_INSTALLED: 'not_installed'
                        },
                        RunningState: {
                            CANNOT_RUN: 'cannot_run',
                            READY_TO_RUN: 'ready_to_run',
                            RUNNING: 'running'
                        },
                        getDetails: function() {},
                        getIsInstalled: function() {},
                        installState: function() {},
                        isInstalled: false,
                        runningState: function() {}
                    },
                    runtime: {
                        OnInstalledReason: {},
                        PlatformArch: {},
                        PlatformNaclArch: {},
                        PlatformOs: {},
                        RequestUpdateCheckStatus: {}
                    }
                };

                // Permissions API spoofing
                const originalQuery = window.navigator.permissions.query;
                window.navigator.permissions.query = function(parameters) {
                    return parameters.name === 'notifications' 
                        ? Promise.resolve({ state: Notification.permission })
                        : originalQuery.apply(this, arguments);
                };
            };
            """

            # Session handling
            if session_id:
                context, page, _ = self.sessions.get(session_id, (None, None, None))
                if not context:
                    context = await self.browser.new_context(
                        user_agent=self.user_agent,
                        viewport={"width": 1920, "height": 1080},
                        proxy={"server": self.proxy} if self.proxy else None,
                        accept_downloads=True,
                        java_script_enabled=True
                    )
                    await context.add_cookies([{"name": "cookiesEnabled", "value": "true", "url": url}])
                    await context.set_extra_http_headers(self.headers)
                    page = await context.new_page()
                    self.sessions[session_id] = (context, page, time.time())
            else:
                context = await self.browser.new_context(
                    user_agent=self.user_agent,
                    viewport={"width": 1920, "height": 1080},
                    proxy={"server": self.proxy} if self.proxy else None
                )
                await context.set_extra_http_headers(self.headers)

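                if (kwargs.get("override_navigator", False)
                        or kwargs.get("simulate_user", False)
                        or kwargs.get("magic", False)):
                    print("[LOG] 🕵️‍♂️ Applying advanced stealth configuration...")
                    # add_init_script() evaluates the source as-is, so a bare arrow function
                    # would be defined but never run; wrap it in a call so it executes.
                    script = advanced_stealth_config.strip().rstrip(";")
                    await context.add_init_script(f"({script})();")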

                page = await context.new_page()
                
            # Add console logging if requested
            if kwargs.get("log_console", False):
                page.on("console", lambda msg: print(f"Console: {msg.text}"))
                page.on("pageerror", lambda exc: print(f"Page Error: {exc}"))

            # Navigation
            if self.verbose:
                print(f"[LOG] 🕸️ Crawling {url} using AsyncPlaywrightCrawlerStrategy...")

            if not kwargs.get("js_only", False):
                await self.execute_hook('before_goto', page)
                response = await page.goto(url, wait_until="domcontentloaded",
                                           timeout=kwargs.get("page_timeout", 60000))
                await self.execute_hook('after_goto', page)
                status_code = response.status
                response_headers = response.headers
            else:
                status_code = 200
                response_headers = {}

            # Remove overlays only after navigation: on the initial blank page this
            # would be a no-op, because goto() replaces the document.
            await self._remove_obstacles(page)

            # Human behavior simulation if magic mode is enabled
            if kwargs.get("simulate_user", False) or kwargs.get("magic", False):
                await page.evaluate("""
                async () => {
                    // Smooth scrolling
                    const smoothScroll = async () => {
                        const height = Math.max(document.body.scrollHeight, document.documentElement.scrollHeight);
                        const scrollSteps = Math.floor(height / 100);
                        for (let i = 0; i < scrollSteps; i++) {
                            window.scrollTo({
                                top: i * 100,
                                behavior: 'smooth'
                            });
                            await new Promise(r => setTimeout(r, 100 + Math.random() * 400));
                        }
                    };

                    // Mouse movement simulation
                    const simulateMouseMovement = async () => {
                        const moves = 10 + Math.floor(Math.random() * 20);
                        for (let i = 0; i < moves; i++) {
                            const x = Math.random() * window.innerWidth;
                            const y = Math.random() * window.innerHeight;
                            const event = new MouseEvent('mousemove', {
                                view: window,
                                bubbles: true,
                                cancelable: true,
                                clientX: x,
                                clientY: y
                            });
                            document.dispatchEvent(event);
                            await new Promise(r => setTimeout(r, 50 + Math.random() * 200));
                        }
                    };

                    await smoothScroll();
                    await simulateMouseMovement();
                }
                """)

                # Add random delays
                await page.wait_for_timeout(1000 + int(2000 * random.random()))

            # Continue with the rest of your existing crawl method...
            # [Your existing code for wait_for, screenshots, etc.]

            # Get the final HTML
            html = await page.content()

            # Return the response
            return AsyncCrawlResponse(
                html=html,
                response_headers=response_headers,
                status_code=status_code
            )

        except Exception as e:
            raise Error(f"[ERROR] 🚫 crawl(): Failed to crawl {url}: {str(e)}")

        finally:
            if not session_id and page:
                await page.close()
            if not session_id and context:
                await context.close()

    async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
        semaphore_count = kwargs.get('semaphore_count', 5)  # Adjust as needed
        semaphore = asyncio.Semaphore(semaphore_count)

        async def crawl_with_semaphore(url):
            async with semaphore:
                return await self.crawl(url, **kwargs)

        tasks = [crawl_with_semaphore(url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return [result if not isinstance(result, Exception) else str(result) for result in results]

    async def _remove_obstacles(self, page: Page):
        """Remove overlays, cookie notices, and other obstacles"""
        await page.evaluate("""
        () => {
            const removeElements = () => {
                const selectors = [
                    // Cookie and consent related
                    '#didomi-host',
                    '.didomi-popup-backdrop',
                    '.didomi-popup-notice',
                    '.fc-consent-root',
                    '.fc-dialog-overlay',
                    '[class*="cookie-banner"]',
                    '[class*="consent"]',
                    '[class*="gdpr"]',

                    // Modals and overlays
                    '.modal',
                    '.overlay',
                    '[class*="paywall"]',
                    '[class*="subscribe"]',
                    '[role="dialog"]',

                    // Other potential obstacles
                    '.interstitial',
                    '.spinner',
                    '.loading',
                    '#loading',
                ];

                selectors.forEach(selector => {
                    document.querySelectorAll(selector).forEach(elem => {
                        elem.remove();
                    });
                });

                // Fix page scrolling
                document.body.style.overflow = 'auto';
                document.documentElement.style.overflow = 'auto';
                document.body.style.position = 'static';
            };

            // Initial cleanup
            removeElements();

            // Watch for dynamic additions
            const observer = new MutationObserver(removeElements);
            observer.observe(document.body, {
                childList: true,
                subtree: true
            });

            // Stop observing after 10 seconds
            setTimeout(() => observer.disconnect(), 10000);
        }
        """)
        # Small delay to ensure cleanup
        await asyncio.sleep(1)

    async def remove_overlay_elements(self, page: Page) -> None:
        """
        Removes popup overlays, modals, cookie notices, and other intrusive elements from the page.

        Args:
            page (Page): The Playwright page instance
        """
        remove_overlays_js = """
        async () => {
            // Function to check if element is visible
            const isVisible = (elem) => {
                const style = window.getComputedStyle(elem);
                return style.display !== 'none' && 
                       style.visibility !== 'hidden' && 
                       style.opacity !== '0';
            };

            // Common selectors for popups and overlays
            const commonSelectors = [
                // Close buttons first
                'button[class*="close" i]', 'button[class*="dismiss" i]', 
                'button[aria-label*="close" i]', 'button[title*="close" i]',
                'a[class*="close" i]', 'span[class*="close" i]',

                // Cookie notices
                '[class*="cookie-banner" i]', '[id*="cookie-banner" i]',
                '[class*="cookie-consent" i]', '[id*="cookie-consent" i]',

                // Newsletter/subscription dialogs
                '[class*="newsletter" i]', '[class*="subscribe" i]',

                // Generic popups/modals
                '[class*="popup" i]', '[class*="modal" i]', 
                '[class*="overlay" i]', '[class*="dialog" i]',
                '[role="dialog"]', '[role="alertdialog"]'
            ];

            // Try to click close buttons first
            for (const selector of commonSelectors.slice(0, 6)) {
                const closeButtons = document.querySelectorAll(selector);
                for (const button of closeButtons) {
                    if (isVisible(button)) {
                        try {
                            button.click();
                            await new Promise(resolve => setTimeout(resolve, 100));
                        } catch (e) {
                            console.log('Error clicking button:', e);
                        }
                    }
                }
            }

            // Remove remaining overlay elements
            const removeOverlays = () => {
                // Find elements with high z-index
                const allElements = document.querySelectorAll('*');
                for (const elem of allElements) {
                    const style = window.getComputedStyle(elem);
                    const zIndex = parseInt(style.zIndex, 10);  // NaN for 'auto', so the z-index test below is false
                    const position = style.position;

                    if (
                        isVisible(elem) && 
                        (zIndex > 999 || position === 'fixed' || position === 'absolute') &&
                        (
                            elem.offsetWidth > window.innerWidth * 0.5 ||
                            elem.offsetHeight > window.innerHeight * 0.5 ||
                            style.backgroundColor.includes('rgba') ||
                            parseFloat(style.opacity) < 1
                        )
                    ) {
                        elem.remove();
                    }
                }

                // Remove elements matching common selectors
                for (const selector of commonSelectors) {
                    const elements = document.querySelectorAll(selector);
                    elements.forEach(elem => {
                        if (isVisible(elem)) {
                            elem.remove();
                        }
                    });
                }
            };

            // Remove overlay elements
            removeOverlays();

            // Remove any fixed/sticky position elements at the top/bottom
            const removeFixedElements = () => {
                const elements = document.querySelectorAll('*');
                elements.forEach(elem => {
                    const style = window.getComputedStyle(elem);
                    if (
                        (style.position === 'fixed' || style.position === 'sticky') &&
                        isVisible(elem)
                    ) {
                        elem.remove();
                    }
                });
            };

            removeFixedElements();

            // Helper that removes empty block elements (div, p, span, etc.).
            // Note: it is defined here but never called in this script.
            const removeEmptyBlockElements = () => {
                const blockElements = document.querySelectorAll('div, p, span, section, article, header, footer, aside, nav, main, ul, ol, li, dl, dt, dd, h1, h2, h3, h4, h5, h6');
                blockElements.forEach(elem => {
                    if (elem.innerText.trim() === '') {
                        elem.remove();
                    }
                });
            };

            // Remove margin-right and padding-right from body (often added by modal scripts)
            document.body.style.marginRight = '0px';
            document.body.style.paddingRight = '0px';
            document.body.style.overflow = 'auto';

            // Wait a bit for any animations to complete
            await new Promise(resolve => setTimeout(resolve, 100));
        }
        """

        try:
            await page.evaluate(remove_overlays_js)
            await page.wait_for_timeout(500)  # Wait for any animations to complete
        except Exception as e:
            if self.verbose:
                print(f"Warning: Failed to remove overlay elements: {e}")

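    async def take_screenshot(self, page: Page) -> str:
        """
        Takes a full-page screenshot and returns it as a base64-encoded string.

        If the capture fails, returns a base64-encoded JPEG placeholder that
        contains the error message instead. The page is closed in either case.

        Args:
            page (Page): The Playwright page instance (already loaded)
        """
        try: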
            # The page is already loaded, just take the screenshot
            screenshot = await page.screenshot(full_page=True)
            return base64.b64encode(screenshot).decode('utf-8')
        except Exception as e:
            error_message = f"Failed to take screenshot: {e}"
            print(error_message)

            # Generate an error image
            img = Image.new('RGB', (800, 600), color='black')
            draw = ImageDraw.Draw(img)
            font = ImageFont.load_default()
            draw.text((10, 10), error_message, fill=(255, 255, 255), font=font)

            buffered = BytesIO()
            img.save(buffered, format="JPEG")
            return base64.b64encode(buffered.getvalue()).decode('utf-8')
        finally:
            await page.close()
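
For anyone reviewing the overlay heuristic in remove_overlay_elements above, here is a rough Python sketch of the same decision rule. It shows which elements that check would treat as overlays. The function name `looks_like_overlay` and its parameters are hypothetical and are not part of crawl4ai or Playwright; they only mirror the computed-style fields the JavaScript reads.

```python
def looks_like_overlay(position, z_index, width, height,
                       viewport_w, viewport_h,
                       background_color="rgb(255, 255, 255)",
                       opacity=1.0, visible=True):
    """Hypothetical mirror of the JS check: visible AND layered AND intrusive."""
    try:
        z = int(z_index)
    except ValueError:
        z = None  # like parseInt('auto') -> NaN, which never exceeds 999

    # "Layered": high z-index, or taken out of normal flow.
    layered = (z is not None and z > 999) or position in ("fixed", "absolute")

    # "Intrusive": covers more than half the viewport in one dimension,
    # or has a translucent background or reduced opacity.
    intrusive = (
        width > viewport_w * 0.5
        or height > viewport_h * 0.5
        or "rgba" in background_color
        or opacity < 1
    )
    return visible and layered and intrusive


# A typical modal backdrop is flagged.
modal = looks_like_overlay("fixed", "1000", 1280, 720, 1280, 720,
                           background_color="rgba(0, 0, 0, 0.5)")

# An absolutely positioned content block taller than half the viewport is
# flagged too, even though it is not a popup.
hero = looks_like_overlay("absolute", "auto", 1280, 400, 1280, 720)

print(modal, hero)
```

The second case suggests the rule can also match ordinary page content that happens to be absolutely positioned. That may be worth checking when a page comes back with less content than expected.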


YassKhazzan avatar Oct 30 '24 13:10 YassKhazzan

Hi, both 'magic' and 'fit_markdown' are experimental features and shouldn’t interfere with the library’s main functionality compared to version 0.3.71. The “magic” flag is set to false by default. I’m focusing on refining this approach to extract key parts of an article without relying on large language models, and it’ll take time to reach a stable solution.

I’d appreciate any feedback you have, especially if you encounter issues with standard crawling when the magic flag isn’t enabled. The flag is false by default, and fit_markdown shouldn’t affect the regular markdown output. Could you confirm this?

Thanks again for sharing the code suggestion. I’ll definitely review it soon and post an update here.

unclecode avatar Oct 30 '24 16:10 unclecode

Hi, version 0.3.71 is less verbose than 0.3.72 when it comes to markdown, which is why I decided to roll back to 0.3.71. The library still functions, but the magic and fit_markdown features aren’t working in this version.

YassKhazzan avatar Oct 30 '24 17:10 YassKhazzan

@YassKhazzan This is very interesting. Would you please help me by sharing the URL and the markdown output when using versions 0.3.71 and 0.3.72? I want to understand what changes made version 0.3.71 less verbose, specifically in the markdown itself, not in footnotes or magic mode. I appreciate your help, thank you.
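
One possible way to produce that comparison is sketched below, assuming the markdown from each version has already been saved as a string. It uses only the standard library's `difflib`. The helper name `diff_markdown` and the two sample strings are placeholders for illustration, not real crawl output.

```python
import difflib


def diff_markdown(old_md, new_md, old_label="0.3.71", new_label="0.3.72"):
    """Return a unified diff of two markdown strings as a list of lines."""
    return list(difflib.unified_diff(
        old_md.splitlines(),
        new_md.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    ))


# Placeholder inputs; real input would be the markdown saved from each version.
md_old = "# Title\n\nBody text.\n"
md_new = "# Title\n\n[Skip to content](#main)\n\nBody text.\n"

for line in diff_markdown(md_old, md_new):
    print(line)

# A rough verbosity measure: how many characters the newer output adds.
print("length delta:", len(md_new) - len(md_old))
```

Lines prefixed with `+` appear only in the newer output, so they point at what makes it more verbose.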

unclecode avatar Nov 03 '24 06:11 unclecode