using js_code and wait_for together is broken in 0.4.22

Open Udbhav8 opened this issue 1 year ago • 11 comments

If I pass any js_code to the crawler, it returns this error: [Screenshot 2024-12-15 at 2 36 33 PM]

I have also explained the issue here.

I think commit 0982c63 broke this. It probably just needs a null check for response in here. For now I fixed it by manually copying this file, with the null check added, into my Docker build.

Udbhav8 avatar Dec 15 '24 22:12 Udbhav8

@Udbhav8 Can you share the code snippet and URL? I can't replicate this error. Please share those with me, and I will see what is causing it. Right now the following code works well:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig

async def main():
    # Configure the browser settings
    browser_config = BrowserConfig()

    # Set run configurations, including cache mode and markdown generator
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        page_timeout=60000,
        js_code="(()=> {console.log('hi');})()",
        log_console=True,
    )

    # Pass the browser settings defined above to the crawler
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://kidocode.com/',
            config=crawl_config
        )

        if result.success:
            print("Raw Markdown Length:", len(result.markdown_v2.raw_markdown))
            print("Citations Markdown Length:", len(result.markdown_v2.markdown_with_citations))

if __name__ == "__main__":
    asyncio.run(main())

You can check this in Colab here https://colab.research.google.com/drive/1Ge5GvHwwAgM9LtIhjjJIcLGx8VXEKq2V?usp=sharing

unclecode avatar Dec 16 '24 08:12 unclecode

self.crawler_args = {
    "headless": True,
    "remove_overlay_elements": True,
    "verbose": True,
    "always_bypass_cache": True,
    "bypass_cache": True,
    "light_mode": True,
    "user_agent_mode": "random",
    "user_agent_generator_config": {
        "device_type": "mobile",
        "os_type": "android",
    },
}

js_code = """
// Function to check if next page exists and click it
const nextButton = document.querySelector('kendo-pager-next-buttons span[title="Go to the next page"]');
console.log('Next button found:', nextButton);
if (nextButton) {
    nextButton.click();
    console.log('Clicked next button');
} else {
    console.log('No next button found - might be on last page');
}
"""

wait_condition = """() => {
    // Then check if document is ready and navigation is complete
    if (document.readyState !== 'complete') {
        console.log('Document not ready yet:', document.readyState);
        return false;
    }

    // Then check for job cells
    const jobCells = document.querySelectorAll('td[kendogridcell] a[href*="/vendor/jobs/details/"]');
    console.log('Number of job cells found:', jobCells.length);
    return jobCells.length > 0;
}"""

result = await crawler.arun(
    session_id=session_id,
    url="https://app.lotusone.com/#/vendor/jobs",
    js_code=js_code,
    wait_for=f"js:{wait_condition}",
    log_console=True,
)

And these are the logs it prints: [Screenshot 2024-12-16 at 7 07 07 PM]

It's a page with a login, so I will also have to give you the cookies for it. Could you suggest a time when I can send them to you so they don't expire, and somewhere to send them?

I can also confirm that changing the code in async_crawler_strategy.py to this worked for me, but now I have to apply these changes in my Dockerfile for everything to work as expected:

                await self.execute_hook("before_goto", page, context=context)

                try:
                    response = await page.goto(
                        url,
                        wait_until=config.wait_until,
                        timeout=config.page_timeout,
                    )
                except Error as e:
                    raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{e!s}")

                await self.execute_hook("after_goto", page, context=context)
                if response:
                    status_code = response.status
                    response_headers = response.headers
                else:
                    status_code = 200
                    response_headers = {}
            else:
                status_code = 200
                response_headers = {}

Udbhav8 avatar Dec 17 '24 03:12 Udbhav8

@Udbhav8 Please try to send me a message by Thursday, 19 December, at 2 p.m. Singapore time. Maybe you can create an entry in the calendar using my email address, and then we can align and communicate together ([email protected]). Besides this, I also suggest that you try the managed browser, especially for your case. I am providing you with two links to other issues where I gave very detailed answers, and I believe they will help you a lot. Finally, I really want to continue addressing this error. I want to know the situations in which the response is None; that is interesting to me. Before I add an if/else statement to handle it, I need to know when that happens.

https://github.com/unclecode/crawl4ai/issues/341#issuecomment-2541447030 https://github.com/unclecode/crawl4ai/issues/341#issuecomment-2546023875
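
On the question of when the response can be None: Playwright's documentation for page.goto states that the call returns null (None in Python) in two cases, navigation to about:blank and navigation to the same URL with a different hash. The URL in this thread is hash-routed (#/vendor/jobs), so a reused session_id whose page is already on another #/ route of the same document might hit the second case. That is a guess, not something confirmed in this thread. A minimal sketch of the check follows; the helper name may_return_no_response and the #/login URL in the usage line are hypothetical and are not part of crawl4ai or Playwright.

```python
from urllib.parse import urldefrag


def may_return_no_response(current_url: str, target_url: str) -> bool:
    """Heuristic for the two documented cases where page.goto yields no response.

    Hypothetical helper for illustration only.
    """
    if target_url == "about:blank":
        return True
    # Split each URL into (url without fragment, fragment)
    current_base, current_frag = urldefrag(current_url)
    target_base, target_frag = urldefrag(target_url)
    # Same document, different hash: a same-document navigation
    return current_base == target_base and current_frag != target_frag


# Usage: the first URL is an assumed "already loaded" route, not taken from the thread
hash_only = may_return_no_response(
    "https://app.lotusone.com/#/login",
    "https://app.lotusone.com/#/vendor/jobs",
)
print(hash_only)
```

If this returns True for the page a session is currently on, a None response from the navigation would be expected behavior and the status code would have to be defaulted, as in the patch above.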

unclecode avatar Dec 17 '24 08:12 unclecode

Perfect, I have sent you a meeting invite for exactly that time. I will also send you an email with the storage_state at exactly 2 p.m. so you can take a look in case you aren't able to join the meeting.

Udbhav8 avatar Dec 19 '24 02:12 Udbhav8

I have sent you an email with the storage_state object from [email protected] @unclecode

Udbhav8 avatar Dec 19 '24 06:12 Udbhav8

Let me know if there is another time I can send you the tokens again so we can test synchronously @unclecode

Udbhav8 avatar Dec 21 '24 22:12 Udbhav8

@Udbhav8 I apologize for missing this conversation. Let's schedule another time now. We can plan for either Thursday, 26 December, or Friday, 27 December, at 2 p.m. Singapore time. Let me know which day works for you, and I'll create the event in the calendar. I will make sure to be available to test this together. I apologize for the previous one.

unclecode avatar Dec 25 '24 11:12 unclecode

Sorry @unclecode, I was out for the holidays. Why don't we just do this: send me an invite for a meeting at [email protected] and I will make sure I make it work, just because I don't get notifications for GitHub issue updates, haha.

Udbhav8 avatar Dec 30 '24 01:12 Udbhav8

@Udbhav8 I sent you an invitation to Discord; there we can chat and plan faster.

unclecode avatar Jan 01 '25 12:01 unclecode

Did you send it to my email, [email protected]? I haven't received anything.

Udbhav8 avatar Jan 03 '25 00:01 Udbhav8

It's done.

unclecode avatar Jan 05 '25 09:01 unclecode

@Udbhav8 I tried the following code (based on the snippet you shared) with the latest version. Both the js_code and the wait_for code executed: I could see their console logs, and there was no issue with the response. If the issue still persists, please reopen this issue. I think the site has changed and now asks for an email before it displays jobs, so you may have to change your code as well.

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, BrowserConfig
browser_config = BrowserConfig(
    headless=True,
    verbose=True,
    light_mode=True,
    # user_agent_mode="random",
    # user_agent_generator_config={
    #     "device_type": "mobile",
    #     "os_type": "android",
    # },
)

js_code = """
// Function to check if next page exists and click it
const nextButton = document.querySelector('kendo-pager-next-buttons span[title="Go to the next page"]');
console.log('Next button found:', nextButton);
if (nextButton) {
    nextButton.click();
    console.log('Clicked next button');
} else {
    console.log('No next button found - might be on last page');
}
"""

wait_condition = """() => {
// Then check if document is ready and navigation is complete
if (document.readyState !== 'complete') {
    console.log('Document not ready yet:', document.readyState);
    return false;
}

// Then check for job cells
const jobCells = document.querySelectorAll('td[kendogridcell] a[href*="/vendor/jobs/details/"]');
console.log('Number of job cells found:', jobCells.length);
return jobCells.length > 0;
}"""

async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:

        session_id = "lotusone"
        # Run the crawler on a URL
        result = await crawler.arun(
            url="https://app.lotusone.com/#/vendor/jobs",
            config=CrawlerRunConfig(
                session_id=session_id,
                remove_overlay_elements=True,
                js_code=js_code,
                wait_for=f"js:{wait_condition}",
                log_console=True,
            ),
        )
        # Print the extracted content
        print(result.markdown.raw_markdown)

asyncio.run(main())
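
The wait condition above embeds the CSS selector by hand inside a JavaScript string, which is easy to get wrong when the selector itself contains quotes. As a small sketch, assuming wait_for accepts a js:-prefixed function expression as used in this thread, the string can be generated from a selector so that json.dumps handles the quoting. The helper js_wait_for_selector and its min_count parameter are hypothetical names, not part of crawl4ai.

```python
import json


def js_wait_for_selector(selector: str, min_count: int = 1) -> str:
    """Build a js:-prefixed wait_for string (hypothetical helper).

    The generated predicate is true once the document is fully loaded and
    at least `min_count` elements match `selector`.
    """
    quoted = json.dumps(selector)  # emits a valid JS string literal
    return (
        "js:() => document.readyState === 'complete' && "
        f"document.querySelectorAll({quoted}).length >= {min_count}"
    )


# Same selector as the job-cell check in the snippet above
selector = 'td[kendogridcell] a[href*="/vendor/jobs/details/"]'
wait_for = js_wait_for_selector(selector)
print(wait_for)
```

The resulting string could then be passed as wait_for=wait_for in CrawlerRunConfig in place of f"js:{wait_condition}". It drops the console.log diagnostics from the original condition.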

aravindkarnam avatar May 08 '25 06:05 aravindkarnam