How to "track" async calls?

Open b-sai opened this issue 1 year ago • 1 comments

I have a series of links whose redirects I am trying to analyze. Simply following the 301 redirect is not working, so I am using Playwright.

I know I can use page.url to get the final URL in a hook, but I need a way to track the original URL and the final URL for each link I have.

How can I pass/store this metadata in the hooks?

Dec 04 '24 22:12 b-sai

Hi @b-sai, thanks for trying crawl4ai. You can pass custom metadata as kwargs when triggering your hooks, then read or modify it inside the hook callback. For example, include something like original_url=... in the execute_hook() call before navigation, then read the final URL from page.url in the after_goto or on_execution_started hook. The hooks accept arbitrary keyword arguments, so you can pass the original URL as an extra parameter or inside a dictionary.

For instance:

async def before_goto_hook(page, context=None, **kwargs):
    # kwargs might contain original_url and session_id etc.
    original_url = kwargs.get("original_url")
    # Store original_url somewhere if needed, or print
    print(f"Original URL: {original_url}")

async def after_goto_hook(page, context=None, **kwargs):
    original_url = kwargs.get("original_url")
    final_url = page.url
    print(f"Original URL: {original_url}, Final URL: {final_url}")
    # You can return these values or store them globally

crawler_strategy.set_hook('before_goto', before_goto_hook)
crawler_strategy.set_hook('after_goto', after_goto_hook)

# When calling your crawl method (inside an async function, with page and context already created):
original_url = "http://example.com"
await crawler_strategy.execute_hook('before_goto', page, context=context, original_url=original_url)
await page.goto(original_url)
await crawler_strategy.execute_hook('after_goto', page, context=context, original_url=original_url)

This way you can pass original_url, or any other metadata you need, through the execute_hook() calls. Each hook receives the kwargs, so you can store and retrieve the information across hooks.
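To keep the pairs around rather than just printing them, the after_goto hook can write into a shared mapping. The following is a minimal sketch under assumptions: RedirectTracker and FakePage are hypothetical names introduced here for illustration, the hook signature simply mirrors the one used above, and FakePage stands in for a Playwright page so the snippet runs without a browser.

```python
import asyncio


class RedirectTracker:
    """Collects original -> final URL pairs reported by a hook (hypothetical helper)."""

    def __init__(self):
        self.redirects = {}

    async def after_goto(self, page, context=None, **kwargs):
        # original_url is expected to arrive via kwargs, as in the hooks above.
        original_url = kwargs.get("original_url")
        if original_url is not None:
            self.redirects[original_url] = page.url


class FakePage:
    """Hypothetical stand-in for a Playwright page; it only exposes .url."""

    def __init__(self, url):
        self.url = url


async def demo():
    tracker = RedirectTracker()
    # Simulate the hook firing after two navigations that ended on a different URL.
    await tracker.after_goto(FakePage("https://example.com/new"),
                             original_url="http://example.com/old")
    await tracker.after_goto(FakePage("https://example.org/"),
                             original_url="http://example.org")
    return tracker.redirects


redirects = asyncio.run(demo())
print(redirects)
```

In a real crawl, the bound method would presumably be registered the same way as the functions above, for example crawler_strategy.set_hook('after_goto', tracker.after_goto), and tracker.redirects read once all links have been processed.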

Another approach is to use session_id to maintain state for each URL:

# Pass metadata through session IDs
# (store_original_url and store_final_url stand for your own helper functions)
session_id = await crawler.create_session()
result = await crawler.arun(
    url=original_url,
    session_id=session_id,
    before_goto=lambda page: store_original_url(page, original_url),
    after_goto=lambda page: store_final_url(page, page.url)
)
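If many links are processed concurrently, one way to keep each original URL attached to its final URL is to have every task return the pair itself, so nothing depends on shared state or call order. This is only a sketch: track_redirect, track_all and fake_resolve are hypothetical names, and fake_resolve is a placeholder for whatever coroutine actually navigates and reads page.url.

```python
import asyncio


async def track_redirect(original_url, resolve):
    """Return (original_url, final_url) so the pair never gets separated."""
    final_url = await resolve(original_url)
    return original_url, final_url


async def track_all(urls, resolve):
    # Each task carries its own original URL, so results can be keyed safely.
    pairs = await asyncio.gather(*(track_redirect(u, resolve) for u in urls))
    return dict(pairs)


async def fake_resolve(url):
    """Hypothetical resolver; a real one would navigate and return page.url."""
    await asyncio.sleep(0)
    return url.replace("http://", "https://")


results = asyncio.run(
    track_all(["http://a.example", "http://b.example"], fake_resolve)
)
print(results)
```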

Hopefully this provides the help you need.

Dec 09 '24 10:12 unclecode