crawlee-python icon indicating copy to clipboard operation
crawlee-python copied to clipboard

feat: support custom profile in playwright

Open sherpya opened this issue 1 year ago • 4 comments

if user_data_dir option is found in browser_option, then the launch function used is launch_persistent_context instead of launch and the user_data_dir option is passed to playwright

Description

it makes possible to use custom user provided profile directory with playwright without changing the api

Issues

  • Closes: #399

Testing

def create_crawler(profile: Path):
    browser_options = {
        'user_data_dir': profile
    }

    plugin = PlaywrightBrowserPlugin(browser_type='chromium', browser_options=browser_options)
    browser_pool = BrowserPool(plugins=[plugin])
    crawler = PlaywrightCrawler(browser_pool=browser_pool)

    return crawler

Checklist

  • [ ] CI passed

sherpya avatar Aug 08 '24 08:08 sherpya

it would be better to add a Literal check for the browser with typing.get_args, but this would unify the typevar used in __init__ so I think it's better to have a new PR

Update: ok, I don't like the idea of leaving this commit with an arbitrary getattr, so I added a tuple check for now

sherpya avatar Aug 08 '24 08:08 sherpya

Hello @sherpya and thanks for the contribution! We should have time to review it next week.

janbuchar avatar Aug 12 '24 15:08 janbuchar

Hi, I can make requested changes, perhaps I have no idea how to make a test for a similiar feature, do I just need to see if the call work? Or I may check if somehow the profile is created in the given directory

sherpya avatar Aug 22 '24 09:08 sherpya

Hi, I can make requested changes, perhaps I have no idea how to make a test for a similiar feature, do I just need to see if the call work? Or I may check if somehow the profile is created in the given directory

The bare minimum would be to just run it and make sure it doesn't crash :slightly_smiling_face: But if you could make a profile directory somewhere and find a way to check that it is actually being used by the headless browser, that would be awesome!

janbuchar avatar Aug 22 '24 10:08 janbuchar

Hi, I've got a problem, launch with profile returns a BrowserContext instead of a Browser and browser_controller._browser is <BrowserContext browser=None>, browser_controller.is_browser_connected fails because of missing property in _browser

I've updated the PR but commented out the is_connected check for now

sherpya avatar Aug 23 '24 12:08 sherpya

I've tried with something like (also with assignment), (not proud of):

        if isinstance(browser, BrowserContext):
            setattr(browser, 'is_connected', lambda: not browser._impl_obj._close_was_called)  # noqa: SLF001, B010

but mypy says:

assert not browser_controller.is_browser_connected

is unreachable

I think it should be implemented upstream, but profile dir support would be very nice to have before (in some way)

sherpya avatar Aug 24 '24 13:08 sherpya

The fact that launch_persistent_context returns a BrowserContext while the launch returns a Browser is a problem and our current abstraction is not designed to handle this difference. Addressing it would require deeper intervention into it, which I'm hesitant about.

Isn't there a way to extract a Browser instance from a BrowserContext? Alternatively, isn't there another approach to using a custom profile while still obtaining the Browser instance? I would rather go this way.

vdusek avatar Aug 27 '24 13:08 vdusek

I'll close for now, I will open a new PR if I get a decent way to implement it

sherpya avatar Aug 30 '24 14:08 sherpya