crawlee-python
crawlee-python copied to clipboard
feat: support custom profile in playwright
if user_data_dir option is found in browser_option, then the launch function used is launch_persistent_context instead of launch and the user_data_dir option is passed to playwright
Description
it makes possible to use custom user provided profile directory with playwright without changing the api
Issues
- Closes: #399
Testing
def create_crawler(profile: Path):
browser_options = {
'user_data_dir': profile
}
plugin = PlaywrightBrowserPlugin(browser_type='chromium', browser_options=browser_options)
browser_pool = BrowserPool(plugins=[plugin])
crawler = PlaywrightCrawler(browser_pool=browser_pool)
return crawler
Checklist
- [ ] CI passed
it would be better to add a Literal check for the browser with typing.get_args, but this would unify the typevar used in __init__ so I think it's better to have a new PR
Update: ok, I don't like the idea of leaving this commit with an arbitrary getattr, so I added a tuple check for now
Hello @sherpya and thanks for the contribution! We should have time to review it next week.
Hi, I can make requested changes, perhaps I have no idea how to make a test for a similiar feature, do I just need to see if the call work? Or I may check if somehow the profile is created in the given directory
Hi, I can make requested changes, perhaps I have no idea how to make a test for a similiar feature, do I just need to see if the call work? Or I may check if somehow the profile is created in the given directory
The bare minimum would be to just run it and make sure it doesn't crash :slightly_smiling_face: But if you could make a profile directory somewhere and find a way to check that it is actually being used by the headless browser, that would be awesome!
Hi, I've got a problem, launch with profile returns a BrowserContext instead of a Browser and browser_controller._browser is <BrowserContext browser=None>, browser_controller.is_browser_connected fails because of missing property in _browser
I've updated the PR but commented out the is_connected check for now
I've tried with something like (also with assignment), (not proud of):
if isinstance(browser, BrowserContext):
setattr(browser, 'is_connected', lambda: not browser._impl_obj._close_was_called) # noqa: SLF001, B010
but mypy says:
assert not browser_controller.is_browser_connected
is unreachable
I think it should be implemented upstream, but profile dir support would be very nice to have before (in some way)
The fact that launch_persistent_context returns a BrowserContext while the launch returns a Browser is a problem and our current abstraction is not designed to handle this difference. Addressing it would require deeper intervention into it, which I'm hesitant about.
Isn't there a way to extract a Browser instance from a BrowserContext? Alternatively, isn't there another approach to using a custom profile while still obtaining the Browser instance? I would rather go this way.
I'll close for now, I will open a new PR if I get a decent way to implement it