langchain
Respect User-Specified User-Agent in WebBaseLoader
This pull request modifies the WebBaseLoader class initializer in the langchain.document_loaders.web_base module to preserve any User-Agent specified by the user in the header_template parameter. Previously, even if a User-Agent was specified in header_template, it was always overridden by a random User-Agent generated by the fake_useragent library.
With this change, if a User-Agent is specified in header_template, it will be used. A random User-Agent is generated and used only when none is specified. This gives users of the WebBaseLoader class the flexibility to set their own User-Agent when they have a specific need or preference, while still providing a reasonable default otherwise.
This change has no impact on existing users who do not specify a User-Agent, as the behavior in that case remains the same. For users who do specify a User-Agent, their choice will now be respected and used for all subsequent requests made with the WebBaseLoader class.
Fixes #4167
Before submitting
============================= test session starts ==============================
collecting ... collected 1 item

test_web_base.py::TestWebBaseLoader::test_respect_user_specified_user_agent PASSED [100%]

============================== 1 passed in 3.64s ===============================
Who can review?
Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested: @eyurtsev
@lyz1990 change looks good -- thanks for helping out!! Would you be able to resolve the linting issue and then we can merge?
Sure, I'd be glad to address the linting issue. I'll update the PR shortly. Thank you for the feedback!
Updated. The function test_respect_user_specified_user_agent in TestWebBaseLoader now has a return type annotation.
Please let me know if there are any other changes that need to be made.
please merge this fix, waiting for the fix
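While we're waiting for the merge by @eyurtsev, you can use a workaround to manually set the User-Agent in your application code. Here's an example:
from langchain.document_loaders import WebBaseLoader
# url and your_headers are placeholders; the second line overrides the random User-Agent on the loader's session
loader = WebBaseLoader(url, header_template=your_headers)
loader.session.headers["User-Agent"] = "your user agent"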
This way, you can specify your own User-Agent for the time being. Hope this helps!
Many thanks for this. Could you add a 'continue_on_failure' option like the one in llamaindex? It would be much more useful when a list of URLs is passed and you don't know which URL the parser will fail on.