Respect User-Specified User-Agent in WebBaseLoader

Open lyz1990 opened this issue 1 year ago • 3 comments

This pull request modifies the WebBaseLoader class initializer from the langchain.document_loaders.web_base module to preserve any User-Agent specified by the user in the header_template parameter. Previously, even if a User-Agent was specified in header_template, it would always be overridden by a random User-Agent generated by the fake_useragent library.

With this change, if a User-Agent is specified in header_template, it will be used. Only in the case where no User-Agent is specified will a random User-Agent be generated and used. This provides additional flexibility when using the WebBaseLoader class, allowing users to specify their own User-Agent if they have a specific need or preference, while still providing a reasonable default for cases where no User-Agent is specified.

This change has no impact on existing users who do not specify a User-Agent, as the behavior in this case remains the same. However, for users who do specify a User-Agent, their choice will now be respected and used for all subsequent requests made using the WebBaseLoader class.

Fixes #4167

Before submitting

============================= test session starts ==============================
collecting ... collected 1 item

test_web_base.py::TestWebBaseLoader::test_respect_user_specified_user_agent PASSED [100%]

============================== 1 passed in 3.64s ===============================

Who can review?

Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested: @eyurtsev

lyz1990 • May 12 '23 13:05

@lyz1990 change looks good -- thanks for helping out!! Would you be able to resolve the linting issue and then we can merge?

eyurtsev • May 12 '23 18:05

Sure, I'd be glad to address the linting issue. I'll update the PR shortly. Thank you for the feedback!

lyz1990 • May 12 '23 21:05

Updated. The function test_respect_user_specified_user_agent in TestWebBaseLoader now has a return type annotation. Please let me know if there are any other changes that need to be made.

lyz1990 • May 12 '23 21:05

Please merge this fix; I'm waiting on it.

gauravkesharwani • May 14 '23 22:05

While we're waiting for @eyurtsev to merge, you can work around the issue by setting the User-Agent manually in your application code. Here's an example:

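from langchain.document_loaders.web_base import WebBaseLoader
loader = WebBaseLoader(url, header_template=your_headers)
# Override the User-Agent on the loader's session after construction.
loader.session.headers['User-Agent'] = "your user agent"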

This way, you can specify your own User-Agent for the time being. Hope this helps!

lyz1990 • May 14 '23 23:05

Many thanks for this. Can you add a 'continue_on_failure' feature like the one in llamaindex? It would be much more useful when a list of URLs is passed and you don't know which URL the parser will fail on.

gauravkesharwani • May 15 '23 17:05
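For illustration, the 'continue_on_failure' behavior requested above could look roughly like the sketch below. It is not part of WebBaseLoader or of this PR: `load_all` and its `load_one` callback are hypothetical names, and the callback stands in for whatever fetches and parses a single URL.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger(__name__)


def load_all(
    urls: Iterable[str],
    load_one: Callable[[str], str],
    continue_on_failure: bool = False,
) -> List[str]:
    """Load every URL, optionally skipping the ones that raise."""
    documents: List[str] = []
    for url in urls:
        try:
            documents.append(load_one(url))
        except Exception as exc:
            if not continue_on_failure:
                # Default behavior: the first failure aborts the whole run.
                raise
            # Opt-in behavior: log the failing URL and move on to the next.
            logger.warning("Skipping %s: %s", url, exc)
    return documents
```

Logging the failing URL addresses the second half of the request: after the run, the warnings show which URLs could not be parsed.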