Add User-Agent Header to Jsoup Connections in Transforms
This pull request...
- [x] Fixes a bug
- [ ] Introduces a new feature
- [ ] Improves an existing feature
- [ ] Boosts code quality or performance
Description
In transforms, a source's URL is currently fetched without a user-agent header. This small PR adds .userAgent("Mozilla") to the line that fetches the URL's Document through the Jsoup connection. I hardcoded the value because the same practice appears elsewhere in the codebase. A possible improvement is to let the user-agent be specified in the configs as part of the transform.
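As a rough sketch of the possible config-based improvement (not what this PR implements), the user-agent could be resolved from a per-transform config with "Mozilla" as the fallback. The class name UserAgentResolver, the config key "userAgent", and the use of a plain Map for the transform config are all assumptions made for illustration; none of them exist in the codebase.

```java
import java.util.Map;

/**
 * Hypothetical helper, not part of the codebase or of Jsoup.
 * Picks the user-agent for a transform, falling back to a default.
 */
final class UserAgentResolver {
    // Same value this PR hardcodes.
    static final String DEFAULT_USER_AGENT = "Mozilla";

    private UserAgentResolver() {}

    /**
     * Returns the value stored under the assumed "userAgent" key,
     * or the default when the key is missing or blank.
     */
    static String resolve(Map<String, String> transformConfig) {
        String configured = transformConfig.get("userAgent");
        if (configured == null || configured.isBlank()) {
            return DEFAULT_USER_AGENT;
        }
        return configured.trim();
    }
}
```

The resolved string would then be passed to the Jsoup connection in place of the literal, e.g. Jsoup.connect(url).userAgent(UserAgentResolver.resolve(config)).get(), where url and config stand in for whatever the transform already has in scope.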
Purpose
When fetching sources in transforms, some servers reject the request (e.g. with 403 Forbidden) because the user-agent header is missing. To fix this, the user-agent is set to "Mozilla" on the Jsoup connection before the website is fetched. This lets transforms load from sources that block requests lacking a user-agent header.*
*Assuming they accept "Mozilla" as a valid user-agent value. The source I'm using does.
Relevant Issue(s)
N/A (not sure if I should have created an issue first)
Instead of a constant value here, maybe we should pick a good default and then allow overriding it within each transform.