article-extractor
Fetch with proxy?
Having the ability to pass a proxy and custom headers to the extract function would be great.
@mpuz do you mean something like this?
@mpuz please check new parameter fetchOptions to see if that's exactly what you need.
Yes, that's it, thank you! Note, though, that proxy authorization is often given in the form login:password. Also, I couldn't find how to prevent the parser from removing all newlines in the parsed text.
@mpuz you are right, we should have an option to keep or remove newlines.
@mpuz since v7.2.4, this lib only collapses multiple consecutive line breaks into a single one, so we don't need a new option.
I'm considering how to let users replace the default fetch with a proxied fetch, i.e. one that is modified via a Proxy.
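One possible reading of "proxied fetch" is a fetch function wrapped in a JavaScript Proxy whose apply trap merges default options into every call. The sketch below is only an illustration under that assumption, not the library's implementation: `withDefaultOptions` is a hypothetical helper, and the echo function stands in for a real fetch so the merging is easy to inspect.

```javascript
// Hypothetical helper (not part of the library): wraps any fetch-like
// function in a Proxy so that every call gets default options merged in.
const withDefaultOptions = (baseFetch, defaults = {}) =>
  new Proxy(baseFetch, {
    // The apply trap intercepts calls such as proxiedFetch(url, options)
    apply (target, thisArg, [url, options = {}]) {
      const merged = {
        ...defaults,
        ...options,
        // per-call headers override the default headers key by key
        headers: { ...defaults.headers, ...options.headers }
      }
      return Reflect.apply(target, thisArg, [url, merged])
    }
  })

// Stand-in for fetch that simply echoes what it received
const echoFetch = (url, options) => ({ url, options })

const proxiedFetch = withDefaultOptions(echoFetch, {
  agent: 'proxy-agent-placeholder', // a real agent instance would go here
  headers: { 'user-agent': 'my-crawler/1.0' }
})

const result = proxiedFetch('https://example.com', {
  headers: { accept: 'text/html' }
})
console.log(result.options)
```

With this shape, the caller keeps the usual fetch(url, options) signature, and the proxy-specific details stay in one place.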
I changed your code like this:
retrieve.js
// utils -> retrieve
import fetch from 'cross-fetch'

const DEFAULT_USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:104.0) Gecko/20100101 Firefox/104.0'

export default async (url, opts = {}) => {
  const res = await fetch(url, {
    // custom agent (e.g. a proxy agent), used by node-fetch on Node.js
    agent: opts?.agent || null,
    // opts.headers is treated as an object of headers;
    // any 'user-agent' it contains overrides the default
    headers: {
      'user-agent': DEFAULT_USER_AGENT,
      ...opts?.headers
    }
  })

  const status = res.status
  if (status >= 400) {
    throw new Error(`Request failed with error code ${status}`)
  }

  const contentType = res.headers.get('content-type') || 'text/html'
  if (!contentType.includes('text/')) {
    throw new Error(`Content type must be "text/html", not "${contentType}"`)
  }

  return res.text()
}
And I use it like this:
import { SocksProxyAgent } from 'socks-proxy-agent'
import * as artParser from 'article-parser' // assumption: package name for v7.x

// credentials go in the proxy URL as login:password@host:port
const proxyagent = new SocksProxyAgent('socks://login:password@<proxy-host>:5500')

const getArticle = async (artUrl) => {
  // drop the query string before extracting
  artUrl = artUrl.split('?')[0]
  try {
    return await artParser.extract(artUrl, {
      agent: proxyagent
      // headers: HEADERS[4]
    })
  } catch (err) {
    console.log(err, artUrl)
    return null
  }
}
@mpuz great, thank you for your suggestion. That works well in the Node.js environment, where we can take advantage of the "Custom Agent" feature from node-fetch.
However, it seems there is no agent option for the native fetch in browsers or in the Deno environment.
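Given that limitation, one option is to attach the agent only when the code runs on Node.js and to drop it silently elsewhere. The following is a rough sketch under that assumption: `isNodeRuntime` and `buildFetchOptions` are hypothetical names, not part of the library, and the runtime check is a heuristic that may need adjusting for a particular setup.

```javascript
// Hypothetical helpers (not part of the library).

// Heuristic runtime check: a Node.js version string is present and the
// Deno global is absent. Treat this as an assumption, not a guarantee.
const isNodeRuntime = () =>
  typeof process !== 'undefined' &&
  Boolean(process.versions?.node) &&
  typeof Deno === 'undefined'

// Builds the options object passed to fetch. The agent is included only
// on Node.js, where node-fetch understands it; headers are always kept.
const buildFetchOptions = (opts = {}, isNode = isNodeRuntime()) => {
  const { agent, headers = {} } = opts
  const fetchOptions = { headers: { ...headers } }
  if (isNode && agent) {
    fetchOptions.agent = agent
  }
  return fetchOptions
}

// Usage: the second argument forces the runtime, which helps when testing
const userOpts = {
  agent: 'proxy-agent-placeholder', // a real agent instance would go here
  headers: { 'user-agent': 'my-crawler/1.0' }
}
console.log(buildFetchOptions(userOpts, true))
console.log(buildFetchOptions(userOpts, false))
```

In browsers and Deno the proxy would then have to be configured outside the fetch call, since the option is simply omitted there.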