article-extractor

Fetch with proxy?

Open mpuz opened this issue 1 year ago • 4 comments

Having ability to pass proxy and custom headers to extract function would be great.

mpuz avatar Sep 20 '22 21:09 mpuz

@mpuz do you mean something like this?

Screenshot from 2022-09-21 09-23-24

ndaidong avatar Sep 21 '22 02:09 ndaidong

@mpuz please check new parameter fetchOptions to see if that's exactly what you need.

ndaidong avatar Sep 23 '22 10:09 ndaidong

> @mpuz please check new parameter fetchOptions to see if that's exactly what you need.

Yes, that's it! Thank you. One note: authorization for a proxy is often given in the form login:password. Also, I couldn't find how to prevent the parser from removing all newlines in the parsed text.

mpuz avatar Sep 23 '22 13:09 mpuz

@mpuz you are right, we should have an option to keep or remove newlines.

ndaidong avatar Sep 23 '22 14:09 ndaidong

@mpuz since v7.2.4, this lib only replaces multiple consecutive line breaks with a single line break, so we don't need a new option.
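As a rough illustration of that behavior (a minimal sketch, not the library's actual code; `collapseLineBreaks` is a hypothetical helper name), collapsing runs of line breaks while keeping single ones could look like this:

```javascript
// Hypothetical helper, for illustration only: it is NOT taken from the library.
// Normalizes Windows line endings, then collapses runs of 2+ line breaks into one.
const collapseLineBreaks = (text) =>
  text.replace(/\r\n/g, '\n').replace(/\n{2,}/g, '\n')

const sample = 'First paragraph.\n\n\nSecond paragraph.\nThird line.'
console.log(collapseLineBreaks(sample))
```

Single line breaks survive, so paragraph boundaries in the extracted text remain visible.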

I'm considering how to let users replace the default fetch with a proxied fetch, that is, one modified through the Proxy prototype.
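One possible reading of "proxied fetch" is a fetch function wrapped in a JavaScript `Proxy` whose `apply` trap injects default options into every call. The sketch below assumes that reading; `withDefaults` and `fakeFetch` are hypothetical names, and the stand-in fetch only records its arguments instead of touching the network:

```javascript
// Hypothetical wrapper: returns a Proxy around any fetch-like function.
// The apply trap merges `defaults` into the init object of each call,
// with per-call options taking precedence.
function withDefaults(fetchFn, defaults) {
  return new Proxy(fetchFn, {
    apply(target, thisArg, [url, init = {}]) {
      return Reflect.apply(target, thisArg, [url, { ...defaults, ...init }])
    },
  })
}

// Stand-in for a real fetch, used only to show what the wrapper passes on.
const calls = []
const fakeFetch = async (url, init) => {
  calls.push({ url, init })
  return 'ok'
}

const proxiedFetch = withDefaults(fakeFetch, { agent: 'my-agent' })
proxiedFetch('https://example.com/article', { headers: { accept: 'text/html' } })
console.log(calls[0].init)
```

A library could then accept such a function from the caller instead of always using its built-in fetch.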

ndaidong avatar Sep 24 '22 05:09 ndaidong

> @mpuz since v7.2.4, this lib only replaces multiple consecutive line breaks with a single line break, so we don't need a new option.
>
> I'm considering how to let users replace the default fetch with a proxied fetch, that is, one modified through the Proxy prototype.

I changed your code like this:

retrieve.js

// utils -> retrieve

import fetch from 'cross-fetch'

export default async (url, opts={}) => {
  const res = await fetch(url, {
    agent: opts?.agent || null,
    headers: {
      'user-agent': opts?.headers || 'Mozilla/5.0 (X11; Linux x86_64; rv:104.0) Gecko/20100101 Firefox/104.0'
    }
  })
  const status = res.status
  if (status >= 400) {
    throw new Error(`Request failed with error code ${status}`)
  }
  const contentType = res.headers.get('content-type') || 'text/html'
  if (!contentType.includes('text/')) {
    throw new Error(`Content type must be "text/html", not "${contentType}"`)
  }
  return res.text()
}
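Note that in the snippet above `opts.headers` is used directly as the `user-agent` value, so it only works when a single user-agent string is passed. If a full headers object were wanted instead, a variant could merge it over the default. This is a minimal sketch under that assumption; `buildHeaders` is a hypothetical helper, not part of the library:

```javascript
const DEFAULT_USER_AGENT =
  'Mozilla/5.0 (X11; Linux x86_64; rv:104.0) Gecko/20100101 Firefox/104.0'

// Hypothetical helper: merges caller-supplied headers over a default user-agent.
// Header names are lowercased so 'User-Agent' and 'user-agent' cannot both end up set.
function buildHeaders(opts = {}) {
  const headers = { 'user-agent': DEFAULT_USER_AGENT }
  for (const [name, value] of Object.entries(opts.headers || {})) {
    headers[name.toLowerCase()] = value
  }
  return headers
}

console.log(buildHeaders({ headers: { 'User-Agent': 'my-bot/1.0', Accept: 'text/html' } }))
```

The result could be passed as the `headers` field of the fetch options in place of the inline object.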

And I use it like this:

import { SocksProxyAgent } from 'socks-proxy-agent';

const proxyagent = new SocksProxyAgent(`socks://login:[email protected]:5500`);

const getArticle = async (artUrl) => {
  // Drop the query string before extracting
  artUrl = artUrl.split('?')[0]
  try {
    return await artParser.extract(artUrl, {
      agent: proxyagent,
      // headers: HEADERS[4]
    })
  } catch (err) {
    // Log the failure together with the URL and fall back to null
    console.log(err, artUrl)
    return null
  }
}

mpuz avatar Oct 11 '22 09:10 mpuz

@mpuz great, thank you for your suggestion. That works well in the Node.js environment, where we can take advantage of the "Custom Agent" feature from node-fetch. However, it seems there is no agent option for the native fetch in browsers or in Deno.
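One way to live with that difference might be to pass `agent` only when running under Node.js and drop it elsewhere. The following is a sketch under that assumption, not a tested cross-runtime solution; `isNode` and `toFetchInit` are hypothetical names, and the runtime check is a rough heuristic:

```javascript
// Rough heuristic: Node.js exposes process.versions.node; the extra Deno check
// guards against Deno's Node compatibility layer exposing `process` as well.
const isNode = () =>
  typeof process !== 'undefined' &&
  Boolean(process.versions && process.versions.node) &&
  typeof Deno === 'undefined'

// Hypothetical helper: builds fetch options, keeping `agent` only on Node.js,
// where a node-fetch based fetch can make use of it.
function toFetchInit(opts = {}, nodeRuntime = isNode()) {
  const init = {}
  if (opts.headers) init.headers = opts.headers
  if (opts.agent && nodeRuntime) init.agent = opts.agent
  return init
}

const opts = { agent: { name: 'placeholder-agent' }, headers: { accept: 'text/html' } }
console.log(toFetchInit(opts, true))  // as on Node.js: agent is kept
console.log(toFetchInit(opts, false)) // as in a browser or Deno: agent is dropped
```

In browsers the proxy is configured outside the page anyway, so silently dropping `agent` there loses nothing the caller could have controlled.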

ndaidong avatar Oct 11 '22 23:10 ndaidong