Tried to pass headers, but it didn't work
def initialize_session(session):
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
    session.headers['User-Agent'] = user_agent

p = pool.Pool.from_urls(urls, num_processes=num_processes, initializer=initialize_session)
p.join_all()
I tried to pass headers this way, but it didn't work. Where am I going wrong?
What is the right way to pass headers to Pool?
I have the same question. The documentation provides an example of how you are supposed to set a user-agent when working with a thread pool, but it simply doesn't work. I have also tried returning the modified session object from the initialize_session method (which the documentation claims elsewhere is mandatory), but it doesn't make any difference.
OK, after an hour of debugging, I've worked out the problem. H/t to this SO answer on pdb++ which helped me figure out what was up. You need to call it like this:
import queue

def init_session(session):
    session.headers['User-Agent'] = 'my-user-agent/0.1'
    # Return the configured session so the pool can use it.
    return session

job_queue = queue.Queue()
p = pool.Pool(job_queue).from_urls(urls, num_processes=5, initializer=init_session)
p.join_all()
You must return the session from init_session(). If you pass the initializer argument to the Pool constructor instead, all of the sessions it creates will be overwritten by the from_urls() call: the code actually loops through and creates all the sessions twice, once in Pool and again in from_urls(). If you leave the job_queue positional argument out of the Pool initialization and then try to provide an initializer in from_urls(), you'll just get an error:
p = pool.Pool().from_urls(urls, num_processes=5, initializer=init_session)
TypeError: __init__() missing 1 required positional argument: 'job_queue'
This should really be added to the documentation; at the moment it's not at all clear how this is supposed to work.
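
For anyone wondering why the return statement matters, here is a minimal sketch. It assumes the pool keeps whatever the initializer returns and uses that as the worker's session, which is what the behaviour described above suggests. FakeSession and new_session are hypothetical stand-ins written for illustration; they are not requests-toolbelt code.

```python
class FakeSession:
    """Hypothetical stand-in for requests.Session; it only carries headers."""

    def __init__(self):
        self.headers = {"User-Agent": "default-agent"}


def new_session(initializer):
    # Assumption: the pool keeps the initializer's return value,
    # not the session object it originally passed in.
    return initializer(FakeSession())


def init_without_return(session):
    session.headers["User-Agent"] = "my-user-agent/0.1"
    # No return statement, so the caller receives None.


def init_with_return(session):
    session.headers["User-Agent"] = "my-user-agent/0.1"
    return session


broken = new_session(init_without_return)
working = new_session(init_with_return)

print(broken)                         # None: the configured session is lost
print(working.headers["User-Agent"])  # my-user-agent/0.1
```

If that assumption holds, an initializer that only mutates the session in place hands None back to the pool, which would explain why the headers never reach the requests.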