facebook-scraper
request_url_callback
Hello, I don't understand how `request_url_callback` works. I save the last URL and change accounts after 50 posts, but it doesn't continue after the last post. Here is my code snippet:
```python
def handle_pagination_url(url):
    global start_url
    start_url = url

for page in PAGES:
    while True:
        posts = get_posts(page, credentials=next(accounts),
                          options={"comments": True,
                                   "reactors": True,
                                   "posts_per_page": 200,
                                   "allow_extra_requests": True,
                                   "sharers": True},
                          page_limit=None, extra_info=True, timeout=60,
                          start_url=start_url, request_url_callback=handle_pagination_url)
        for i, post in enumerate(posts):
            save_post()
            if (i + 1) % 50 == 0:
                print(f"----- account: {c} changing -----")
                break
```
I save posts to a JSON file. Is there any way to use `post["post_url"]` as `start_url`?
You're requesting 200 posts per page but only reading 50 of them? So naturally you'd see the same posts on that page, because you didn't finish reading it. It would make more sense to put your account-switching logic in `handle_pagination_url`, I think.
> is there any way to use post["post_url"] as start_url

No. We can only use pagination URLs.
How can I change it in `handle_pagination_url`? As far as I know, to change accounts I need to call `get_posts` with different credentials, and that will generate the same page again, right?
The way you have it set up, the next time you call `get_posts` it will use the next account in the `accounts` iterable. So you just need to terminate iteration of the current `get_posts` generator. Raising an exception should do it.
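The termination trick can be sketched without the library itself. Everything below is a stand-in for illustration: `fake_get_posts` plays the role of `get_posts` (yielding posts and reporting each next-page URL through the callback), and the local `TemporarilyBanned` class stands in for `facebook_scraper.exceptions.TemporarilyBanned`. Raising from the callback propagates out of the generator at the consumer's next iteration step, so the outer loop can resume from the saved `start_url` with the next account:

```python
import itertools

class TemporarilyBanned(Exception):
    """Stand-in for facebook_scraper.exceptions.TemporarilyBanned."""

accounts = itertools.cycle(["acct_a", "acct_b"])  # round-robin credentials
start_url = None
scraped = []

def fake_get_posts(credentials, start_url, request_url_callback):
    # Stand-in for get_posts: yields 3 posts per "page", then reports the
    # next page's URL through the callback before moving on.
    page = int(start_url.rsplit("=", 1)[1]) if start_url else 0
    while page < 3:
        for i in range(3):
            yield {"post_id": page * 3 + i, "by": credentials}
        page += 1
        request_url_callback(f"https://example.com/page?cursor={page}")

def handle_pagination_url(url):
    global start_url
    start_url = url          # remember where to resume
    raise TemporarilyBanned  # terminate the current generator

while True:
    try:
        for post in fake_get_posts(next(accounts), start_url, handle_pagination_url):
            scraped.append(post)
        break                # generator finished normally: all pages done
    except TemporarilyBanned:
        continue             # retry from start_url with the next account

print(len(scraped))  # → 9: every post scraped exactly once, no page re-read
```

The key point is that the exception surfaces *inside* the `for` loop consuming the generator, which kills that generator; because `start_url` was saved first, the retry picks up at the page boundary rather than from the beginning.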
So you mean to raise an exception in `handle_pagination_url`, right?
> You're requesting 200 posts per page but only reading 50 of them? So naturally there'd be the same posts on that page, because you didn't finish reading the posts on the page. It would make more sense to put your account switching logic in `handle_pagination_url`, I think.
No. I'm changing accounts after 50 iterations while the `while` loop continues, so `next(accounts)` switches accounts, and I want to resume scraping from the 51st post.
> so you mean to raise exceptions in handle_pagination_url right?
yes
> no. I'm changing account after 50 iterations while cycle still continues so next(accounts) changes accounts and I want to start scraping after 51 post
It doesn't make any sense to set `posts_per_page` to 200 then.
You mean something like this? Does it change the account after 50 iterations and continue scraping?
```python
def handle_pagination_url(url):
    global start_url, post_counter
    start_url = url
    post_counter += 1
    if post_counter % 50 == 0:
        raise exceptions.TemporarilyBanned

for page in PAGES:
    while True:
        posts = get_posts(page, credentials=next(accounts),
                          options={"comments": True,
                                   "reactors": True,
                                   "posts_per_page": 200,
                                   "allow_extra_requests": True,
                                   "sharers": True},
                          page_limit=None, extra_info=True, timeout=60,
                          start_url=start_url, request_url_callback=handle_pagination_url)
        for i, post in enumerate(posts):
            save_post()
```
I get banned before 200 posts, that's why I want to change accounts. I'm also using `posts_per_page=200` since I found it uses fewer requests. Am I misunderstanding something?
`handle_pagination_url` is called for each page, not each post. So more like this:
```python
def handle_pagination_url(url):
    global start_url
    start_url = url
    raise exceptions.TemporarilyBanned

for page in PAGES:
    while True:
        try:
            posts = get_posts(page, credentials=next(accounts),
                              options={"comments": True,
                                       "reactors": True,
                                       "posts_per_page": 50,
                                       "allow_extra_requests": True,
                                       "sharers": True},
                              page_limit=None, extra_info=True, timeout=60,
                              start_url=start_url, request_url_callback=handle_pagination_url)
            for post in posts:
                save_post()
        except:
            continue
```
Thanks, I understand now how it works. I thought I could stop iteration at a specific post and continue later.
You can, but if you do that, when you request the page on which that post resides again, you'd see posts you'd already seen.
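If you do resume from a page whose posts were partially saved, you can skip the repeats yourself by remembering the IDs you've already stored. A minimal sketch, assuming each scraped post carries a `post_id` field (as facebook-scraper's posts do); `save_post_once` is a hypothetical helper, not part of the library:

```python
seen_ids = set()  # IDs of posts already written out

def save_post_once(post, out):
    """Append the post to `out` unless it was already saved."""
    if post["post_id"] in seen_ids:
        return False  # duplicate from a re-requested page
    seen_ids.add(post["post_id"])
    out.append(post)
    return True

out = []
first_pass  = [{"post_id": 1}, {"post_id": 2}, {"post_id": 3}]
second_pass = [{"post_id": 2}, {"post_id": 3}, {"post_id": 4}]  # page re-requested

for post in first_pass + second_pass:
    save_post_once(post, out)

print([p["post_id"] for p in out])  # → [1, 2, 3, 4]
```

The cost of re-requesting a page is then only the wasted request, not duplicate data in the output file.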
> `handle_pagination_url` is called for each page, not each post. So more like this: [...]
Sorry for the late question, but I checked and this does not work: it raises the exception every time and changes accounts without scraping anything. I think `get_posts` calls `handle_pagination_url` at the starting point, so the account changes, and then it happens again.
Maybe put your counter back in then, and check for the counter being set to 2
You mean something like this?
```python
def handle_pagination_url(url):
    global start_url, callback_lock
    start_url = url
    if callback_lock:
        callback_lock = False
        raise exceptions.TemporarilyBanned  # to change account
    else:
        callback_lock = True

callback_lock = False

for page in PAGES:
    while True:
        try:
            posts = get_posts(page, credentials=next(accounts),
                              options={"comments": True,
                                       "reactors": True,
                                       "posts_per_page": 50,
                                       "allow_extra_requests": True,
                                       "sharers": True},
                              page_limit=None, extra_info=True, timeout=60,
                              start_url=start_url, request_url_callback=handle_pagination_url)
            for post in posts:
                save_post()
        except:
            continue
```
yeah