
request_url_callback

RRaphaell opened this issue 1 year ago • 16 comments

Hello, I don't understand how request_url_callback works. I save the last URL and change the account after 50 posts, but it doesn't continue from the last post. Here is my code snippet:

from facebook_scraper import get_posts

# PAGES, accounts (an iterator over credentials) and save_post are defined elsewhere
start_url = None

def handle_pagination_url(url):
    global start_url
    start_url = url

for page in PAGES:
    while True:
        posts = get_posts(page, credentials=next(accounts),
                          options={"comments": True,
                                   "reactors": True,
                                   "posts_per_page": 200,
                                   "allow_extra_requests": True,
                                   "sharers": True},
                          page_limit=None, extra_info=True, timeout=60,
                          start_url=start_url, request_url_callback=handle_pagination_url)

        for i, post in enumerate(posts):
            save_post(post)
            if (i + 1) % 50 == 0:
                print("----- changing account -----")
                break

RRaphaell · Jul 31 '22 18:07

I save posts as a JSON file. Is there any way to use post["post_url"] as start_url?

RRaphaell · Jul 31 '22 18:07

You're requesting 200 posts per page but only reading 50 of them? So naturally you'd see the same posts on that page again, because you didn't finish reading it. It would make more sense to put your account-switching logic in handle_pagination_url, I think.

Is there any way to use post["post_url"] as start_url?

No. We can only use pagination URLs.
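
For illustration, here is a minimal sketch of persisting the pagination URL the callback gives you so a later run can pass it back as start_url. The resume_state.json filename, the helper names and the "somepage" page name are illustrative, not part of the library:

import json
from pathlib import Path

from facebook_scraper import get_posts

STATE_FILE = Path("resume_state.json")  # illustrative filename

def load_start_url():
    # resume from the last saved pagination URL, if any
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text()).get("start_url")
    return None

def handle_pagination_url(url):
    # get_posts hands this callback the URL of each page request;
    # persist it so a later run can pass it back as start_url
    STATE_FILE.write_text(json.dumps({"start_url": url}))

for post in get_posts("somepage",  # illustrative page name
                      start_url=load_start_url(),
                      request_url_callback=handle_pagination_url):
    print(post["post_id"])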

neon-ninja · Jul 31 '22 21:07

How can I change it in handle_pagination_url? As far as I know, to change accounts I need to call get_posts with different credentials, and it will generate the same page again, right?

RRaphaell · Jul 31 '22 21:07

The way you have it set up, the next time you call get_posts it will use the next account in the accounts iterable. So you just need to terminate iteration of the current get_posts generator. Raising an exception should do it.

neon-ninja · Jul 31 '22 21:07

So you mean to raise an exception in handle_pagination_url, right?

RRaphaell · Jul 31 '22 21:07

You're requesting 200 posts per page but only reading 50 of them?

No, I'm changing the account after 50 iterations while the while loop keeps going, so next(accounts) switches accounts, and I want to resume scraping from the 51st post.

RRaphaell · Jul 31 '22 21:07

So you mean to raise an exception in handle_pagination_url, right?

yes

No, I'm changing the account after 50 iterations while the while loop keeps going, so next(accounts) switches accounts, and I want to resume scraping from the 51st post.

It doesn't make any sense to set posts_per_page to 200 then

neon-ninja · Jul 31 '22 21:07

You mean something like this? Does it change the account after 50 iterations and continue scraping?

from facebook_scraper import get_posts, exceptions

# PAGES, accounts and save_post are defined elsewhere
start_url = None
post_counter = 0

def handle_pagination_url(url):
    global start_url, post_counter
    start_url = url

    post_counter += 1

    if post_counter % 50 == 0:
        raise exceptions.TemporarilyBanned

for page in PAGES:
    while True:
        posts = get_posts(page, credentials=next(accounts),
                          options={"comments": True,
                                   "reactors": True,
                                   "posts_per_page": 200,
                                   "allow_extra_requests": True,
                                   "sharers": True},
                          page_limit=None, extra_info=True, timeout=60,
                          start_url=start_url, request_url_callback=handle_pagination_url)

        for post in posts:
            save_post(post)

RRaphaell · Jul 31 '22 21:07

I get banned before 200 posts, that's why I want to change the account. I'm also using posts_per_page=200 since I found it uses fewer requests. Am I getting something wrong?

RRaphaell · Jul 31 '22 21:07

handle_pagination_url is called for each page, not each post. So more like this:

from facebook_scraper import get_posts, exceptions

start_url = None

def handle_pagination_url(url):
    global start_url
    start_url = url
    raise exceptions.TemporarilyBanned

for page in PAGES:
    while True:
        try:
            posts = get_posts(page, credentials=next(accounts),
                            options={"comments": True,
                                    "reactors": True,
                                    "posts_per_page": 50,
                                    "allow_extra_requests": True,
                                    "sharers": True},
                            page_limit=None, extra_info=True, timeout=60,
                            start_url=start_url, request_url_callback=handle_pagination_url)

            for post in posts:
                save_post(post)
        except:
            # the exception raised in the callback lands here; loop again,
            # picking up the next account and resuming from start_url
            continue
neon-ninja · Jul 31 '22 21:07

Thanks, I understand how it works now. I thought I could stop iteration at a specific post and continue later.

RRaphaell · Jul 31 '22 22:07

You can, but if you do that, when you request the page that post resides on again, you'd see posts you'd already seen.
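
If you do want to pick up mid-page, one workaround is to deduplicate against what you've already saved, e.g. keyed on post_id. A rough sketch, reusing the names from the snippets above; seen_ids would have to be loaded from wherever save_post writes to:

seen_ids = set()  # ids of posts already saved in an earlier pass

for post in get_posts(page, start_url=start_url,
                      request_url_callback=handle_pagination_url):
    if post["post_id"] in seen_ids:
        continue  # already saved before the account switch, skip it
    seen_ids.add(post["post_id"])
    save_post(post)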

neon-ninja · Jul 31 '22 22:07


Sorry for the late question, but I checked and this does not work. It raises the exception every time and changes accounts without scraping anything. I think get_posts calls handle_pagination_url at the starting point, so the account changes and then it happens again.

RRaphaell · Aug 02 '22 16:08

Maybe put your counter back in then, and check for the counter being set to 2
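
i.e. something roughly like this (a sketch of the counter idea; the callback also fires for the start URL, so only raise from the second call on):

callback_count = 0

def handle_pagination_url(url):
    global start_url, callback_count
    start_url = url
    callback_count += 1
    if callback_count >= 2:
        # the first call was for the start URL itself; now a new page
        # has been reached, so bail out and switch to the next account
        callback_count = 0
        raise exceptions.TemporarilyBanned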

neon-ninja · Aug 02 '22 21:08

You mean something like this?

from facebook_scraper import get_posts, exceptions

start_url = None
callback_lock = False

def handle_pagination_url(url):
    global start_url, callback_lock
    start_url = url

    if callback_lock:
        # second call for this account: raise to change account
        callback_lock = False
        raise exceptions.TemporarilyBanned
    else:
        # first call (the start URL itself): just record it
        callback_lock = True

for page in PAGES:
    while True:
        try:
            posts = get_posts(page, credentials=next(accounts),
                            options={"comments": True,
                                    "reactors": True,
                                    "posts_per_page": 50,
                                    "allow_extra_requests": True,
                                    "sharers": True},
                            page_limit=None, extra_info=True, timeout=60,
                            start_url=start_url, request_url_callback=handle_pagination_url)

            for post in posts:
                save_post(post)
        except:
            continue

RRaphaell · Aug 02 '22 21:08

yeah

neon-ninja · Aug 02 '22 21:08