facebook-scraper

Limit scrapes before certain date

Open ehsong opened this issue 2 years ago • 2 comments

Hello,

It's my first time using your scraper, and first I would like to thank you so much for creating it; it is super helpful! I was going through the syntax and was wondering how I can scrape only the posts from Jan 2020 to the present. The scraper seems to start from the most recent post and work back towards the past. Is this syntax correct?

df = []

for post in get_posts(account='mgchoi86',credentials=(),
                      options={"allow_extra_requests": False, "comments":False, "reactors":False, 
                               "progress":True, "posts_per_page": 200}):
  df.append(post)
if post['time'] < datetime.datetime(2020,1,1):
  break

ehsong avatar Jun 23 '22 16:06 ehsong

Well, I got the error "'<' not supported between instances of 'NoneType' and 'datetime.datetime'", so I don't think this is the right approach.

ehsong avatar Jun 23 '22 16:06 ehsong

That syntax looks mostly fine to me, but I think you probably intended to have the if post['time'] < datetime.datetime(2020,1,1): statement indented within the for loop.

I'm unable to reproduce this bug; your code works fine for me:

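import datetime
from facebook_scraper import get_posts

for post in get_posts(account='mgchoi86',
                      options={"allow_extra_requests": False, "comments":False, "reactors":False, 
                               "progress":True, "posts_per_page": 200}):
    print(post["time"])
    # Posts arrive newest first, so stop at the first one older than the cutoff
    if post['time'] < datetime.datetime(2020,1,1):
        break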

outputs:

2022-06-26 18:35:00
2022-03-07 17:10:00
2022-03-01 11:22:00
2022-03-01 10:25:00
2022-02-02 11:57:00
2022-01-29 20:55:00
2021-08-29 23:33:00
2021-08-18 20:56:00
2021-07-12 15:37:00
2021-04-26 20:59:00
2021-03-26 20:59:00
2021-02-03 23:46:00
2020-12-31 14:16:00
2020-12-27 13:33:00
2020-12-19 16:44:00
2020-12-16 11:11:00
2020-12-02 11:20:00
2020-11-17 06:50:00
2020-11-10 09:28:00
2020-11-05 17:59:00
2020-11-05 10:45:00
2020-09-30 10:57:00
2020-07-29 21:22:00
2020-07-25 12:15:00
2020-06-19 22:58:00
2020-06-19 19:04:00
2020-06-15 10:32:00
2020-06-14 17:44:00
2020-06-11 10:33:00
2020-06-04 19:53:00
2020-05-29 11:28:00
2020-03-16 09:16:00
2020-03-15 17:06:00
2020-03-02 15:15:00
2020-03-01 12:46:00
2020-02-24 14:14:00
2020-02-11 15:56:00
2019-10-15 00:59:00

Perhaps check if post["time"] is not None before trying to compare it to a datetime, like so:

if post["time"] is not None and post['time'] < datetime.datetime(2020,1,1):

neon-ninja avatar Jun 28 '22 04:06 neon-ninja
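
For anyone landing here later, the two suggestions in this thread (the break indented inside the loop, plus the None check) can be combined into one small helper. This is only a minimal sketch and not part of facebook-scraper: the name posts_since is hypothetical, the sample dicts are made-up stand-ins so it runs without network access, and it assumes posts arrive newest first, as described above. It also chooses to skip posts whose "time" is None, which is one possible policy; keeping them would be equally reasonable.

```python
import datetime
from typing import Any, Dict, Iterable, Iterator


def posts_since(posts: Iterable[Dict[str, Any]],
                cutoff: datetime.datetime) -> Iterator[Dict[str, Any]]:
    """Yield posts dated on or after `cutoff`, assuming newest-first order.

    Hypothetical helper for illustration; not part of facebook-scraper.
    """
    for post in posts:
        post_time = post.get("time")
        if post_time is None:
            # No timestamp available for this post: skip it instead of
            # comparing None to a datetime (which raises TypeError).
            continue
        if post_time < cutoff:
            # Newest-first order means every later post is older too.
            break
        yield post


# Made-up stand-in data so the sketch runs offline.
sample = [
    {"post_id": "a", "time": datetime.datetime(2022, 6, 26, 18, 35)},
    {"post_id": "b", "time": None},
    {"post_id": "c", "time": datetime.datetime(2020, 2, 11, 15, 56)},
    {"post_id": "d", "time": datetime.datetime(2019, 10, 15, 0, 59)},
    {"post_id": "e", "time": datetime.datetime(2019, 1, 1)},
]

kept = list(posts_since(sample, datetime.datetime(2020, 1, 1)))
print([p["post_id"] for p in kept])  # prints ['a', 'c']
```

With the real scraper, the same helper should be able to wrap the generator directly, e.g. posts_since(get_posts(account='mgchoi86'), datetime.datetime(2020, 1, 1)), since it only relies on each post being a dict with a "time" key.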