facebook-scraper
weird get_posts() behavior/bug
I normally call get_posts() like this:
posts = get_posts(username, cookies=cookie_file, extra_info=True,
options={'page_limit': None, 'allow_extra_requests': False, 'HQ_images': False})
But for posts 10215606930220434, 10111743443272349 and 10229044620250382, this doesn't return the correct number of likes/reaction_count (it returns 0), but this:
posts = get_posts(username, cookies=cookie_file, extra_info=True,
options={'page_limit': None, 'allow_extra_requests': False, 'reactors': True, 'HQ_images': False})
does. For every other post I've seen, they both return the number of likes/reaction_count properly.
This looks to be caused by malformed HTML served by FB, resulting in lxml not putting the footer element in the article element. As a workaround, you can re-fetch these failed posts like so:
from facebook_scraper import get_posts, set_cookies

set_cookies("cookies.txt")
# Re-fetch only the posts whose counts came back as 0
posts = get_posts(post_urls=[10215606930220434, 10111743443272349, 10229044620250382], options={'allow_extra_requests': False})
for post in posts:
    print(post["likes"], post["comments"])
outputs:
74 1
224 36
42 18
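To keep the request count down, the workaround above could be wrapped so that only posts with a missing like count are re-fetched. This is a minimal sketch, not part of facebook-scraper: `refetch_missing_likes` and its `fetch_by_ids` parameter are hypothetical names, and the sketch assumes each post dict carries `post_id` and `likes` keys as in the output shown in this thread.

```python
from typing import Callable, Iterable, Iterator, List


def refetch_missing_likes(
    posts: Iterable[dict],
    fetch_by_ids: Callable[[List[str]], Iterable[dict]],
) -> Iterator[dict]:
    """Yield posts in order, swapping in a re-fetched copy when 'likes' is falsy.

    fetch_by_ids is a hypothetical callable that takes a list of post IDs
    and returns freshly scraped post dicts for them.
    """
    buffered = list(posts)
    # Posts that came back with 0 or None likes are candidates for a re-fetch.
    missing = [p["post_id"] for p in buffered if not p.get("likes")]
    refetched = {}
    if missing:
        refetched = {p["post_id"]: p for p in fetch_by_ids(missing)}
    for post in buffered:
        # Fall back to the original post if the re-fetch did not return it.
        yield refetched.get(post["post_id"], post)


# Possible wiring with facebook-scraper (not executed here):
# from facebook_scraper import get_posts, set_cookies
# set_cookies("cookies.txt")
# first_pass = get_posts("dudukovich", pages=1,
#                        options={"allow_extra_requests": False})
# fixed = refetch_missing_likes(
#     first_pass,
#     lambda ids: get_posts(post_urls=ids,
#                           options={"allow_extra_requests": False}),
# )
```

Note that a post that genuinely has 0 likes would also be re-fetched, and each re-fetched post probably still costs its own request, so this only helps when most posts come back with correct counts.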
Thanks.
Now it seems that
posts = get_posts(username, cookies=cookie_file, extra_info=True,
options={'page_limit': None, 'allow_extra_requests': False, 'HQ_images': False})
never returns non-0 values for likes/reaction_count. Does this mean Facebook is changing the HTML format overall?
I'm trying to minimize request count, so I'd like to avoid having to get posts one by one.
What username are you using? I tried with dudukovich, and still get the error.
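from pprint import pprint

from facebook_scraper import get_posts, set_cookies

set_cookies("cookies.txt")
posts = get_posts("dudukovich", pages=1, options={'allow_extra_requests': False})
for post in posts:
    # Print the first post whose like count is missing or 0, then stop
    if not post["likes"]:
        pprint(post)
        break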
outputs:
{'available': True,
'comments': 0,
'comments_full': None,
'factcheck': None,
'image': None,
'image_id': None,
'image_ids': [],
'image_lowquality': None,
'images': None,
'images_description': None,
'images_lowquality': [],
'images_lowquality_description': [],
'is_live': False,
'likes': 0,
'link': None,
'links': [],
'original_text': None,
'page_id': None,
'post_id': '10229044620250382',
'post_text': 'Finally nailed it. My work here is done.\n'
'\n'
'Wordle 400 1/6\n'
'\n'
'🟩🟩🟩🟩🟩',
'post_url': 'https://facebook.com/dudukovich/posts/10229044620250382',
'reaction_count': None,
'reactions': None,
'reactors': None,
'shared_post_id': None,
'shared_post_url': None,
'shared_text': '',
'shared_time': None,
'shared_user_id': None,
'shared_username': None,
'sharers': None,
'shares': 0,
'text': 'Finally nailed it. My work here is done.\n\nWordle 400 1/6\n\n🟩🟩🟩🟩🟩',
'time': datetime.datetime(2022, 7, 24, 8, 37),
'timestamp': None,
'user_id': 1539088457,
'user_url': 'https://facebook.com/dudukovich?lst=100068943456113%3A1539088457%3A1658786006&refid=17&_ft_=encrypted_tracking_data.0AY_nlA7aEd7s28Fqlm04ViLeX4ILbGh4rrazl3Mj6V2NvD02jBBFJgB4g5JCxg2Wxosvx-eiZpJoDJX_SFTMj-Wy8uHmSNX1PGpNmwlnSknHY1LT3psXmtLY3yOCKRxjyjCzW_7acSga1TPgOsVj8VxoiqLHcQBMpzkx0W3mU1ZxPbN5MlEjKl78LjBUcljP7ioaLcGQ-IHkIJPoJpBNqKHYC8EniRDryYOTsM-DPs2blBJ33x0Q3elahnLlVjxChOVGdunr-31mv5htJHdQpGVod8BRK_gEjNNWFcZ636FJWP4VMs66fgzGtQkYV4Tgr7Vbaju81aoc-zpTvmxSUDIAiILKZYvsV5ldRIeOf-8YSkok2TnVhFq7UkBNmwj1Hew8XdmMDb41iVkmu6ZiRhymDAdilV5JVP8bMcOoWUnOT53WIwE9l_bPK4Twb7cfR9mzLKp55f9sYxS4BDKzh_2cgwjweGfCPoHcMoTbpsPSUx96B2aGRJ3kDvJDKPfspQsfOmP9o6IMquqtUJJogFwtApVtzLqXc6owhZr4s4QD3riTNgchT8zmBbTSWODcvXRDHZm7-CJGWEthfjGQ-SCFqM5q8go6iNhigu8VlzC6_0Sc&__tn__=C-R',
'username': 'Jim Dudukovich',
'video': None,
'video_duration_seconds': None,
'video_height': None,
'video_id': None,
'video_quality': None,
'video_size_MB': None,
'video_thumbnail': None,
'video_watches': None,
'video_width': None,
'w3_fb_url': None,
'was_live': False,
'with': None}
Yes, I do, too - on all usernames now. Until the day I posted the first report in this thread, it worked for all usernames on almost all posts, and now it just doesn't work.
I don't want to get reactors unless I know that there have been new ones because I am trying to minimize requests, so I need to fetch the reaction counts first.
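The "counts first, reactors only when something changed" approach described here could look roughly like the following. This is a sketch under assumptions: `posts_with_changed_counts` and `previous_counts` (a mapping from post ID to the count stored on an earlier run) are hypothetical names, and the fallback from `reaction_count` to `likes` reflects the output in this thread, where `reaction_count` can be None.

```python
from typing import Iterable, List, Mapping, Optional


def posts_with_changed_counts(
    posts: Iterable[dict],
    previous_counts: Mapping[str, Optional[int]],
) -> List[str]:
    """Return IDs of posts whose reaction count differs from the stored one."""
    changed = []
    for post in posts:
        # Prefer reaction_count; fall back to likes when it is not set.
        count = post.get("reaction_count")
        if count is None:
            count = post.get("likes") or 0
        # Posts never seen before (no stored count) are treated as changed.
        if previous_counts.get(post["post_id"]) != count:
            changed.append(post["post_id"])
    return changed
```

The returned IDs could then be passed to a second call such as `get_posts(post_urls=changed_ids, options={'reactors': True})`, so the extra reactor requests are only spent on posts whose counts moved.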
I see. Try https://github.com/kevinzg/facebook-scraper/commit/c4ffccc681b61372f7bf2d85833ac1873c98ed80
With this commit and this test code:
from facebook_scraper import get_posts, set_cookies

set_cookies("cookies.txt")
posts = get_posts("dudukovich", pages=1, options={'allow_extra_requests': False})
for post in posts:
    print(post["post_id"], post["likes"], post["comments"])
I get:
10229047171274156 18 2
10229044620250382 42 18
10228996395204786 38 29
10228978022425478 7 3
10228975542723487 23 2
10228952791554722 30 6
10228940462806511 17 2
10228840341583543 13 5
10228778696482454 13 2
10228776189019769 77 19
Yes, thanks, this works great.
It doesn't seem to set reaction_count. I saw a related comment of yours recently, but I didn't completely understand it. When does reaction_count get set?
Also, it looks like the pages parameter now works for profiles? Am I remembering correctly that this didn't use to be the case?
Thanks, again.
https://github.com/kevinzg/facebook-scraper/commit/40c1e8a6f81d7a89256abaa0811b301875e1a6d8 should set reaction_count. Usually this would only get set if you set the "reactions" option, but that would involve an extra request (to something like https://m.facebook.com/ufi/reaction/profile/browser/?ft_ent_identifier=10229047171274156).
I think you're thinking of the posts_per_page parameter.