missing replies/comment threads

Open AcidkoHorkaCokolada opened this issue 2 years ago • 13 comments

First of all thank you for the awesome code!!

Back to my issue: how can I get replies to comments? I get all comments under posts and their reactors (haha, wow, ...), but not the replies. Is there a way to scrape all those comments with nested replies?

from pprint import pprint
from facebook_scraper import *
 
posts = get_posts('1210214419806423', 
                    pages=1, 
                    extra_info=True,
                    credentials = ("user", "pw"),
                    options={"comments": True, "reactors": True, "allow_extra_requests": True, "extra_info": True,"progress":True, "posts_per_page":1, "from_browser": True})
for post in posts:
    pprint(post)

AcidkoHorkaCokolada avatar Jun 02 '22 12:06 AcidkoHorkaCokolada

Pass your credentials (email and password) or a cookies.json file.

Moiz-khan avatar Jun 02 '22 15:06 Moiz-khan

Pass your credentials (email and password) or a cookies.json file.

I use credentials and have even tried it with cookies. I get the post and the first-level comments, but not the replies to the comments: no thread.

AcidkoHorkaCokolada avatar Jun 02 '22 18:06 AcidkoHorkaCokolada

1210214419806423 is a post, not a page. Your invocation of get_posts is therefore incorrect; you should use the post_urls argument to signify that this is a post. The code:

from facebook_scraper import get_posts, set_cookies

set_cookies("cookies.json")
post = next(get_posts(post_urls=['1210214419806423'], options={"comments": True}))
print(f"Comments: {post['comments']}, Top level comments: {len(post['comments_full'])}, Replies: {sum(len(c['replies']) for c in post['comments_full'])}")

outputs:

Comments: 108, Top level comments: 9, Replies: 66

for me.
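
To look at the reply text rather than just the counts, the nested structure can be walked directly. This is a minimal sketch, assuming each entry in comments_full is a dict with a comment_text key and a replies list of similar dicts (the keys used in this thread). The sample data is invented for illustration and no request is made:

```python
def flatten_comments(comments_full):
    """Yield (depth, text) pairs for top-level comments and their replies."""
    for comment in comments_full:
        yield 0, comment.get("comment_text")
        # "replies" may be missing or empty when no thread was fetched
        for reply in comment.get("replies") or []:
            yield 1, reply.get("comment_text")


# Invented sample shaped like post["comments_full"]
sample = [
    {"comment_text": "first", "replies": [{"comment_text": "re: first"}]},
    {"comment_text": "second", "replies": []},
]

for depth, text in flatten_comments(sample):
    print("    " * depth + str(text))
```

With a real post, passing post['comments_full'] instead of sample should list each top-level comment followed by its indented replies.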

neon-ninja avatar Jun 04 '22 04:06 neon-ninja

It returns different output for me:
Comments: 123, Top level comments: 10, Replies: 0

When I print(post), it returns only comment_text for top-level comments (10 comments out of 123). The way I see it, it pulls only the most relevant comments. Is it possible to pull all comments (comment_text) with replies?

AcidkoHorkaCokolada avatar Jun 06 '22 13:06 AcidkoHorkaCokolada

it returns different output for me. Comments: 123, Top level comments: 10, Replies: 0

Try enabling debug logging as per the issue template, and then post the logs.
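
One way to capture such logs with only the standard library is sketched below. It assumes the scraper logs through a logger named facebook_scraper (the usual logging.getLogger(__name__) convention), which is not confirmed in this thread; the issue template may recommend a different call.

```python
import logging

# Send log records to stderr with a timestamp (no-op if the root logger
# already has handlers configured)
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# Assumed logger name; adjust if the library uses another one
scraper_logger = logging.getLogger("facebook_scraper")
scraper_logger.setLevel(logging.DEBUG)
```

Run this before calling get_posts so that the fetch and extraction messages are emitted.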

When I print(post), it returns only comment_text for top-level comments (10 comments out of 123).

The comment count is approximately the sum of the top-level comments and the replies. It is only approximate because some comments get suppressed as spam.

The way I see it, it pulls only the most relevant comments. Is it possible to pull all comments (comment_text) with replies?

No, we're limited by the functionality available on m.facebook.com

neon-ninja avatar Jun 06 '22 19:06 neon-ninja

Thank you, log:

C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\facebook_scraper\facebook_scraper.py:857: UserWarning: Facebook says 'Unsupported Browser'
  warnings.warn(f"Facebook says 'Unsupported Browser'")
Got exact timestamp from publish_time: 2022-06-02 08:35:03
Fetching https://m.facebook.com/hoaxPZ/photos/a.317666309061243/1210208423140356/?type=3&source=57&refid=52&__tn__=EH-R
[pfbid023tpAp6bZ14p2bb2sq9GS1kE4zcQMbLKd61noB4AcWe6Sm2op1V6k3qWvKn2R7GJvl] Extract method extract_video didn't return anything
[pfbid023tpAp6bZ14p2bb2sq9GS1kE4zcQMbLKd61noB4AcWe6Sm2op1V6k3qWvKn2R7GJvl] Extract method extract_video_thumbnail didn't return anything
[pfbid023tpAp6bZ14p2bb2sq9GS1kE4zcQMbLKd61noB4AcWe6Sm2op1V6k3qWvKn2R7GJvl] Extract method extract_video_id didn't return anything
[pfbid023tpAp6bZ14p2bb2sq9GS1kE4zcQMbLKd61noB4AcWe6Sm2op1V6k3qWvKn2R7GJvl] Extract method extract_video_meta didn't return anything
[pfbid023tpAp6bZ14p2bb2sq9GS1kE4zcQMbLKd61noB4AcWe6Sm2op1V6k3qWvKn2R7GJvl] Extract method extract_factcheck didn't return anything
[pfbid023tpAp6bZ14p2bb2sq9GS1kE4zcQMbLKd61noB4AcWe6Sm2op1V6k3qWvKn2R7GJvl] Extract method extract_share_information didn't return anything
[pfbid023tpAp6bZ14p2bb2sq9GS1kE4zcQMbLKd61noB4AcWe6Sm2op1V6k3qWvKn2R7GJvl] Extract method extract_listing didn't return anything
[pfbid023tpAp6bZ14p2bb2sq9GS1kE4zcQMbLKd61noB4AcWe6Sm2op1V6k3qWvKn2R7GJvl] Extract method extract_with didn't return anything
Fetching up to 11 comments
Fetching https://m.facebook.com/story.php?story_fbid=pfbid023tpAp6bZ14p2bb2sq9GS1kE4zcQMbLKd61noB4AcWe6Sm2op1V6k3qWvKn2R7GJvl&id=313187412842466&locale=en_US&story_fbid=pfbid023tpAp6bZ14p2bb2sq9GS1kE4zcQMbLKd61noB4AcWe6Sm2op1V6k3qWvKn2R7GJvl&id=313187412842466&p=10&av=100008157765660&eav=AfZ0PJeg3ORKKz71gxK7NMiQC4gpdt7WYHuzi0bIhdG0FF1P2tpgXxZMXycHoLfyMmg&paipv=0&refid=52
No comments found on page
Fetching /comment/replies/?ctoken=1210214419806423_2253965094755184&count=31&curr&pc=1&isinline&initcomp&ft_ent_identifier=pfbid023tpAp6bZ14p2bb2sq9GS1kE4zcQMbLKd61noB4AcWe6Sm2op1V6k3qWvKn2R7GJvl&eav=AfbMRZHDCLXxWLYA4CYbGfZxsEGFz2gncx-d_af9s1tp1izO02Z17Z68UqMB1bV-82U&av=100008157765660&gfid=AQCJWD-akk3sOx6BCy0&refid=52&__tn__=R
Content Not Found
Fetching /comment/replies/?ctoken=1210214419806423_710454313523867&count=26&curr&pc=1&isinline&initcomp&ft_ent_identifier=pfbid023tpAp6bZ14p2bb2sq9GS1kE4zcQMbLKd61noB4AcWe6Sm2op1V6k3qWvKn2R7GJvl&eav=Afaxquv8U2G9nftRfSan71bI2xVWWSAbQqj0e8Jyt01FRBq1SPB-9fucA1hPD9H6v-c&av=100008157765660&gfid=AQBQdWblPffobUQwwus&refid=52&__tn__=R
Content Not Found
Fetching /comment/replies/?ctoken=1210214419806423_411849410806499&count=21&curr&pc=1&isinline&initcomp&ft_ent_identifier=pfbid023tpAp6bZ14p2bb2sq9GS1kE4zcQMbLKd61noB4AcWe6Sm2op1V6k3qWvKn2R7GJvl&eav=AfZTF6ZXB23RoNPJh7m5LOPLViYEQdZyLTY7iiSJLDan-pbcRXDJbdaTjKoxsREVlNY&av=100008157765660&gfid=AQAI8RBj0n1Ej1H0g9M&refid=52&__tn__=R
Content Not Found
Fetching /comment/replies/?ctoken=1210214419806423_762853668206533&count=2&curr&pc=1&isinline&initcomp&ft_ent_identifier=pfbid023tpAp6bZ14p2bb2sq9GS1kE4zcQMbLKd61noB4AcWe6Sm2op1V6k3qWvKn2R7GJvl&eav=AfanRFVvpFvkGhnxQsu_hePDxfvlSvSSrg2wIFGBTfnkzlCEMn72QTyKdvhjHSyICUM&av=100008157765660&gfid=AQAaI1jTw0m5PgBhno0&refid=52&__tn__=R
Content Not Found
Comments: 123, Top level comments: 10, Replies: 0

AcidkoHorkaCokolada avatar Jun 07 '22 07:06 AcidkoHorkaCokolada

Do you have a noscript cookie? Try updating lxml with pip install -U lxml

neon-ninja avatar Jun 07 '22 11:06 neon-ninja

I had been missing lxml. Now it seems I retrieve replies as well. I'm not sure what you mean by a noscript cookie; my cookies look like this:

[
    {
      "name": "xxxx",
      "value": "",
      "domain": ".facebook.com",
      "path": "/",
      "expires": xxxx,
      "httpOnly": true,
      "secure": true
    },
    {
      "name": "sb",
      "value": "xxxx",
      "domain": ".facebook.com",
      "path": "/",
      "expires": xxxx,
      "httpOnly": true,
      "secure": true
    },
    {
      "name": "c_user",
      "value": "xxxx",
      "domain": ".facebook.com",
      "path": "/",
      "expires": xxxx,
      "httpOnly": false,
      "secure": true
    },
    {
      "name": "wd",
      "value": "xxxx",
      "domain": ".facebook.com",
      "path": "/",
      "expires": xxxx,
      "httpOnly": false,
      "secure": true,
      "sameSite": "Lax"
    },
    {
      "name": "xs",
      "value": "xxxx",
      "domain": ".facebook.com",
      "path": "/",
      "expires": xxxx,
      "httpOnly": true,
      "secure": true
    },
    {
      "name": "fr",
      "value": "xxxx",
      "domain": ".facebook.com",
      "path": "/",
      "expires": xxxx,
      "httpOnly": true,
      "secure": true
    },
    {
      "name": "presence",
      "value": "xxxx",
      "domain": ".facebook.com",
      "path": "/",
      "expires": -1,
      "httpOnly": false,
      "secure": true
    }
  ]

AcidkoHorkaCokolada avatar Jun 08 '22 09:06 AcidkoHorkaCokolada

I had been missing lxml. Now it seems I retrieve replies as well. I'm not sure what you mean by a noscript cookie; my cookies look like this:

Hello, how did you extract your cookies? I've had the same issue for a month and still haven't resolved it, even with lxml and the snippet posted here. I get the same result you got a few posts earlier:

Comments: 123, Top level comments: 10, Replies: 0

Ianneee avatar Jun 09 '22 20:06 Ianneee

@neon-ninja what do you suggest for correctly extracting the cookies? I've exported them with EditThisCookie and also with Get cookies (using default settings) on Chrome, but it's still not working.

Ianneee avatar Jun 16 '22 21:06 Ianneee

Either should work fine. set_cookies should raise an exception if your cookies are invalid, so if it doesn't, your cookies are fine
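
As an independent sanity check on an exported file, the cookie names can be inspected before calling set_cookies. This is a sketch, assuming the export is a JSON list of objects with name and value keys (as in the listing earlier in this thread) and that c_user and xs are the login session cookies that matter. The required set is an assumption, not taken from the library, and both function names are hypothetical:

```python
import json

REQUIRED = {"c_user", "xs"}  # assumed session cookies, not confirmed by the library


def missing_cookies(cookies):
    """Return the required cookie names that are absent or have an empty value."""
    present = {c.get("name") for c in cookies if c.get("value")}
    return sorted(REQUIRED - present)


def check_cookie_file(path):
    """Load a JSON cookie export and report missing session cookies."""
    with open(path, encoding="utf-8") as fh:
        return missing_cookies(json.load(fh))


# Invented example: xs is present but empty, so it counts as missing
example = [
    {"name": "c_user", "value": "123"},
    {"name": "xs", "value": ""},
]
print(missing_cookies(example))  # prints ['xs']
```

An empty list from check_cookie_file("cookies.json") would suggest the export at least contains the expected session cookies; it says nothing about whether they are still valid.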

neon-ninja avatar Jun 16 '22 21:06 neon-ninja

Either should work fine. set_cookies should raise an exception if your cookies are invalid, so if it doesn't, your cookies are fine

I'm starting to suspect that there has been some change on fb because as you say my cookies are correct. Could you describe the steps that occur during the extraction of comments? I would like to try to solve the problem.

Ianneee avatar Jun 17 '22 09:06 Ianneee

Sure. Here's the relevant function - extract_comments_full in extractors.py

https://github.com/kevinzg/facebook-scraper/blob/10ad8b47ad15b175bb474311c3c4e7860b6da5de/facebook_scraper/extractors.py#L1143

This function handles identifying the comments area and paginating through comments. For each comment, this function calls the extract_comment_with_replies function:

https://github.com/kevinzg/facebook-scraper/blob/10ad8b47ad15b175bb474311c3c4e7860b6da5de/facebook_scraper/extractors.py#L1120

This function calls the parse_comment function to parse the top level comment:

https://github.com/kevinzg/facebook-scraper/blob/10ad8b47ad15b175bb474311c3c4e7860b6da5de/facebook_scraper/extractors.py#L1008

If there are replies (as detected by this selector - https://github.com/kevinzg/facebook-scraper/blob/10ad8b47ad15b175bb474311c3c4e7860b6da5de/facebook_scraper/extractors.py#L1127), extract_comment_with_replies calls extract_comment_replies:

https://github.com/kevinzg/facebook-scraper/blob/10ad8b47ad15b175bb474311c3c4e7860b6da5de/facebook_scraper/extractors.py#L1097

For each reply, parse_comment is called to parse the reply comment
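
The call structure described above can be sketched as follows. This is a simplified stand-in, not the library's code: the function names mirror those in extractors.py, but the signatures and the dict-based input are hypothetical, and pagination and HTML selection are omitted.

```python
def parse_comment(raw):
    """Turn one raw comment into a result dict (stand-in for HTML parsing)."""
    return {"comment_text": raw["text"]}


def extract_comment_replies(raw_replies):
    """Parse every reply with the same parser used for top-level comments."""
    return [parse_comment(r) for r in raw_replies]


def extract_comment_with_replies(raw):
    """Parse a top-level comment, then attach its replies if any are detected."""
    result = parse_comment(raw)
    raw_replies = raw.get("replies", [])  # stands in for the reply selector
    result["replies"] = extract_comment_replies(raw_replies) if raw_replies else []
    return result


def extract_comments_full(raw_comments):
    """Walk the comment area (pagination omitted) and build the nested result."""
    return [extract_comment_with_replies(raw) for raw in raw_comments]


# Invented input standing in for the comments area of a post page
page = [
    {"text": "top", "replies": [{"text": "reply"}]},
    {"text": "alone"},
]
print(extract_comments_full(page))
```

In this model, a comment whose replies cannot be fetched simply ends up with an empty replies list, which is presumably what the "Content Not Found" lines in the log earlier in this thread led to.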

Hope that helps!

neon-ninja avatar Jun 28 '22 01:06 neon-ninja