facebook-scraper

Fetching posts by post search?

Open exnerfelix opened this issue 2 years ago • 24 comments

Hi guys, I like the work you have done with this so far. I'm not sure if this is in the pipeline yet, but I would like to enter a search term into Facebook and get back a certain number of posts per page including comments.

For example:

from facebook_scraper import get_posts_by_search, set_cookies

set_cookies("cookies.txt")
search_query = "Mark Zuckerberg"

# generating url search posts request ('https://www.facebook.com/search/posts/?q=Mark%20Zuckerberg')
posts = get_posts_by_search(search_query, pages=3, options={"comments": True, "posts_per_page": 5})

for p in posts:
    pass # get a list of posts as a return

I'm happy to help to work this one out as well.

exnerfelix avatar Aug 03 '21 16:08 exnerfelix

See https://github.com/kevinzg/facebook-scraper/issues/59#issuecomment-830988249 for some related discussion. A pull request for this feature would be welcome

neon-ninja avatar Aug 03 '21 20:08 neon-ninja

I started playing around with loading posts from search results this week and ran into an issue: Facebook returns the first result immediately in the response and loads more results asynchronously about 1 second later. As a result, I only get 1 post per request, since there is no pagination option on m.facebook.com. What is the best way to work with Facebook's continuous loading logic?

exnerfelix avatar Aug 25 '21 23:08 exnerfelix

The HTML served includes a URL for fetching more results. Search for cursor= to find this see_more_pager URL. If you make a request to that URL, you should get more results, possibly in HTML or JSON format.
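
A minimal sketch of that lookup, assuming the pager link shows up in the HTML as an href containing cursor=. The sample markup and the helper name find_cursor_url are made up for illustration; a real response will look different and may need a different pattern.

```python
import html
import re
from typing import Optional

# Hypothetical fragment standing in for response.text; real markup differs.
SAMPLE_HTML = (
    '<div id="see_more_pager">'
    '<a href="/search/posts/?q=nintendo&amp;cursor=ABC123&amp;refid=46">See more</a>'
    "</div>"
)

# Matches any href whose value contains "cursor=".
CURSOR_HREF = re.compile(r'href="([^"]*cursor=[^"]*)"')


def find_cursor_url(page_html: str) -> Optional[str]:
    """Return the first link containing cursor=, or None if there is none."""
    match = CURSOR_HREF.search(page_html)
    if match is None:
        return None
    # HTML attributes escape "&" as "&amp;", so unescape before reusing the URL.
    return html.unescape(match.group(1))


print(find_cursor_url(SAMPLE_HTML))
```

If the helper returns None, the response probably does not contain a pager link at all, which is worth checking before debugging anything else.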

neon-ninja avatar Aug 26 '21 00:08 neon-ninja

Sorry, I'm not sure I follow. The HTML response from the self.get() method in facebook_scraper.py comes from response = self.session.get(url=url, **self.requests_kwargs, **kwargs). The URL served is the same as the initial input URL and does not give me any additional results.

Also, I don't find anything when I search for cursor= or for see_more_pager.

exnerfelix avatar Aug 26 '21 02:08 exnerfelix

The cursor URL is in that response. Try logging response.text.

neon-ninja avatar Aug 26 '21 04:08 neon-ninja

I'm getting the same results out of response.text as well. Here is a simplified version of the code:

    # facebook_scraper.py
    def get(self, url, **kwargs):
        try:
            if not url.startswith("http"):
                url = utils.urljoin(FB_MOBILE_BASE_URL, url)
            
            # url is 'https://m.facebook.com/search/posts/?q=nintendo'
            response = self.session.get(url=url, **self.requests_kwargs, **kwargs) # <- returns 1 post
            
            time.sleep(10) # <- waiting before returning to the page content
            # save the response to a file so its content is easier to read and inspect
            with open("result.html", "w", encoding="utf-8") as textfile:
                textfile.write(response.text) # <- still contains the same single post

            ### MORE CODE BELOW HERE ###

Another thought as to why I may only be getting 1 result back is that the screen size of the request is too small to load additional content, and I somehow need to do a scrolldown event to trigger a reload of the content.

I'd much appreciate your help, as this is a major blocker for me at the moment.

exnerfelix avatar Aug 26 '21 16:08 exnerfelix

Okay, I found the cursor= reference now, which is only visible for me when using FB_MBASIC_BASE_URL. So getting multiple pages is no longer the issue, but the page scraper no longer works with FB_MBASIC_BASE_URL; I will start trying to work around that now. I would also appreciate a comment on why FB_MBASIC_BASE_URL exists as a constant while only FB_MOBILE_BASE_URL is used in the current version.

exnerfelix avatar Aug 26 '21 20:08 exnerfelix

Weird, I get cursor= even with m.facebook.com. I have used mbasic a couple times, but found that I can get mbasic content even on m.facebook.com if I set the noscript=1 cookie, with the set_noscript(True) function
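
At the request level, that cookie is just a noscript=1 entry in the Cookie header. A minimal sketch using only the standard library, with a made-up helper name (build_noscript_request); it builds the request without sending it, and it is not how the library itself sets the cookie:

```python
from urllib.request import Request


def build_noscript_request(url: str) -> Request:
    """Build (but do not send) a request carrying the noscript=1 cookie."""
    # Assumption: no other cookies are needed for this illustration.
    return Request(url, headers={"Cookie": "noscript=1"})


req = build_noscript_request("https://m.facebook.com/search/posts/?q=nintendo")
print(req.get_header("Cookie"))
```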

> Another thought as to why I may only be getting 1 result back is that the screen size of the request is too small to load additional content, and I somehow need to do a scrolldown event to trigger a reload of the content.

I think you need to think a little lower level - we're not using a browser, we don't have a screen size or any way of scrolling. We're making web requests and getting HTML/JSON back.
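
In other words, "scrolling" becomes a loop of plain requests: fetch a page, find the link to the next one, fetch that. A rough sketch of such a loop, with the fetching and link-finding passed in as functions so it runs without network access. All names here (iter_pages, fake_fetch, fake_find_next) and the fake page data are hypothetical:

```python
from typing import Callable, Dict, Iterator, Optional


def iter_pages(
    start_url: str,
    fetch: Callable[[str], str],
    find_next_url: Callable[[str], Optional[str]],
    max_pages: int = 3,
) -> Iterator[str]:
    """Yield each page body, following next-page links up to max_pages."""
    url: Optional[str] = start_url
    for _ in range(max_pages):
        if url is None:
            break  # the last page had no link to further results
        body = fetch(url)
        yield body
        url = find_next_url(body)


# Fake two-page "site" standing in for real HTTP responses.
FAKE_PAGES: Dict[str, str] = {
    "/search?q=x": "post 1 | next=/search?q=x&cursor=2",
    "/search?q=x&cursor=2": "post 2",
}


def fake_fetch(url: str) -> str:
    return FAKE_PAGES[url]


def fake_find_next(body: str) -> Optional[str]:
    # The fake pages mark their next link with "next="; real HTML would be parsed.
    _, sep, next_url = body.partition("next=")
    return next_url if sep else None


pages = list(iter_pages("/search?q=x", fake_fetch, fake_find_next))
print(len(pages))
```

With a real session, fetch would wrap the HTTP call and find_next_url would look for the cursor= link in the returned HTML.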

neon-ninja avatar Aug 26 '21 21:08 neon-ninja

BTW I have seen request headers on my browser when calling facebook. It is sending viewport-width: 1920

josx avatar Nov 26 '21 20:11 josx

That's probably just for their analytics, I doubt it has any effect on the returned HTML

neon-ninja avatar Nov 28 '21 20:11 neon-ninja

I need to search by hashtags:

  • www.facebook.com is very convoluted.
  • Search is not present on m.facebook.com.
  • Search is present on mbasic.facebook.com, and it is easier to work with than www.facebook.com.

I have a WIP on that. The only thing still missing is a custom PostExtractor that matches mbasic.

Check here: https://github.com/josx/facebook-scraper/commit/e81e5662b085913ad718072925428e42c8f792e7

Any advice is welcome.

josx avatar Nov 29 '21 20:11 josx

Search is present on m.facebook.com, see https://m.facebook.com/search/posts/?q=search%20query for example. But it seems non-trivial to search for a hashtag, which I think is what you mean.
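
For reference, a small sketch of how a search URL of that shape can be built from a free-text query. The helper name build_search_url is made up; only the URL pattern comes from the example above.

```python
from urllib.parse import quote

SEARCH_BASE = "https://m.facebook.com/search/posts/"


def build_search_url(query: str) -> str:
    """Percent-encode the query and append it as the q parameter."""
    # safe="" makes quote() encode every reserved character, e.g. space -> %20.
    return f"{SEARCH_BASE}?q={quote(query, safe='')}"


print(build_search_url("search query"))
```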

neon-ninja avatar Nov 29 '21 20:11 neon-ninja

My mistake, but hashtag search is not present on m.facebook.com.

Compare https://mbasic.facebook.com/hashtag/facebook/ and https://facebook.com/hashtag/facebook/ with https://m.facebook.com/hashtag/facebook

josx avatar Nov 29 '21 20:11 josx

I think I found a way to solve this issue, but I can't push my solution on its own branch. What should I do?

Ethan353 avatar Dec 06 '21 01:12 Ethan353

Fork the project, and submit a pull request

neon-ninja avatar Dec 06 '21 04:12 neon-ninja

I opened a pull request from a new branch named search_word.

Ethan353 avatar Dec 06 '21 06:12 Ethan353

Hi there, have you checked the pull request for this issue?

Ethan353 avatar Dec 19 '21 11:12 Ethan353

Merged 👍

neon-ninja avatar Dec 19 '21 23:12 neon-ninja

Could we search for a query in a specific group with this function?

Something like this:

from facebook_scraper import get_posts_by_search, set_cookies

set_cookies("cookies.txt")
search_query = "Mark Zuckerberg"

posts = get_posts_by_search(search_query, group=group_id, options={"comments": True})

for p in posts:
    pass # get a list of posts found in this specific group as a return

gamcoh avatar Mar 04 '22 10:03 gamcoh

This isn't possible on m.facebook.com

neon-ninja avatar Mar 29 '22 23:03 neon-ninja

@neon-ninja well isn't it possible to search for a specific query inside a group by doing it some other way?

gamcoh avatar Mar 30 '22 07:03 gamcoh

You can fetch all posts in the group and filter to just posts containing your desired text
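
A sketch of that filtering step, assuming each post is a dict with a "text" key. The sample posts and the helper name filter_posts are made up so the snippet runs on its own; with the library, the iterable would come from something like get_posts(group=...).

```python
from typing import Dict, Iterable, Iterator


def filter_posts(posts: Iterable[Dict], needle: str) -> Iterator[Dict]:
    """Lazily yield posts whose text contains needle, ignoring case."""
    needle = needle.casefold()
    for post in posts:
        text = post.get("text") or ""  # tolerate missing or None text
        if needle in text.casefold():
            yield post


# Hypothetical stand-in for posts scraped from a group.
sample_posts = [
    {"post_id": "1", "text": "Mark Zuckerberg announced a new feature"},
    {"post_id": "2", "text": "Weekly group rules reminder"},
    {"post_id": "3", "text": None},
    {"post_id": "4", "text": "Thoughts on mark zuckerberg's keynote?"},
]

matches = [p["post_id"] for p in filter_posts(sample_posts, "Mark Zuckerberg")]
print(matches)
```

Because filter_posts is a generator, posts are checked one at a time as they arrive, so non-matching posts never need to be kept in memory.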

neon-ninja avatar Mar 30 '22 07:03 neon-ninja

> You can fetch all posts in the group and filter to just posts containing your desired text

Yes, but what if the group has lots of posts? I can't download all of them and then filter by the match.

gamcoh avatar Mar 30 '22 08:03 gamcoh

Why not? This library is capable of scraping thousands or tens of thousands of posts in mere minutes.

neon-ninja avatar Mar 30 '22 20:03 neon-ninja

Hi @neon-ninja, @Ethan353, does getting posts by search already work? When I try to run it, it seems to fail to extract the response.

dckkk avatar Jan 15 '23 18:01 dckkk

Hi @neon-ninja, I'm not sure if this is already implemented. I tried get_posts_by_search but did not get any results. Cookies have been passed too.

ericleong86 avatar Feb 08 '23 01:02 ericleong86