facebook-scraper
Fetching posts by post search?
Hi guys, I like the work you have done with this so far. I'm not sure if this is in the pipeline yet, but I would like to enter a search term into Facebook and get back a certain number of posts per page including comments.
For example:
```python
from facebook_scraper import get_posts_by_search, set_cookies

set_cookies("cookies.txt")
search_query = "Mark Zuckerberg"

# generates a search-posts request URL ('https://www.facebook.com/search/posts/?q=Mark%20Zuckerberg')
posts = get_posts_by_search(search_query, pages=3, options={"comments": True, "posts_per_page": 5})
for p in posts:
    pass  # returns a list of posts
```
I'm happy to help work this one out as well.
See https://github.com/kevinzg/facebook-scraper/issues/59#issuecomment-830988249 for some related discussion. A pull request for this feature would be welcome
I started playing around with loading posts from search results this week and ran into the issue that Facebook returns the first result immediately and asynchronously loads more results ~1 second later. As a result, I'm only getting 1 post per request, since there is no pagination option on m.facebook.com. What is the best way to work with Facebook's continuous-loading logic?
The HTML served includes a URL for fetching more results. Search for `cursor=` to find this `see_more_pager` URL. If you make a request to that URL, you should get more results, possibly in HTML or JSON format.
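A minimal sketch of that extraction step, assuming the pagination link shows up as an `href` attribute containing `cursor=` (the HTML fragment below is invented for illustration, not real Facebook markup):

```python
import re

def extract_cursor_url(html):
    # Look for the first link whose href contains "cursor=" -- the
    # "see more" pagination URL described above. The attribute layout
    # is an assumption; adjust the pattern to the actual markup served.
    match = re.search(r'href="([^"]*cursor=[^"]*)"', html)
    return match.group(1) if match else None

# Invented fragment standing in for the see_more_pager element:
sample = '<div id="see_more_pager"><a href="/search/posts/?q=nintendo&cursor=AbCdEf">See more results</a></div>'
print(extract_cursor_url(sample))  # /search/posts/?q=nintendo&cursor=AbCdEf
```

Requesting the extracted URL (joined against the base URL) should then yield the next page of results.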
Sorry, not sure if I can follow... the HTML response from the `self.get()` method in `facebook_scraper.py` is `response = self.session.get(url=url, **self.requests_kwargs, **kwargs)`; the URL served is the same as the initial input URL and does not give me any additional results. Also, I don't find anything when I search for `cursor=` or for `see_more_pager`.
The cursor URL is in that response. Try logging `response.text`.
I'm getting the same results out of `response.text` as well. Here is a simplified version of the code:
```python
# facebook_scraper.py
def get(self, url, **kwargs):
    try:
        if not url.startswith("http"):
            url = utils.urljoin(FB_MOBILE_BASE_URL, url)
        # url is 'https://m.facebook.com/search/posts/?q=nintendo'
        response = self.session.get(url=url, **self.requests_kwargs, **kwargs)  # <- returns 1 post
        time.sleep(10)  # <- waiting before returning the page content
        # save the response content to a file for easier reading
        with open("result.html", "w") as textfile:
            textfile.write(response.text)  # <- contains the same single post
        ### MORE CODE BELOW HERE ###
```
Another thought as to why I may only be getting 1 result back is that the screen size of the request is too small to load additional content, and I somehow need to trigger a scrolldown event to reload the content.
I'd much appreciate your help, as this is a major blocker for me at the moment.
Okay, I got the `cursor=` reference now, which is only visible for me when using `FB_MBASIC_BASE_URL`. So getting multiple pages is no longer the issue, but the post scraper no longer works with `FB_MBASIC_BASE_URL`; I'll start working around that now. But I would appreciate a comment on why you have `FB_MBASIC_BASE_URL` as a constant but only use `FB_MOBILE_BASE_URL` in the current version?
Weird, I get `cursor=` even with m.facebook.com. I have used mbasic a couple of times, but found that I can get mbasic content even on m.facebook.com if I set the `noscript=1` cookie, with the `set_noscript(True)` function.
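For anyone reproducing that outside the library, a sketch of what setting that cookie looks like on a plain `requests` session (the effect of `noscript=1` is taken from the comment above; the cookie domain here is an assumption):

```python
import requests

session = requests.Session()
# noscript=1 asks Facebook to serve the script-free (mbasic-style) markup
# even on m.facebook.com -- this mirrors what set_noscript(True) toggles
# inside facebook-scraper.
session.cookies.set("noscript", "1", domain=".facebook.com")
print(session.cookies.get("noscript"))  # 1
```

All subsequent `session.get()` calls will then send the cookie automatically.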
> Another thought as to why I may only be getting 1 result back is that the screen size of the request is too small to load additional content, and I somehow need to do a scrolldown event to trigger a reload of the content.
I think you need to think a little lower level - we're not using a browser, we don't have a screen size or any way of scrolling. We're making web requests and getting HTML/JSON back.
BTW, I have seen the request headers in my browser when calling Facebook. It is sending `viewport-width: 1920`.
That's probably just for their analytics, I doubt it has any effect on the returned HTML
I need to search by hashtags:

- `www.facebook.com` is very convoluted.
- on `m.facebook.com`, search is not present.
- on `mbasic.facebook.com`, it is present and it is easier than `www.facebook.com`.
I have a WIP on that. The only thing I'm missing right now is the custom PostExtractor matching mbasic.
Check here: https://github.com/josx/facebook-scraper/commit/e81e5662b085913ad718072925428e42c8f792e7
Any advice is welcomed
Search is present on m.facebook.com, see https://m.facebook.com/search/posts/?q=search%20query for example. But it seems non-trivial to search for a hashtag, which I think is what you mean.
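One small reason hashtag search is fiddlier than plain-text search: the `#` must be percent-encoded, or everything after it is treated as a URL fragment and never sent to the server. A quick illustration (the URL pattern mirrors the search example above):

```python
from urllib.parse import quote

query = "#facebook"
# quote() percent-encodes '#' as %23, so the hashtag survives as part of
# the query string instead of being dropped as a fragment.
url = "https://m.facebook.com/search/posts/?q=" + quote(query)
print(url)  # https://m.facebook.com/search/posts/?q=%23facebook
```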
My mistake, but hashtag search is not present on m.facebook.com.
Compare
https://mbasic.facebook.com/hashtag/facebook/
https://facebook.com/hashtag/facebook/
with
https://m.facebook.com/hashtag/facebook
I think I found a way to solve this issue; however, I can't push my solution on its branch. What should I do?
Fork the project, and submit a pull request
I submitted a pull request with a new branch named `search_word`.
Hi there, have you checked the pull request on this issue?
Merged 👍
Could we search for a query in a specific group with this function?
Something like that
```python
from facebook_scraper import get_posts_by_search, set_cookies

set_cookies("cookies.txt")
search_query = "Mark Zuckerberg"
posts = get_posts_by_search(search_query, group=group_id, options={"comments": True})
for p in posts:
    pass  # returns the posts found in this specific group
```
This isn't possible on m.facebook.com
@neon-ninja well isn't it possible to search for a specific query inside a group by doing it some other way?
You can fetch all posts in the group and filter to just posts containing your desired text
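A sketch of that filtering step, using stand-in post dicts in place of the generator that `get_posts(group=...)` would yield (the `text` key matches what the library returns for a post; the sample data is invented):

```python
def filter_posts(posts, query):
    # Case-insensitive substring match on each post's text;
    # posts with no text are skipped.
    q = query.lower()
    return [p for p in posts if q in (p.get("text") or "").lower()]

# Invented stand-ins for posts scraped from a group:
sample_posts = [
    {"text": "Mark Zuckerberg announced a new feature"},
    {"text": "Unrelated group post"},
    {"text": None},
]
print(filter_posts(sample_posts, "mark zuckerberg"))
```

Because `get_posts` is a generator, the same predicate can also be applied lazily with a generator expression to avoid holding all posts in memory.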
> You can fetch all posts in the group and filter to just posts containing your desired text
Yes, but what if the group has lots of posts? I can't download all of them and then filter for matches.
Why not? This library is capable of scraping thousands or tens of thousands of posts in mere minutes.
Hi @neon-ninja, @Ethan353, does getting posts by search already work? When I try to run it, it seems to fail to extract the response.
Hi @neon-ninja, not sure if this is already implemented. I tried `get_posts_by_search` but did not get any results. Cookies have been passed too.