facebook-post-scraper
facebook-post-scraper copied to clipboard
Fixed _extract_post_id returning null values caused by multiple items in page with class _5pcq
_extract_post_id
uses the class _5pcq
to get the postId. This sometimes fails when there is a long list of items with that class, some of which have an effectively empty href
tag. Propose a fix where we check if the href
tag contains a URL (beginning with '/') before extracting.
Sample list of prints from item.find_all(class_="_5pcq")
below. The last item on the list has '#' in the href
field. This causes the resultant post_id
to be '#'.
To replicate:
python scraper.py -p TheStraitsTimes -l 1
<a class="_5pcq" href="/TheStraitsTimes/posts/10157114673327115?__xts__%5B0%5D=68.ARAQBh0K_NMj_mQAANUH_3XvHEDd3zLc83FLEcu4VcfDAdkM6z1PAP4Izat-cL4tQmNTMr_W875cfYO3vqYneCqXcjuRt9Q1tiYK64NKoaEUtHoyIyAjcZi6jHtUrCB60YZfPvwidqL6Aw6Vm7yIdE7amIjP-yTjI25iMi-EH7xYHzCLxG1U83eUuG-L4xX73BaqcA8MtjD6aeI-EFfelvwRVHDV5GlwwgN2cGDrcv5_--KTGPV8mNO9UFtcj4BdxBG45bb4QZrpTE-PxmdnjHAIjbauy89o3zXPRG8t5LsfThBfy5UYs0M3PcVsiJi8UJswS-_QJDDTwFMnozEp&__tn__=-R" target=""><abbr class="_5ptz timestamp livetimestamp" data-shorten="1" data-utime="1591800911" title="Wednesday, June 10, 2020 at 7:55 AM"><span class="timestampContent" id="js_6">1 hr</span></abbr></a>
<a aria-label="Public" class="uiStreamPrivacy inlineBlock fbStreamPrivacy fbPrivacyAudienceIndicator _5pcq" data-hover="tooltip" data-tooltip-content="Public" href="#" role="img"><i class="lock img sp_sG2S1OTONin sx_b0665c"></i></a>
<a class="_5pcq" href="/TheStraitsTimes/posts/10157114743227115?__xts__%5B0%5D=68.ARCAUPaCRHFNlhpvP2W3jDKjTebqzmTZplSSOjw7Q6sLY5VjEDPitgFQ1kYPbbGkhEiMNdN4ZLR2BjaCFdWQe5V3pDbTZ73LXDRBsFjuGX_WX2BFnx0r1xDjP2OXiYNx9B1YJOEmnVbPJg6M817WmCRTUmSjsCECgHKDAaLin8z7bP3s0XjTqaEXxtmINF7Beqwi4lqMhx8D8HQG5rgZqFzCjMOpo8s_glZV36SHwX1z2fFLpF4iudosAK-005XvhBBIfs66n5UZe9AsQmvd0QsMbjfVQIN_JqGY4-mn8VjW8XjZRzBKFEUCir2efcX5bAitc0MVnQ2Fdn0Pdzyj&__tn__=-R" target=""><abbr class="_5ptz timestamp livetimestamp" data-shorten="1" data-utime="1591803018" title="Wednesday, June 10, 2020 at 8:30 AM"><span class="timestampContent" id="js_9">1 hr</span></abbr></a>
<a aria-label="Public" class="uiStreamPrivacy inlineBlock fbStreamPrivacy fbPrivacyAudienceIndicator _5pcq" data-hover="tooltip" data-tooltip-content="Public" href="#" role="img"><i class="lock img sp_sG2S1OTONin sx_b0665c"></i></a>