WebSearcher
Parsing Exception 'NoneType' object has no attribute 'children'
Python version: Python 3.8.0
WebSearcher==0.2.12
Hi, I'm running this simple Python script, saved as searcher.py:
import WebSearcher as ws
# Initialize crawler with defaults (headers, logs, ssh tunnels)
se = ws.SearchEngine()
vars(se)
# Conduct Search
se.search('immigration')
# Parse Results
se.parse_results()
Command: python3 searcher.py
Trace
2022-01-31 18:53:59,871 | 54156 | INFO | WebSearcher.searchers | 200 | immigration
2022-01-31 18:53:59,958 | 54156 | ERROR | WebSearcher.parsers | Parsing Exception
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/WebSearcher-0.2.12-py3.8.egg/WebSearcher/parsers.py", line 180, in parse_component
parsed_cmpt = parser(cmpt)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/WebSearcher-0.2.12-py3.8.egg/WebSearcher/component_parsers/top_stories.py", line 19, in parse_top_stories
subs = cmpt.find('div', {'class':'qmv19b'}).children
AttributeError: 'NoneType' object has no attribute 'children'
Any pointers on how to resolve?
Try replacing the contents of top_stories.py with:
def parse_top_stories(cmpt, ctype='top_stories'):
    """Parse a "Top Stories" component

    These components contain links to news articles and often feature an image.
    Sometimes the subcomponents are stacked vertically, and sometimes they are
    stacked horizontally and feature a larger image, resembling the video
    component.

    Args:
        cmpt (bs4 object): A "Top Stories" component

    Returns:
        list : list of parsed subcomponent dictionaries
    """
    subs = cmpt.find_all('g-inner-card')
    if not subs:
        # The 'qmv19b' div is no longer found, so cmpt.find(...).children
        # fails; collect the story links by their 'WlydOe' class instead.
        subs = cmpt.find_all(class_='WlydOe')
    return [parse_top_story(sub, ctype, sub_rank) for sub_rank, sub in enumerate(subs)]

def parse_top_story(sub, ctype, sub_rank=0):
    """Parse "Top Stories" component

    Args:
        sub (bs4 object): A "Top Stories" subcomponent

    Returns:
        dict: A parsed subresult
    """
    parsed = {'type':ctype, 'sub_rank':sub_rank}

    a = sub.find('a')
    if a:
        parsed['title'] = a.text
        parsed['url'] = a['href']
    else:
        # No nested link: the subcomponent itself carries the href
        parsed['url'] = sub.get('href')
        d = sub.find(class_='mCBkyc tNxQIb oz3cqf ynAwRc jBgGLd OSrXXb')
        parsed['title'] = d.text if d else None

    cite = sub.find('cite')
    parsed['cite'] = cite.text if cite else None

    timestamp = sub.find('span', {'class':['f', 'uaCsqe']})
    parsed['timestamp'] = timestamp.text if timestamp else None

    # Extract component specific details
    details = {}
    details['img_url'] = get_img_url(sub)
    details['orient'] = 'v' if sub.find('span', {'class':'uaCsqe'}) else 'h'
    details['live_stamp'] = bool(sub.find('span', {'class':'EugGe'}))
    parsed['details'] = details
    return parsed

def get_img_url(soup):
    """Extract image source, or None if no lazy-loaded image is present"""
    img = soup.find('img')
    if img and 'data-src' in img.attrs:
        return img.attrs['data-src']
    return None
Thanks! This worked.
Python version: 3.8.8
WebSearcher version: 0.2.9, 0.2.14 (same error for both)
I am getting the same error when running the code above, but it is raised from a different line:
Traceback (most recent call last):
File "/Users/vyoma/opt/anaconda3/lib/python3.8/site-packages/WebSearcher/searchers.py", line 232, in parse_results
self.results = parsers.parse_serp(soup, serp_id=self.serp_id)
File "/Users/vyoma/opt/anaconda3/lib/python3.8/site-packages/WebSearcher/parsers.py", line 216, in parse_serp
cmpts = extract_components(soup)
File "/Users/vyoma/opt/anaconda3/lib/python3.8/site-packages/WebSearcher/parsers.py", line 139, in extract_components
column = extract_results_column(soup)
File "/Users/vyoma/opt/anaconda3/lib/python3.8/site-packages/WebSearcher/parsers.py", line 56, in extract_results_column
column = [('main', c) for c in rso.children if c.name not in drop_tags]
AttributeError: 'NoneType' object has no attribute 'children'
How can I go about fixing this?
Hey all - apologies for the late reply. The current version of WebSearcher is optimized to parse SERPs from 2020 (we developed and evaluated the parser on a large data collection from that time). The HTML tags and their attributes have changed significantly since then, hence these errors.
But! We've started to update the parser to handle contemporary SERPs. Hopefully, we can push those updates in the not too distant future.
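Until those updates land, both tracebacks in this thread come from the same pattern: BeautifulSoup's find() returns None when the expected tag is missing, and the parser then reads .children on that None. Below is a minimal sketch of a guard, assuming the results container is looked up by id 'rso' (inferred from the variable name in the second traceback, not confirmed against the source). The helper name safe_children and the sample markup are hypothetical and not part of WebSearcher.

```python
from bs4 import BeautifulSoup, Tag


def safe_children(parent, name, attrs):
    """Return the child tags of the first match, or [] if nothing matches.

    Hypothetical helper for illustration; not part of WebSearcher.
    """
    node = parent.find(name, attrs)
    if node is None:
        # The expected container is absent (e.g. the markup changed),
        # so return an empty list instead of raising AttributeError.
        return []
    # .children also yields text nodes; keep only element children.
    return [child for child in node.children if isinstance(child, Tag)]


# Made-up markup: one page has the expected container, the other does not.
old_markup = BeautifulSoup(
    '<div id="rso"><div>a</div><div>b</div></div>', "html.parser"
)
new_markup = BeautifulSoup('<div id="search"><div>a</div></div>', "html.parser")

old_children = safe_children(old_markup, "div", {"id": "rso"})
new_children = safe_children(new_markup, "div", {"id": "rso"})
```

A guard like this only stops the crash. The parser still returns no results for that component until its selectors match the current markup.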