Parsing Exception 'NoneType' object has no attribute 'children'

Open vishalmohanty opened this issue 3 years ago • 4 comments

Python version: 3.8.0
WebSearcher version: 0.2.12

Hi, I'm running this simple Python script, saved as searcher.py:

import WebSearcher as ws

# Initialize crawler with defaults (headers, logs, ssh tunnels)
se = ws.SearchEngine()
vars(se)

# Conduct Search
se.search('immigration')

# Parse Results
se.parse_results()

Command: python3 searcher.py

Trace

2022-01-31 18:53:59,871 | 54156 | INFO | WebSearcher.searchers | 200 | immigration
2022-01-31 18:53:59,958 | 54156 | ERROR | WebSearcher.parsers | Parsing Exception
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/WebSearcher-0.2.12-py3.8.egg/WebSearcher/parsers.py", line 180, in parse_component
    parsed_cmpt = parser(cmpt)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/WebSearcher-0.2.12-py3.8.egg/WebSearcher/component_parsers/top_stories.py", line 19, in parse_top_stories
    subs = cmpt.find('div', {'class':'qmv19b'}).children
AttributeError: 'NoneType' object has no attribute 'children'

Any pointers on how to resolve this?

vishalmohanty, Feb 01 '22 02:02

Try replacing the contents of the top stories parser file (component_parsers/top_stories.py) with:

def parse_top_stories(cmpt, ctype='top_stories'):
    """Parse a "Top Stories" component

    These components contain links to news articles and often feature an image.
    Sometimes the subcomponents are stacked vertically, and sometimes they are
    stacked horizontally and feature a larger image, resembling the video
    component.

    Args:
        cmpt (bs4 object): A "Top Stories" component

    Returns:
        list : list of parsed subcomponent dictionaries
    """
    subs = cmpt.find_all('g-inner-card')
    if subs:
        return [parse_top_story(sub, ctype, sub_rank) for sub_rank, sub in enumerate(subs)]
    else:
        # Previous lookup, which raised the AttributeError:
        # subs = cmpt.find('div', {'class':'qmv19b'}).children
        # find_all returns an empty list (not None) when nothing matches
        subs = cmpt.find_all(class_='WlydOe')
        return [parse_top_story(sub, ctype, sub_rank) for sub_rank, sub in enumerate(subs)]

def parse_top_story(sub, ctype, sub_rank=0):
    """Parse "Top Stories" component

    Args:
        sub (bs4 object): A "Top Stories" subcomponent

    Returns:
        dict: A parsed subresult
    """
    parsed = {'type':ctype, 'sub_rank':sub_rank}
    a = sub.find('a')
    if a:
        parsed['title'] = a.text
        parsed['url'] = a.get('href')
    else:
        # No nested link: the subcomponent itself may carry the href
        parsed['url'] = sub.get('href')
        d = sub.find(class_='mCBkyc tNxQIb oz3cqf ynAwRc jBgGLd OSrXXb')
        parsed['title'] = d.text if d else None

    cite = sub.find('cite')
    parsed['cite'] = cite.text if cite else None

    timestamp = sub.find('span', {'class':['f', 'uaCsqe']})
    parsed['timestamp'] = timestamp.text if timestamp else None

    # Extract component specific details
    details = {}
    details['img_url'] = get_img_url(sub)
    details['orient'] = 'v' if sub.find('span', {'class':'uaCsqe'}) else 'h'
    details['live_stamp'] = True if sub.find('span', {'class':'EugGe'}) else False
    parsed['details'] = details

    return parsed

def get_img_url(soup):
    """Extract image source"""
    img = soup.find('img')
    if img and 'data-src' in img.attrs:
        return img.attrs['data-src']

miriYitshaki, Feb 01 '22 07:02
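
For context on why the original lookup fails and the find_all variant does not, here is a minimal sketch. It assumes BeautifulSoup (bs4) is installed; the HTML string and the example.com URL are made up for illustration and are not taken from a real SERP.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a component without the 'qmv19b' wrapper div
html = '<div class="cmpt"><a class="WlydOe" href="https://example.com/story">Story</a></div>'
cmpt = BeautifulSoup(html, 'html.parser')

# find() returns None when nothing matches, so chaining .children fails
wrapper = cmpt.find('div', {'class': 'qmv19b'})
try:
    wrapper.children
except AttributeError as err:
    message = str(err)

# find_all() returns an empty list when nothing matches, so it is safe to test or iterate
cards = cmpt.find_all('g-inner-card')
links = cmpt.find_all(class_='WlydOe')
urls = [a['href'] for a in links]
```

Because find_all never returns None, the fallback branch can iterate its result without an extra check. If neither selector matches, the parser returns an empty list instead of raising.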

Thanks! This worked.

vishalmohanty, Feb 08 '22 04:02

Python version: 3.8.8
WebSearcher version: 0.2.9, 0.2.14 (same error for both)

I am seeing the same error when running the same code as above, but it is raised from a different line:

Traceback (most recent call last):
  File "/Users/vyoma/opt/anaconda3/lib/python3.8/site-packages/WebSearcher/searchers.py", line 232, in parse_results
    self.results = parsers.parse_serp(soup, serp_id=self.serp_id)
  File "/Users/vyoma/opt/anaconda3/lib/python3.8/site-packages/WebSearcher/parsers.py", line 216, in parse_serp
    cmpts = extract_components(soup)
  File "/Users/vyoma/opt/anaconda3/lib/python3.8/site-packages/WebSearcher/parsers.py", line 139, in extract_components
    column = extract_results_column(soup)
  File "/Users/vyoma/opt/anaconda3/lib/python3.8/site-packages/WebSearcher/parsers.py", line 56, in extract_results_column
    column = [('main', c) for c in rso.children if c.name not in drop_tags]
AttributeError: 'NoneType' object has no attribute 'children'

How can I go about fixing this?

vyoma-raman, Sep 25 '22 20:09
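
One way to keep this second traceback from crashing the run is to guard the lookup before touching .children. The sketch below is not WebSearcher's code: extract_children is a hypothetical helper, the drop_tags default is made up, and looking up the container by id 'rso' is an assumption based on the variable name in the traceback.

```python
from bs4 import BeautifulSoup, Tag

def extract_children(parent, drop_tags=('script', 'style')):
    """Return the child tags of `parent`, or [] if `parent` is None (hypothetical helper)."""
    if parent is None:
        return []
    return [c for c in parent.children
            if isinstance(c, Tag) and c.name not in drop_tags]

# Page without the expected container: the lookup yields None, the helper yields []
missing = BeautifulSoup('<div id="main"><p>No results container</p></div>', 'html.parser')
rso = missing.find('div', {'id': 'rso'})  # assumed lookup
column = [('main', c) for c in extract_children(rso)]

# Page with the container: child tags are kept, dropped tags are skipped
found = BeautifulSoup('<div id="rso"><div>a</div><script>x</script><div>b</div></div>', 'html.parser')
present = [('main', c) for c in extract_children(found.find('div', {'id': 'rso'}))]
```

A guard like this only avoids the exception. If the container is missing because the page layout changed, the result list will simply be empty, so the selectors themselves still need updating.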

Hey all - apologies for the late reply. The current version of WebSearcher is optimized to parse SERPs from 2020 (we developed and evaluated the parser on a large data collection from that time). The HTML tags and their attributes have changed pretty significantly since then, which is what causes these errors.

But! We've started to update the parser to handle contemporary SERPs. Hopefully, we can push those updates in the not too distant future.

jlgleason, Oct 21 '22 21:10