scrapely
ZeroDivisionError when training with zero-length data
(Minor bug.) I installed scrapely from pip this morning.
This is a wacky edge case, but I think you could raise a more constructive error.
(Who wants to extract a zero-length string from a document? It's a bit like a magician pulling some atmosphere out of a hat: it's always going to be there...)
Check it out:
In [97]: from scrapely import Scraper
In [98]: s = Scraper()
In [99]: s.train('http://www.google.com', {'image': u''})
- - - - - - - - - - - - - - - - -
ZeroDivisionError Traceback (most recent call last)
/home/username/myfolder/<ipython-input-99-233d0ac90e7f> in <module>()
----> 1 s.train('http://www.google.com', {'image': u''})
/usr/local/lib/python2.7/dist-packages/scrapely/__init__.pyc in train(self, url, data, encoding)
44 def train(self, url, data, encoding=None):
45 page = url_to_page(url, encoding)
---> 46 self.train_from_htmlpage(page, data)
47
48 def scrape(self, url, encoding=None):
/usr/local/lib/python2.7/dist-packages/scrapely/__init__.pyc in train_from_htmlpage(self, htmlpage, data)
39 if isinstance(value, str):
40 value = value.decode(htmlpage.encoding or 'utf-8')
---> 41 tm.annotate(field, best_match(value))
42 self.add_template(tm.get_template())
43
/usr/local/lib/python2.7/dist-packages/scrapely/template.pyc in annotate(self, field, score_func, best_match)
31
32 """
---> 33 indexes = self.select(score_func)
34 if not indexes:
35 raise FragmentNotFound("Fragment not found annotating %r using: %s" %
/usr/local/lib/python2.7/dist-packages/scrapely/template.pyc in select(self, score_func)
46 matches = []
47 for i, fragment in enumerate(htmlpage.parsed_body):
---> 48 score = score_func(fragment, htmlpage)
49 if score:
50 matches.append((score, i))
/usr/local/lib/python2.7/dist-packages/scrapely/template.pyc in func(fragment, page)
95 fdata = page.fragment_data(fragment).strip()
96 if text in fdata:
---> 97 return float(len(text)) / len(fdata) - (1e-6 * fragment.start)
98 else:
99 return 0.0
ZeroDivisionError: float division by zero
This is the reason for the error.
return float(len(text)) / len(fdata) - (1e-6 * fragment.start)
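To see the failure in isolation, here is a minimal standalone sketch of that line. The `score` function and its plain-string arguments are hypothetical stand-ins (the real closure takes a fragment and a page object), but the arithmetic is the same. The key point is that `'' in ''` is true in Python, so an empty training value matches any fragment whose data strips down to an empty string, and `len(fdata)` is then zero.

```python
# Hypothetical stand-in for the scoring closure shown in the traceback.
# It takes plain strings instead of scrapely's fragment/page objects.
def score(text, fdata, start=0):
    fdata = fdata.strip()
    if text in fdata:  # '' in '' is True, so an empty fragment gets through
        return float(len(text)) / len(fdata) - (1e-6 * start)
    return 0.0


try:
    score(u'', u'   ')  # whitespace-only fragment strips to ''
    raised = False
except ZeroDivisionError:
    raised = True

print(raised)
```

With a non-empty fragment the same call just returns a zero score, so only fragments that strip to nothing trigger the division by zero.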
If the score being returned is inversely proportional to the length of fdata, could we just write this?
    fdata = page.fragment_data(fragment).strip()
    if text in fdata:
        if not len(fdata):
            return float("inf")
        return float(len(text)) / len(fdata) - (1e-6 * fragment.start)
    else:
        return 0.0
return func
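An alternative to returning infinity, sketched here under some assumptions, would be to treat an empty fragment as a non-match and to reject empty training values up front with a clearer error, which is what the original report asked for. Both `validate_training_data` and `safe_score` are hypothetical names, not part of scrapely, and the sketch assumes every training value is a single string.

```python
def validate_training_data(data):
    """Reject empty training values before any scoring happens.

    Hypothetical helper, not part of scrapely. Assumes each value
    in `data` is a single string.
    """
    for field, value in data.items():
        if not value or not value.strip():
            raise ValueError(
                "Cannot train field %r on an empty value" % (field,))


def safe_score(text, fdata, start=0):
    """Hypothetical variant of the scoring line that never divides by zero."""
    fdata = fdata.strip()
    if not fdata or text not in fdata:
        return 0.0  # an empty fragment cannot be a meaningful match
    return float(len(text)) / len(fdata) - (1e-6 * start)
```

Scoring an empty fragment as 0.0 rather than infinity keeps it out of the match list, since the selection loop in the traceback only keeps fragments with a truthy score.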
This isn't a wacky edge-case at all.
I got the same error using actual data and had to patch it.
Same here, I reproduced this error using regular, non-empty data.
The patch has been merged, so I believe this issue can be closed?