mlscraper
Adding question mark to the sample fails
The following code,
training_file = BeautifulSoup("<p>with a question mark?</p>", features="lxml").prettify()
training_set = TrainingSet()
page = Page(training_file)
sample = Sample(page, {'title': 'with a question mark?'})
training_set.add_sample(sample)
scraper = train_scraper(training_set)
throws this error:
mlscraper.samples.NoMatchFoundException: No match found on page (self.page=<Page self.soup.name='[document]' classes=None, text=with a que...>, self.value='with a question mark?'
But the same code works once the question mark is removed from the HTML:
training_file = BeautifulSoup("<p>with a question mark</p>", features="lxml").prettify()
training_set = TrainingSet()
page = Page(training_file)
sample = Sample(page, {'title': 'with a question mark'})
training_set.add_sample(sample)
scraper = train_scraper(training_set)
The same file worked with autoscraper without any issue.
Thanks, that's very weird.
- Which version are you using?
- Since generate_all_value_matches just calls BeautifulSoup's find_all in the latest version, I have no answer yet.
Does the same happen for <html><body><p>what?</p></body></html>?
This is how it's meant to be called, not sure what you're trying to achieve.
training_set = TrainingSet()
html = "<html><body><p>with a question mark?</p></body></html>"
page = Page(html)
sample = Sample(page, {'title': 'with a question mark?'})
training_set.add_sample(sample)
scraper = train_scraper(training_set)
print(scraper)
Prettify creates whitespace that mlscraper currently is sensitive to. I know this is not perfect, but it's on the roadmap.
Related: #15
I found the issue here:
def _generate_find_all(self, item):
    assert isinstance(item, str), "can only search for str at the moment"
    # text
    # - since text matches including whitespace, a regex is used
    target_regex = re.compile(r"^\s*%s\s*$" % html.escape(item))
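This generates a wrong regex. The ? is not escaped, so it acts as a quantifier that makes the preceding k optional instead of matching a literal question mark. For the input
with a question mark?
the compiled pattern is
re.compile('^\\s*with a question mark?\\s*$')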
Using re.escape fixes this issue:
def _generate_find_all(self, item):
    assert isinstance(item, str), "can only search for str at the moment"
    # text
    # - since text matches including whitespace, a regex is used
    target_regex = re.compile(r"^\s*%s\s*$" % html.escape(re.escape(item)))
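To make the difference concrete, here is a minimal standalone sketch that uses only the standard re module. It is not mlscraper's code: the names text, unescaped, and escaped are made up for illustration, and html.escape is left out because it does not change this particular string.

```python
import re

text = "with a question mark?"

# Unescaped: the trailing "?" is read as a quantifier, so "k" becomes optional
# and nothing in the pattern can match a literal question mark.
unescaped = re.compile(r"^\s*%s\s*$" % text)

# Escaped: re.escape turns "?" into "\?", which matches the literal character.
escaped = re.compile(r"^\s*%s\s*$" % re.escape(text))

print(unescaped.match(text) is not None)                    # False
print(unescaped.match("with a question mar") is not None)   # True
print(escaped.match("  with a question mark?\n") is not None)  # True
```

The last call also shows that the surrounding \s* still absorbs leading and trailing whitespace after escaping.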
Good catch!