mlscraper

Adding question mark to the sample fails

Open entrptaher opened this issue 1 year ago • 8 comments

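The following code:

    from bs4 import BeautifulSoup
    # mlscraper import paths below are assumed from the project's README
    from mlscraper.html import Page
    from mlscraper.samples import Sample, TrainingSet
    from mlscraper.training import train_scraper

    # prettify() re-indents the markup, so the <p> text gains surrounding whitespace
    training_file = BeautifulSoup("<p>with a question mark?</p>", features="lxml").prettify()
    training_set = TrainingSet()
    page = Page(training_file)
    sample = Sample(page, {'title': 'with a question mark?'})
    training_set.add_sample(sample)
    scraper = train_scraper(training_set)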

throws this error:

    mlscraper.samples.NoMatchFoundException: No match found on page (self.page=<Page self.soup.name='[document]' classes=None, text=with a que...>, self.value='with a question mark?'

But the same code works once the question mark is removed from both the HTML and the sample value:

    training_file = BeautifulSoup("<p>with a question mark</p>", features="lxml").prettify()
    training_set = TrainingSet()
    page = Page(training_file)
    sample = Sample(page, {'title': 'with a question mark'})
    training_set.add_sample(sample)
    scraper = train_scraper(training_set)

entrptaher, May 01 '23 08:05

The same file worked with autoscraper without any issue.

entrptaher, May 01 '23 08:05

Thanks, that's very weird.

  • Which version are you using?
  • Since generate_all_value_matches just calls BeautifulSoup's find_all in the latest version, I have no answer yet.

lorey, May 01 '23 14:05

Does the same happen for <html><body><p>what?</p></body></html>?

lorey, May 01 '23 14:05

This is how it's meant to be called; I'm not sure what you're trying to achieve with prettify.

    training_set = TrainingSet()
    html = "<html><body><p>with a question mark?</p></body></html>"
    page = Page(html)
    sample = Sample(page, {
        'title': 'with a question mark?'})
    training_set.add_sample(sample)
    scraper = train_scraper(training_set)
    print(scraper)

lorey, May 01 '23 14:05

Prettify creates whitespace that mlscraper currently is sensitive to. I know this is not perfect, but it's on the roadmap.
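
As a rough, stand-alone illustration of that whitespace effect (standard library only, not mlscraper's matching logic): the sketch below collects text nodes from a compact page and from a re-indented one. TextCollector and text_nodes are hypothetical helpers, and the prettified string is an assumed approximation of what prettify() emits with the lxml parser.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: records every text node exactly as it appears."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        self.texts.append(data)


def text_nodes(markup):
    """Return the raw text nodes found in the given markup."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.texts


compact = "<html><body><p>with a question mark?</p></body></html>"
# Assumed shape of the prettified page: one tag per line, indented text
prettified = (
    "<html>\n <body>\n  <p>\n   with a question mark?\n  </p>\n </body>\n</html>\n"
)

value = "with a question mark?"

# Exact comparison finds the value only in the compact page
print(value in text_nodes(compact))
print(value in text_nodes(prettified))
# Stripping each node before comparing finds it in the prettified page too
print(value in [t.strip() for t in text_nodes(prettified)])
```

Any matcher that compares text exactly needs to strip or otherwise tolerate that padding when the page has been prettified.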

lorey, May 01 '23 14:05

Related: #15

lorey, May 01 '23 14:05

I found the issue here:

    def _generate_find_all(self, item):
        assert isinstance(item, str), "can only search for str at the moment"

        # text
        # - since text matches including whitespace, a regex is used
        target_regex = re.compile(r"^\s*%s\s*$" % html.escape(item))

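This generates the wrong regex, because the unescaped ? acts as a quantifier that makes the preceding k optional instead of matching a literal question mark:

    with a question mark?
    re.compile('^\\s*with a question mark?\\s*$')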

Using re.escape fixes this issue:

    def _generate_find_all(self, item):
        assert isinstance(item, str), "can only search for str at the moment"

        # text
        # - since text matches including whitespace, a regex is used
        target_regex = re.compile(r"^\s*%s\s*$" % html.escape(re.escape(item)))
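
To make the difference concrete, here is a minimal stand-alone sketch using only the standard library. build_pattern is a hypothetical helper that mirrors the pattern construction quoted above; it is not part of mlscraper. node_text is an assumed example of the whitespace-padded text a prettified page would contain.

```python
import html
import re


def build_pattern(item: str, escape_regex: bool) -> re.Pattern:
    """Hypothetical helper: build the whitespace-tolerant pattern for item."""
    # With escape_regex=False, regex metacharacters in item stay active
    text = re.escape(item) if escape_regex else item
    return re.compile(r"^\s*%s\s*$" % html.escape(text))


value = "with a question mark?"
node_text = "\n   with a question mark?\n  "  # assumed prettified text node

unescaped = build_pattern(value, escape_regex=False)
escaped = build_pattern(value, escape_regex=True)

# Unescaped: "k?" means "optional k", so the literal "?" is never matched
print(bool(unescaped.match(node_text)))              # False
print(bool(unescaped.match("with a question mar")))  # True

# Escaped: "\?" matches the literal question mark
print(bool(escaped.match(node_text)))                # True
```

The same problem would affect any sample value containing other regex metacharacters such as ., +, ( or [.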

entrptaher, May 02 '23 04:05

Good catch!

lorey, May 02 '23 07:05