Spiegel example from Gist

Open antonengelhardt opened this issue 3 years ago • 3 comments

This example is from a gist by @lorey (the author).

antonengelhardt, Apr 14 '23 09:04

@lorey Here you go. Can you please check if this runs on your system?

When I run it, I get an error.

antonengelhardt, Apr 14 '23 09:04

Thanks for adding.

What's the error?

lorey, Apr 14 '23 11:04

@lorey These are just the last few lines. Do you want me to post the whole output?

INFO:root:found len(value_matches)=2 on page (self.value='24.06.2022, 14.26 Uhr', self.page=<Page self.soup.name='[document]' classes=None, text=Kristina H...>)
INFO:root:value_matches=[<ValueMatch self.node=<Node self.soup.name='time' classes=['timeformat'], text=24.06.2022...>, self.extractor=<TextValueExtractor>>, <ValueMatch self.node=<Node self.soup.name='div' classes=['font-sansUI', 'lg:text-base', 'md:text-base', 'sm:text-s', 'text-shade-dark', 'dark:text-shade-light'], text=24.06.2022...>, self.extractor=<TextValueExtractor>>]
Traceback (most recent call last):
  File "/Users/antonengelhardt/Documents/SaveStrike Code/mlscraper-ae/examples/spiegel.py", line 93, in <module>
    train_and_scrape()
  File "/Users/antonengelhardt/Documents/SaveStrike Code/mlscraper-ae/examples/spiegel.py", line 48, in train_and_scrape
    scraper = train_spon_scraper()
              ^^^^^^^^^^^^^^^^^^^^
  File "/Users/antonengelhardt/Documents/SaveStrike Code/mlscraper-ae/examples/spiegel.py", line 67, in train_spon_scraper
    scraper = train_scraper(training_set, complexity=5)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/site-packages/mlscraper/training.py", line 44, in train_scraper
    sample_matches = [
                     ^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/site-packages/mlscraper/training.py", line 45, in <listcomp>
    sorted(s.get_matches(), key=lambda m: m.span)[:100]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/site-packages/mlscraper/training.py", line 45, in <lambda>
    sorted(s.get_matches(), key=lambda m: m.span)[:100]
                                          ^^^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/functools.py", line 1001, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/site-packages/mlscraper/matches.py", line 131, in span
    return sum(
           ^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/site-packages/mlscraper/matches.py", line 132, in <genexpr>
    m.span + get_relative_depth(m.root, self.root)
    ^^^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/functools.py", line 1001, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/site-packages/mlscraper/matches.py", line 165, in span
    return sum(get_relative_depth(m.root, self.root) + m.span for m in self.matches)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/site-packages/mlscraper/matches.py", line 165, in <genexpr>
    return sum(get_relative_depth(m.root, self.root) + m.span for m in self.matches)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/site-packages/mlscraper/html.py", line 181, in get_relative_depth
    i = node_parents.index(root.soup)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: <a class="text-black dark:text-shade-lightest font-bold border-b border-shade-light hover:border-black dark:hover:border-white" href="https://www.spiegel.de/impressum/autor-1a9752a4-0001-0003-0000-000000020534" target="_self" title="Nike Laurenz">
Nike Laurenz</a> is not in list

antonengelhardt, Apr 14 '23 13:04
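
For anyone reading the traceback above: the last frame fails in `node_parents.index(root.soup)`, and `list.index` raises `ValueError` when the element is not in the list. In other words, the `<a>` element used as the root does not appear among the parents of the node being measured. The sketch below reproduces that failure mode in isolation. It is not mlscraper's code: `Node` and `relative_depth` are hypothetical stand-ins, and the assumption that the lookup walks a list of ancestors is inferred only from the traceback.

```python
class Node:
    """Hypothetical stand-in for a parsed HTML element (not mlscraper's Node)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def parents(self):
        # Walk upwards and collect every ancestor, nearest first.
        node, result = self.parent, []
        while node is not None:
            result.append(node)
            node = node.parent
        return result


def relative_depth(node, root):
    """Number of steps from node up to root; 0 if they are the same node."""
    if node is root:
        return 0
    # list.index raises ValueError when root is not an ancestor of node.
    return node.parents().index(root) + 1


html = Node("html")
body = Node("body", parent=html)
time_tag = Node("time", parent=body)
author_link = Node("a", parent=body)

# html is an ancestor of <time>, so the lookup succeeds.
depth_ok = relative_depth(time_tag, html)

try:
    # <a> is a sibling of <time>, not an ancestor, so the lookup fails.
    relative_depth(time_tag, author_link)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

If the same thing happens in the real run, the matches being combined may not share the root that the depth calculation expects. Checking which root each match carries before the depth lookup should confirm or rule this out.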