mlscraper
Spiegel example from Gist
@lorey Here you go. Can you please check if this runs on your system?
When I run it, I get an error.
Thanks for adding.
What's the error?
@lorey These are just the last few lines. Do you want me to post the whole output?
INFO:root:found len(value_matches)=2 on page (self.value='24.06.2022, 14.26 Uhr', self.page=<Page self.soup.name='[document]' classes=None, text=Kristina H...>)
INFO:root:value_matches=[<ValueMatch self.node=<Node self.soup.name='time' classes=['timeformat'], text=24.06.2022...>, self.extractor=<TextValueExtractor>>, <ValueMatch self.node=<Node self.soup.name='div' classes=['font-sansUI', 'lg:text-base', 'md:text-base', 'sm:text-s', 'text-shade-dark', 'dark:text-shade-light'], text=24.06.2022...>, self.extractor=<TextValueExtractor>>]
Traceback (most recent call last):
  File "/Users/antonengelhardt/Documents/SaveStrike Code/mlscraper-ae/examples/spiegel.py", line 93, in <module>
    train_and_scrape()
  File "/Users/antonengelhardt/Documents/SaveStrike Code/mlscraper-ae/examples/spiegel.py", line 48, in train_and_scrape
    scraper = train_spon_scraper()
              ^^^^^^^^^^^^^^^^^^^^
  File "/Users/antonengelhardt/Documents/SaveStrike Code/mlscraper-ae/examples/spiegel.py", line 67, in train_spon_scraper
    scraper = train_scraper(training_set, complexity=5)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/site-packages/mlscraper/training.py", line 44, in train_scraper
    sample_matches = [
                     ^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/site-packages/mlscraper/training.py", line 45, in <listcomp>
    sorted(s.get_matches(), key=lambda m: m.span)[:100]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/site-packages/mlscraper/training.py", line 45, in <lambda>
    sorted(s.get_matches(), key=lambda m: m.span)[:100]
                                          ^^^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/functools.py", line 1001, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/site-packages/mlscraper/matches.py", line 131, in span
    return sum(
           ^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/site-packages/mlscraper/matches.py", line 132, in <genexpr>
    m.span + get_relative_depth(m.root, self.root)
    ^^^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/functools.py", line 1001, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/site-packages/mlscraper/matches.py", line 165, in span
    return sum(get_relative_depth(m.root, self.root) + m.span for m in self.matches)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/site-packages/mlscraper/matches.py", line 165, in <genexpr>
    return sum(get_relative_depth(m.root, self.root) + m.span for m in self.matches)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/site-packages/mlscraper/html.py", line 181, in get_relative_depth
    i = node_parents.index(root.soup)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: <a class="text-black dark:text-shade-lightest font-bold border-b border-shade-light hover:border-black dark:hover:border-white" href="https://www.spiegel.de/impressum/autor-1a9752a4-0001-0003-0000-000000020534" target="_self" title="Nike Laurenz">
Nike Laurenz</a> is not in list
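
A note on reading the last frame: `node_parents.index(root.soup)` is a plain `list.index` call, and `list.index` raises `ValueError: ... is not in list` whenever the element passed as root is not among the node's parents. The sketch below reproduces only that failure mode in isolation. `Node`, `parents()` and `relative_depth` are hypothetical stand-ins written for illustration, not mlscraper's classes, and the depth convention is an assumption. Whether a non-ancestor root is the actual cause in the Spiegel example is not confirmed by the log.

```python
class Node:
    """Hypothetical stand-in for a parsed HTML element (not mlscraper's Node)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def parents(self):
        """Return ancestors from the nearest parent up to the document root."""
        found = []
        current = self.parent
        while current is not None:
            found.append(current)
            current = current.parent
        return found


def relative_depth(node, root):
    """Count the levels between node and root; fail clearly if root is no ancestor."""
    if node is root:
        return 0
    node_parents = node.parents()
    if root not in node_parents:
        # Without this guard, node_parents.index(root) raises the bare
        # "is not in list" ValueError seen in the traceback.
        raise ValueError(f"{root.name!r} is not an ancestor of {node.name!r}")
    return node_parents.index(root) + 1


document = Node("[document]")
header = Node("div", parent=document)
time_tag = Node("time", parent=header)
author_link = Node("a", parent=document)  # separate branch, not an ancestor of time_tag

depth_ok = relative_depth(time_tag, document)

try:
    relative_depth(time_tag, author_link)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Here the `time` node sits under `div`, while the `a` node lives in a different branch, so asking for the depth of `time` relative to `a` fails. An explicit check like the one above only makes the error message easier to read; it does not decide which root the matches should share.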