extruct
Crash on JSONDecodeError from body of YouTube page
I have some code that pulls metadata from YouTube:
import extruct
import requests

response = requests.get(video_url)
metadata = extruct.extract(response.text, base_url="https://youtube.com")
I've recently noticed crashes, but only on some videos.
No crash:
https://www.youtube.com/watch?v=ZY48KUAZKhM
https://www.youtube.com/watch?v=ZlVI7YJGHq0
Crash:
https://www.youtube.com/watch?v=987wzJ2NHBE
https://www.youtube.com/watch?v=0-EF60neguk
Common factor among those that crash is apostrophes in the channel name!
Traceback (most recent call last):
File "/home/will/local/breda/src/dredger/ingest/tests/test_youtube.py", line 72, in test_one
youtube.get_video_data("https://www.youtube.com/watch?v=987wzJ2NHBE")
File "/home/will/local/breda/src/dredger/ingest/youtube.py", line 46, in get_video_data
metadata = extruct.extract(response.text, base_url="https://youtube.com")
File "/home/will/.virtualenvs/breda/lib/python3.8/site-packages/extruct/_extruct.py", line 108, in extract
output[syntax] = list(extract(document, base_url=base_url))
File "/home/will/.virtualenvs/breda/lib/python3.8/site-packages/extruct/jsonld.py", line 25, in extract_items
return [
File "/home/will/.virtualenvs/breda/lib/python3.8/site-packages/extruct/jsonld.py", line 25, in <listcomp>
return [
File "/home/will/.virtualenvs/breda/lib/python3.8/site-packages/extruct/jsonld.py", line 38, in _extract_items
data = jstyleson.loads(HTML_OR_JS_COMMENTLINE.sub('', script),strict=False)
File "/home/will/.virtualenvs/breda/lib/python3.8/site-packages/jstyleson.py", line 123, in loads
return json.loads(dispose(text), **kwargs)
File "/usr/lib/python3.8/json/__init__.py", line 370, in loads
return cls(**kw).decode(s)
File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.8/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid \escape: line 1 column 211 (char 210)
Haven't had a chance today to dig into much beyond triaging the above.
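For anyone triaging further, the character offset reported in the JSONDecodeError can be used to look at the text around the failure. The following is only a minimal sketch: show_decode_error_context is a hypothetical helper (not part of extruct or jstyleson), and the sample string is made up to mimic a channel name containing an escaped apostrophe.

```python
import json


def show_decode_error_context(text, width=20):
    """Return the text around the point where json.loads fails.

    Hypothetical helper for illustration only. Returns None if the
    text parses without error.
    """
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        # exc.pos is the character offset reported in the error message.
        start = max(exc.pos - width, 0)
        return text[start:exc.pos + width]
    return None


# Made-up sample: a raw string, so it contains a literal backslash, x, 2, 7.
sample = r'{"name": "Someone\x27s Channel"}'
print(show_decode_error_context(sample))
```

Running the same helper on the JSON-LD script text from a crashing page should show what sits around char 210.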
I haven't been able to replicate the issue. The links in your Crash list point to videos that have been removed, which may be why you are getting this error. I suggest checking that the video links still work before passing the pages to extruct.extract(). Here is the code that I used:
Code:
import extruct
import requests
from w3lib.html import get_base_url

crash_links = [
    'https://www.youtube.com/watch?v=987wzJ2NHBE',
    'https://www.youtube.com/watch?v=0-EF60neguk',
]

for video_url in crash_links:
    response = requests.get(video_url)
    base_url = get_base_url(response.text, response.url)
    metadata = extruct.extract(response.text, base_url=base_url, uniform=True, syntaxes=['json-ld', 'microdata', 'opengraph'])
    print(metadata)
Output:
{'microdata': [], 'json-ld': [], 'opengraph': []}
{'microdata': [], 'json-ld': [], 'opengraph': []}
I replicated the issue using these YouTube links:
https://www.youtube.com/watch?v=-J2e8OlBdPs
https://www.youtube.com/watch?v=qP07oyFTRXc
https://www.youtube.com/watch?v=BUrnfkxwozM
As @wjdp suggested, it is because of the apostrophe in the channel name. json.loads() raises an error when the input contains hex escapes such as "\x27" (the apostrophe), because \x is not a valid escape sequence in a JSON string. I created pull request #195, which replaces each hex escape with the character it encodes before the text is passed to json.loads().
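To illustrate the failure and one possible workaround, here is a minimal sketch. It is not the code from the pull request: the replace_hex_escapes helper and the sample string are hypothetical. It also differs from the approach described above in one respect: it rewrites each \xNN into the equivalent \u00NN escape, which JSON does accept, rather than inserting the literal character. That keeps the document valid even if the escape encodes a double quote (\x22).

```python
import json
import re

# Matches \xNN when it is not directly preceded by another backslash.
# Assumption: this simple lookbehind is enough for illustration; it does
# not handle longer runs of backslashes correctly.
HEX_ESCAPE = re.compile(r'(?<!\\)\\x([0-9a-fA-F]{2})')


def replace_hex_escapes(text):
    """Rewrite \\xNN escapes as the JSON-valid \\u00NN form (hypothetical helper)."""
    return HEX_ESCAPE.sub(lambda m: "\\u00" + m.group(1), text)


# Made-up sample: a raw string, so the backslash is literal.
raw = r'{"author": "Will\x27s Channel"}'

try:
    json.loads(raw)
except json.JSONDecodeError as exc:
    print("before:", exc.msg)

data = json.loads(replace_hex_escapes(raw))
print("after:", data["author"])
```

Until a fix is released, a caller could apply a substitution like this to the page text before handing it to extruct, at the cost of also touching any \xNN sequences outside the JSON-LD script.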