youtube-dl
Sport5.co.il is broken
Checklist
- [x] I'm reporting a broken site support
- [x] I've verified that I'm running youtube-dl version 2020.12.12
- [x] I've checked that all provided URLs are alive and playable in a browser
- [x] I've checked that all URLs and arguments with special characters are properly quoted or escaped
- [x] I've searched the bugtracker for similar issues including closed ones
Verbose log
> youtube-dl --version
2020.12.12
> youtube-dl --verbose "https://vod.sport5.co.il/?Vc=10195&Vi=355711"
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--verbose', 'https://vod.sport5.co.il/?Vc=10195&Vi=355711']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2020.12.12
[debug] Python version 3.9.0 (CPython) - Linux-5.9.13-arch1-1-x86_64-with-glibc2.32
[debug] exe versions: ffmpeg 4.3.1, ffprobe 4.3.1, rtmpdump 2.4
[debug] Proxy map: {}
[Sport5] 355711: Downloading webpage
ERROR: Unable to extract video id; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/youtube_dl/YoutubeDL.py", line 803, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/lib/python3.9/site-packages/youtube_dl/YoutubeDL.py", line 824, in __extract_info
    ie_result = ie.extract(url)
  File "/usr/lib/python3.9/site-packages/youtube_dl/extractor/common.py", line 532, in extract
    ie_result = self._real_extract(url)
  File "/usr/lib/python3.9/site-packages/youtube_dl/extractor/sport5.py", line 44, in _real_extract
    video_id = self._html_search_regex(r'clipId=([\w-]+)', webpage, 'video id')
  File "/usr/lib/python3.9/site-packages/youtube_dl/extractor/common.py", line 1019, in _html_search_regex
    res = self._search_regex(pattern, string, name, default, fatal, flags, group)
  File "/usr/lib/python3.9/site-packages/youtube_dl/extractor/common.py", line 1010, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)
youtube_dl.utils.RegexNotFoundError: Unable to extract video id; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Description
Hey, the Sport5.co.il extractor is not working. See the verbose log above.
As a workaround: play the video and press F12 to open the browser's developer tools. In the Network tab, select "All" and type m3u8 in the filter box. Right-click master.m3u8, copy its URL, and pass that URL to youtube-dl to download the video.
@october262
Using the link I posted above and your method:
Searching for master.m3u8 in the HTML body gives this element:
<iframe src="https://playern.sport5.co.il/vidclean/sportsvid/player.html?videoUrl=https://rgesport5-vh.akamaihd.net/i/bynet/sport5/sport5/PRV5/HSDioIMoGN/App/NM_VTR_TAK_MIDTJYLLAND_VS_LIVERPOOL_091220_,400,700,1100,1800,.mp4.csmil/master.m3u8&Type=vod&ShowRecommended=false&folder_id=808&ShowAdvertisement=True&iOSPosterImage=https://www.sport5.co.il/Sip_Storage/FILES/2/1058832.jpg&useAkamai=true" id="0" width="670px" class="frmVideo" height="378px" scrolling="no" frameborder="0" allow="autoplay" allowfullscreen="" style="height: 378px;"></iframe>
If we take the src attribute value of the iframe, we get:
https://playern.sport5.co.il/vidclean/sportsvid/player.html?videoUrl=https://rgesport5-vh.akamaihd.net/i/bynet/sport5/sport5/PRV5/HSDioIMoGN/App/NM_VTR_TAK_MIDTJYLLAND_VS_LIVERPOOL_091220_,400,700,1100,1800,.mp4.csmil/master.m3u8&Type=vod&ShowRecommended=false&folder_id=808&ShowAdvertisement=True&iOSPosterImage=https://www.sport5.co.il/Sip_Storage/FILES/2/1058832.jpg&useAkamai=true
And if we take the videoUrl query parameter out of this, we get:
https://rgesport5-vh.akamaihd.net/i/bynet/sport5/sport5/PRV5/HSDioIMoGN/App/NM_VTR_TAK_MIDTJYLLAND_VS_LIVERPOOL_091220_,400,700,1100,1800,.mp4.csmil/master.m3u8
And invoking youtube-dl on this URL works :smile:
Thanks!
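The manual steps above can be sketched in Python with only the standard library. This is a rough illustration, not the extractor's code: the function name `manifest_url_from_html` is made up here, the iframe markup is shortened from the element quoted above, and it assumes the page keeps passing the manifest in a `videoUrl` query parameter.

```python
import re
from urllib.parse import parse_qs, urlparse

# Shortened copy of the iframe quoted above (fewer query parameters and attributes).
IFRAME_HTML = (
    '<iframe src="https://playern.sport5.co.il/vidclean/sportsvid/player.html'
    '?videoUrl=https://rgesport5-vh.akamaihd.net/i/bynet/sport5/sport5/PRV5/'
    'HSDioIMoGN/App/NM_VTR_TAK_MIDTJYLLAND_VS_LIVERPOOL_091220_'
    ',400,700,1100,1800,.mp4.csmil/master.m3u8'
    '&Type=vod&ShowRecommended=false&useAkamai=true" class="frmVideo"></iframe>'
)


def manifest_url_from_html(html):
    """Return the videoUrl parameter of the first iframe src, or None."""
    # Step 1: find the src attribute of the first iframe.
    m = re.search(
        r'<iframe\b[^>]*?\bsrc\s*=\s*(["\'])(?P<src>.*?)\1',
        html, flags=re.I | re.S)
    if not m:
        return None
    # Step 2: split the query string and pick out videoUrl.
    query = parse_qs(urlparse(m.group('src')).query)
    values = query.get('videoUrl')
    return values[-1] if values else None


manifest_url = manifest_url_from_html(IFRAME_HTML)
```

The resulting `manifest_url` is the master.m3u8 address that can then be handed to youtube-dl.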
Now we only need to extract the video's metadata. This can be done using the meta elements under the head element of the original HTML, e.g.:
<meta name="title" content="הישג נחמד: מיטיולנד סחטה 1:1 מליברפול | ספורט 5 - VOD" />
<meta name="description" content=" אתם מוזמנים לצפות בוידאו: הישג נחמד: מיטיולנד סחטה 1:1 מליברפול באזור ה VOD של ערוץ הספורט - תקצירים, תוכניות וחדשות ספורט במרחק לחיצה > " />
<meta name="keywords" content="כדורגל עולמי, ליגת האלופות, ליברפול" />
<link rel="canonical" href="vod.sport5.co.il" />
<link rel="image_src" href="https://www.sport5.co.il/Sip_Storage/FILES/2/1058832.jpg" />
<link rel="video_src" href="vod.sport5.co.il"/>
<meta name="video_height" content="625" />
<meta name="video_width" content="354" />
<meta name="video_type" content="application/x-shockwave-flash" />
<meta property="og:locale" content="he-IL" />
<meta property="og:url" content="https://vod.sport5.co.il/?Vc=10195&Vi=355711" />
<meta property="og:site_name" content="ערוץ הספורט" />
<meta property="og:type" content="video.movie" />
<meta property="og:video:type" content="application/x-shockwave-flash" />
<meta property="og:video:height" content="625" />
<meta property="og:video:width" content="354" />
<meta property="og:title" content="הישג נחמד: מיטיולנד סחטה 1:1 מליברפול | ספורט 5 - VOD" />
<meta property="og:description" content=" אתם מוזמנים לצפות בוידאו: הישג נחמד: מיטיולנד סחטה 1:1 מליברפול באזור ה VOD של ערוץ הספורט - תקצירים, תוכניות וחדשות ספורט במרחק לחיצה > " />
<meta property="og:video" content="https://vod.sport5.co.il/?Vc=10195&Vi=355711" />
<meta property="og:image" content="https://www.sport5.co.il/Sip_Storage/FILES/2/1058832.jpg" />
<meta property="og:date_published" content="18:58 | 09.12.20" />
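A minimal sketch of reading those tags with the standard library, for illustration only. The names `MetaCollector`, `collect_meta` and `parse_published` are hypothetical, and the date format `HH:MM | DD.MM.YY` is an assumption based on the single `og:date_published` value shown above.

```python
from datetime import datetime
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect content values of <meta name=...> and <meta property=...> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here.
        if tag != 'meta':
            return
        attrs = dict(attrs)
        key = attrs.get('property') or attrs.get('name')
        content = attrs.get('content')
        if key and content is not None:
            # Keep the first value seen for each key.
            self.meta.setdefault(key, content.strip())


def collect_meta(html):
    parser = MetaCollector()
    parser.feed(html)
    return parser.meta


def parse_published(value):
    # Assumed format: 'HH:MM | DD.MM.YY', as in '18:58 | 09.12.20'.
    return datetime.strptime(value, '%H:%M | %d.%m.%y')


# A few of the tags quoted above.
SAMPLE_HEAD = '''
<meta name="video_height" content="625" />
<meta property="og:image" content="https://www.sport5.co.il/Sip_Storage/FILES/2/1058832.jpg" />
<meta property="og:date_published" content="18:58 | 09.12.20" />
'''

meta = collect_meta(SAMPLE_HEAD)
published = parse_published(meta['og:date_published'])
```

Inside the extractor itself, the existing helpers (for example `_og_search_title`, which the patch further down uses) would be the natural choice. The sketch only shows that the data is reachable from the page head.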
If someone wants to make a PR and fix it, go ahead. If I have time on the weekend, I'll give it a try :)
The example page has an ld+json block, but we don't find it because utils.JSON_LD_RE is defective: it doesn't allow for spaces around = in attr=value expressions in the HTML script tag. Once we fix that, this patch looks promising:
--- old/youtube_dl/extractor/sport5.py
+++ new/youtube_dl/extractor/sport5.py
@@ -4,7 +4,16 @@
 import re

 from .common import InfoExtractor
-from ..utils import ExtractorError
+from ..compat import (
+    compat_parse_qs,
+    compat_urllib_parse_urlparse,
+)
+from ..utils import (
+    determine_ext,
+    ExtractorError,
+    NO_DEFAULT,
+    url_or_none,
+)


 class Sport5IE(InfoExtractor):
@@ -35,13 +44,7 @@
         }
     ]

-    def _real_extract(self, url):
-        mobj = re.match(self._VALID_URL, url)
-        media_id = mobj.group('id')
-
-        webpage = self._download_webpage(url, media_id)
-
-        video_id = self._html_search_regex(r'clipId=([\w-]+)', webpage, 'video id')
+    def _old_extract(self, video_id):

         metadata = self._download_xml(
             'http://sport5-metadata-rr-d.nsacdn.com/vod/vod/%s/HDS/metadata.xml' % video_id,
@@ -90,3 +93,37 @@
             'categories': categories,
             'formats': formats,
         }
+
+    def _real_extract(self, url):
+        mobj = re.match(self._VALID_URL, url)
+        media_id = mobj.group('id')
+
+        webpage = self._download_webpage(url, media_id)
+
+        video_id = self._html_search_regex(r'clipId=([\w-]+)', webpage, 'video id', default=None)
+
+        if video_id:
+            return self._old_extract(video_id)
+
+        info = self._search_json_ld(webpage, media_id)
+        if not info.get('title'):
+            info['title'] = self._og_search_title(webpage)
+        qs = compat_parse_qs(compat_urllib_parse_urlparse(info.get('url', '')).query)
+        c_url = url_or_none(qs.get('videoUrl', [''])[-1].split('___', 1)[0])
+        formats = []
+        if c_url:
+            ext = determine_ext(c_url)
+            if ext == 'm3u8':
+                formats.extend(self._extract_m3u8_formats(
+                    c_url, video_id, 'mp4', 'm3u8_native',
+                    m3u8_id='hls', fatal=False))
+            else:
+                formats.append({
+                    'url': c_url,
+                })
+        self._sort_formats(formats)
+        info.update({
+            'id': media_id,
+            'formats': formats,
+        })
+        return info
I left the original extraction in place in case there is still a page where it works; otherwise the extraction follows the post above. We don't get duration or categories with the new extraction.
$ python -m youtube_dl -v -F 'https://vod.sport5.co.il/?Vc=10195&Vi=355711'
[debug] System config: [u'--prefer-ffmpeg']
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-v', u'-F', u'https://vod.sport5.co.il/?Vc=10195&Vi=355711']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: 2948a02f7
[debug] Python version 2.7.17 (CPython) - Linux-4.4.0-210-generic-i686-with-Ubuntu-16.04-xenial
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[Sport5] 355711: Downloading webpage
[Sport5] Downloading m3u8 information
[info] Available formats for 355711:
format code  extension  resolution note
hls-488      mp4        640x360     488k , avc1.66.30, mp4a.40.2
hls-780      mp4        768x432     780k , avc1.66.30, mp4a.40.2
hls-1170     mp4        1024x576   1170k , avc1.66.30, mp4a.40.2
hls-1847     mp4        1280x720   1847k , avc1.66.30, mp4a.40.2 (best)
$
The replacement JSON_LD_RE:
r'''(?is)<script\b[^>]+\btype\s*=\s*(["']?)application/ld\+json\1[^>]*>(?P<json_ld>.+?)</script>'''
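To see the effect of the added `\s*=\s*`, here is a small self-contained check. The sample markup is made up for the test and is not copied from the Sport5 page. `STRICT_RE` is a hypothetical stricter pattern written here only for comparison, not the value from utils, and `find_json_ld` is an illustrative helper name.

```python
import json
import re

# Replacement pattern quoted above.
JSON_LD_RE = r'''(?is)<script\b[^>]+\btype\s*=\s*(["']?)application/ld\+json\1[^>]*>(?P<json_ld>.+?)</script>'''

# Hypothetical stricter pattern, for comparison only:
# it requires a literal 'type=' with no whitespace around '='.
STRICT_RE = r'''(?is)<script\b[^>]+\btype=(["']?)application/ld\+json\1[^>]*>(?P<json_ld>.+?)</script>'''

# Made-up sample with spaces around '=' in the script tag.
SAMPLE = '''<script type = "application/ld+json">
{"@type": "VideoObject", "name": "sample clip"}
</script>'''


def find_json_ld(html, pattern=JSON_LD_RE):
    """Return the parsed JSON-LD payload of the first matching script tag, or None."""
    m = re.search(pattern, html)
    return json.loads(m.group('json_ld')) if m else None


tolerant_result = find_json_ld(SAMPLE)
strict_result = find_json_ld(SAMPLE, STRICT_RE)
```

With this sample the tolerant pattern finds the payload, while the stricter one finds nothing.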
@dirkf nice work there. I could only get the JSON-LD regex to match after pretty-printing the response HTML, and I thought that wasn't the nicest way to do it.
Care to submit a PR?
In my test code I actually copied the _search_json_ld() method into the extractor and patched the regex into it. The PR will fix the original value, so as to avoid that.
- Existing test 1 has a clip ID as the value of the videoUrl query parameter.
- Existing test 2 has a video iframe whose src link contains a clip ID and leads to a JS player with the code for handling a clip ID.
- With code for these two cases in place we can get metadata for these two test URLs, but the actual media links from the m3u8 manifests give 404.
- There doesn't appear to be any geo-blocking of these URLs in the UK.
Both test cases also fail on the original site (sport5), which shows a "video unavailable" error message. I guess that's because they are too old (from 2014). I can provide recent videos as tests if needed.