Sport5.co.il is broken

Open Ghost93 opened this issue 4 years ago • 7 comments

Checklist

  • [x] I'm reporting a broken site support
  • [x] I've verified that I'm running youtube-dl version 2020.12.12
  • [x] I've checked that all provided URLs are alive and playable in a browser
  • [x] I've checked that all URLs and arguments with special characters are properly quoted or escaped
  • [x] I've searched the bugtracker for similar issues including closed ones

Verbose log

> youtube-dl --version
2020.12.12

> youtube-dl --verbose "https://vod.sport5.co.il/?Vc=10195&Vi=355711"
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--verbose', 'https://vod.sport5.co.il/?Vc=10195&Vi=355711']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2020.12.12
[debug] Python version 3.9.0 (CPython) - Linux-5.9.13-arch1-1-x86_64-with-glibc2.32
[debug] exe versions: ffmpeg 4.3.1, ffprobe 4.3.1, rtmpdump 2.4
[debug] Proxy map: {}
[Sport5] 355711: Downloading webpage
ERROR: Unable to extract video id; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/youtube_dl/YoutubeDL.py", line 803, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/lib/python3.9/site-packages/youtube_dl/YoutubeDL.py", line 824, in __extract_info
    ie_result = ie.extract(url)
  File "/usr/lib/python3.9/site-packages/youtube_dl/extractor/common.py", line 532, in extract
    ie_result = self._real_extract(url)
  File "/usr/lib/python3.9/site-packages/youtube_dl/extractor/sport5.py", line 44, in _real_extract
    video_id = self._html_search_regex(r'clipId=([\w-]+)', webpage, 'video id')
  File "/usr/lib/python3.9/site-packages/youtube_dl/extractor/common.py", line 1019, in _html_search_regex
    res = self._search_regex(pattern, string, name, default, fatal, flags, group)
  File "/usr/lib/python3.9/site-packages/youtube_dl/extractor/common.py", line 1010, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)
youtube_dl.utils.RegexNotFoundError: Unable to extract video id; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

Description

Hey, the Sport5.co.il extractor is not working. See the verbose log above.

Ghost93 avatar Dec 13 '20 09:12 Ghost93

Play the video and press F12 to open the browser's developer tools. Select the Network tab with the "All" filter, type m3u8 in the search box, highlight master.m3u8, right-click it and copy the URL, then pass that URL to youtube-dl to download the video.

october262 avatar Dec 13 '20 21:12 october262

@october262 Using the link I posted above and your method, searching the HTML body for master.m3u8 gives this element:

<iframe src="https://playern.sport5.co.il/vidclean/sportsvid/player.html?videoUrl=https://rgesport5-vh.akamaihd.net/i/bynet/sport5/sport5/PRV5/HSDioIMoGN/App/NM_VTR_TAK_MIDTJYLLAND_VS_LIVERPOOL_091220_,400,700,1100,1800,.mp4.csmil/master.m3u8&amp;Type=vod&amp;ShowRecommended=false&amp;folder_id=808&amp;ShowAdvertisement=True&amp;iOSPosterImage=https://www.sport5.co.il/Sip_Storage/FILES/2/1058832.jpg&amp;useAkamai=true" id="0" width="670px" class="frmVideo" height="378px" scrolling="no" frameborder="0" allow="autoplay" allowfullscreen="" style="height: 378px;"></iframe>

If we take the src attribute value of the iframe we'll get:

https://playern.sport5.co.il/vidclean/sportsvid/player.html?videoUrl=https://rgesport5-vh.akamaihd.net/i/bynet/sport5/sport5/PRV5/HSDioIMoGN/App/NM_VTR_TAK_MIDTJYLLAND_VS_LIVERPOOL_091220_,400,700,1100,1800,.mp4.csmil/master.m3u8&amp;Type=vod&amp;ShowRecommended=false&amp;folder_id=808&amp;ShowAdvertisement=True&amp;iOSPosterImage=https://www.sport5.co.il/Sip_Storage/FILES/2/1058832.jpg&amp;useAkamai=true

And if we take the videoUrl query parameter out of this, we get:

https://rgesport5-vh.akamaihd.net/i/bynet/sport5/sport5/PRV5/HSDioIMoGN/App/NM_VTR_TAK_MIDTJYLLAND_VS_LIVERPOOL_091220_,400,700,1100,1800,.mp4.csmil/master.m3u8

And when invoking youtube-dl on this url it works :smile:
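
For reference, the three steps above can be sketched with only the Python 3 standard library. This is a minimal illustration, not the extractor's code: `extract_video_url` is a hypothetical helper, the regex assumes a double-quoted `src` attribute on an iframe, and the sample markup is a shortened copy of the element above.

```python
import html
import re
from urllib.parse import parse_qs, urlparse

# Shortened copy of the iframe from the page (most attributes omitted).
SAMPLE = (
    '<iframe src="https://playern.sport5.co.il/vidclean/sportsvid/player.html'
    '?videoUrl=https://rgesport5-vh.akamaihd.net/i/bynet/sport5/sport5/PRV5/'
    'HSDioIMoGN/App/NM_VTR_TAK_MIDTJYLLAND_VS_LIVERPOOL_091220_'
    ',400,700,1100,1800,.mp4.csmil/master.m3u8'
    '&amp;Type=vod&amp;ShowRecommended=false&amp;useAkamai=true" '
    'id="0" class="frmVideo"></iframe>'
)


def extract_video_url(webpage):
    """Return the videoUrl parameter of an iframe src, or None (hypothetical helper)."""
    match = re.search(r'<iframe\b[^>]*\bsrc="([^"]+)"', webpage)
    if not match:
        return None
    # The attribute value is HTML-escaped (&amp;), so unescape it first.
    src = html.unescape(match.group(1))
    values = parse_qs(urlparse(src).query).get('videoUrl')
    return values[0] if values else None


print(extract_video_url(SAMPLE))
```

The printed value is the master.m3u8 URL that can then be passed to youtube-dl.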

Thanks!

Now we only need to extract the video's metadata. This can easily be done using the meta elements under the head element of the original HTML, e.g.:

    
<meta name="title" content="הישג נחמד: מיטיולנד סחטה 1:1 מליברפול | ספורט 5 - VOD" />
<meta name="description" content=" אתם מוזמנים לצפות בוידאו: הישג נחמד: מיטיולנד סחטה 1:1 מליברפול באזור ה VOD של ערוץ הספורט - תקצירים, תוכניות וחדשות ספורט במרחק לחיצה > " />
<meta name="keywords" content="כדורגל עולמי, ליגת האלופות, ליברפול" />
<link rel="canonical" href="vod.sport5.co.il" />
<link rel="image_src" href="https://www.sport5.co.il/Sip_Storage/FILES/2/1058832.jpg" />
<link rel="video_src" href="vod.sport5.co.il"/>
<meta name="video_height" content="625" />
<meta name="video_width" content="354" />
<meta name="video_type" content="application/x-shockwave-flash" />
<meta property="og:locale" content="he-IL" />
<meta property="og:url" content="https://vod.sport5.co.il/?Vc=10195&Vi=355711" />
<meta property="og:site_name" content="ערוץ הספורט" />
<meta property="og:type" content="video.movie" />
<meta property="og:video:type" content="application/x-shockwave-flash" />
<meta property="og:video:height" content="625" />
<meta property="og:video:width" content="354" />
<meta property="og:title" content="הישג נחמד: מיטיולנד סחטה 1:1 מליברפול | ספורט 5 - VOD" />
<meta property="og:description" content=" אתם מוזמנים לצפות בוידאו: הישג נחמד: מיטיולנד סחטה 1:1 מליברפול באזור ה VOD של ערוץ הספורט - תקצירים, תוכניות וחדשות ספורט במרחק לחיצה > " />
<meta property="og:video" content="https://vod.sport5.co.il/?Vc=10195&Vi=355711" />
<meta property="og:image" content="https://www.sport5.co.il/Sip_Storage/FILES/2/1058832.jpg" />

<meta property="og:date_published" content="18:58 | 09.12.20" />
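
A minimal sketch of that metadata step, assuming Python 3 and only the standard library (the real extractor would more likely use youtube-dl's own helpers). `MetaCollector` and `parse_published` are hypothetical names, the sample is a small subset of the tags above, and the day.month.year reading of `og:date_published` is an assumption based on this one example.

```python
from datetime import datetime
from html.parser import HTMLParser

# Small subset of the meta tags listed above.
SAMPLE = '''
<meta name="video_height" content="625" />
<meta property="og:title" content="הישג נחמד: מיטיולנד סחטה 1:1 מליברפול | ספורט 5 - VOD" />
<meta property="og:image" content="https://www.sport5.co.il/Sip_Storage/FILES/2/1058832.jpg" />
<meta property="og:date_published" content="18:58 | 09.12.20" />
'''


class MetaCollector(HTMLParser):
    """Collect <meta> tags into a dict keyed by property (or name)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'meta':
            return
        attrs = dict(attrs)
        key = attrs.get('property') or attrs.get('name')
        if key and attrs.get('content') is not None:
            # Keep the first value if a key is repeated.
            self.meta.setdefault(key, attrs['content'])


def parse_published(value):
    """Parse 'HH:MM | DD.MM.YY' (assumed format) into a datetime."""
    return datetime.strptime(value, '%H:%M | %d.%m.%y')


collector = MetaCollector()
collector.feed(SAMPLE)
meta = collector.meta
print(meta['og:title'])
print(meta['og:image'])
print(parse_published(meta['og:date_published']).isoformat())
```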

If someone wants to make a PR and fix it, go ahead. If I have time on the weekend, I'll give it a try :)

Ghost93 avatar Dec 14 '20 07:12 Ghost93

The example page has a ld+json block but we don't find it because utils.JSON_LD_RE is defective: it doesn't allow for spaces around = in attr=value expressions in the HTML script tag. Once we fix that, this patch looks promising:

--- old/youtube_dl/extractor/sport5.py
+++ new/youtube_dl/extractor/sport5.py
@@ -4,7 +4,16 @@
 import re
 
 from .common import InfoExtractor
-from ..utils import ExtractorError
+from ..compat import (
+    compat_parse_qs,
+    compat_urllib_parse_urlparse,
+)
+from ..utils import (
+    determine_ext,
+    ExtractorError,
+    NO_DEFAULT,
+    url_or_none,
+)
 
 
 class Sport5IE(InfoExtractor):
@@ -35,13 +44,7 @@
         }
     ]
 
-    def _real_extract(self, url):
-        mobj = re.match(self._VALID_URL, url)
-        media_id = mobj.group('id')
-
-        webpage = self._download_webpage(url, media_id)
-
-        video_id = self._html_search_regex(r'clipId=([\w-]+)', webpage, 'video id')
+    def _old_extract(self, video_id):
 
         metadata = self._download_xml(
             'http://sport5-metadata-rr-d.nsacdn.com/vod/vod/%s/HDS/metadata.xml' % video_id,
@@ -90,3 +93,37 @@
             'categories': categories,
             'formats': formats,
         }
+
+    def _real_extract(self, url):
+        mobj = re.match(self._VALID_URL, url)
+        media_id = mobj.group('id')
+
+        webpage = self._download_webpage(url, media_id)
+
+        video_id = self._html_search_regex(r'clipId=([\w-]+)', webpage, 'video id', default=None)
+
+        if video_id:
+            return self._old_extract(video_id)
+
+        info = self._search_json_ld(webpage, media_id)
+        if not info.get('title'):
+            info['title'] = self._og_search_title(webpage)
+        qs = compat_parse_qs(compat_urllib_parse_urlparse(info.get('url', '')).query)
+        c_url = url_or_none(qs.get('videoUrl', [''])[-1].split('___', 1)[0])
+        formats = []
+        if c_url:
+            ext = determine_ext(c_url)
+            if ext == 'm3u8':
+                formats.extend(self._extract_m3u8_formats(
+                    c_url, video_id, 'mp4', 'm3u8_native',
+                    m3u8_id='hls', fatal=False))
+            else:
+                formats.append({
+                    'url': c_url,
+                })
+        self._sort_formats(formats)
+        info.update({
+            'id': media_id,
+            'formats': formats,
+        })
+        return info

I left the original extraction in case there is still a page where it would work, but otherwise the extraction follows the post above. We don't get duration or categories with the new extraction.

$ python -m youtube_dl -v -F 'https://vod.sport5.co.il/?Vc=10195&Vi=355711'
[debug] System config: [u'--prefer-ffmpeg']
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-v', u'-F', u'https://vod.sport5.co.il/?Vc=10195&Vi=355711']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: 2948a02f7
[debug] Python version 2.7.17 (CPython) - Linux-4.4.0-210-generic-i686-with-Ubuntu-16.04-xenial
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[Sport5] 355711: Downloading webpage
[Sport5] Downloading m3u8 information
[info] Available formats for 355711:
format code  extension  resolution note
hls-488      mp4        640x360     488k , avc1.66.30, mp4a.40.2
hls-780      mp4        768x432     780k , avc1.66.30, mp4a.40.2
hls-1170     mp4        1024x576   1170k , avc1.66.30, mp4a.40.2
hls-1847     mp4        1280x720   1847k , avc1.66.30, mp4a.40.2 (best)
$

The replacement JSON_LD_RE:

r'''(?is)<script\b[^>]+\btype\s*=\s*(["']?)application/ld\+json\1[^>]*>(?P<json_ld>.+?)</script>'''
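
To see what the replacement changes, here is a small check that can be run on its own. It is only a sketch: `STRICT_RE` is a simplified pattern written for this illustration (it is not copied from `utils.py`) that, like the defect described above, does not allow whitespace around `=`, and the sample `<script>` tag and its JSON are hypothetical rather than taken from the sport5 page.

```python
import json
import re

# Simplified stand-in (an assumption) for a pattern that requires type=... with no spaces.
STRICT_RE = r'''(?is)<script\b[^>]+\btype=(["']?)application/ld\+json\1[^>]*>(?P<json_ld>.+?)</script>'''
# The replacement pattern quoted above.
JSON_LD_RE = r'''(?is)<script\b[^>]+\btype\s*=\s*(["']?)application/ld\+json\1[^>]*>(?P<json_ld>.+?)</script>'''

# Hypothetical markup with spaces around "=" in the script tag.
SAMPLE = '''<script type = "application/ld+json">
{"@type": "VideoObject", "name": "example"}
</script>'''

strict_match = re.search(STRICT_RE, SAMPLE)
loose_match = re.search(JSON_LD_RE, SAMPLE)

# The strict pattern misses the block; the replacement finds it.
print(strict_match is None)
data = json.loads(loose_match.group('json_ld'))
print(data['name'])
```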

dirkf avatar Oct 08 '22 19:10 dirkf

@dirkf nice work there. I could only get the JSON-LD regex to match after pretty-printing the response HTML, and that didn't seem like the nicest way to do it.

Care to submit a PR?

Ghost93 avatar Oct 08 '22 20:10 Ghost93

In my test code I actually copied the _search_json_ld() method into the extractor and patched the regex into it. The PR will fix the original value, so as to avoid that.

dirkf avatar Oct 08 '22 23:10 dirkf

  1. Existing test 1 has a clip ID as the value of the videoUrl query parameter.
  2. Existing test 2 has a video iframe whose src link contains a clip ID and leads to a JS player with the code for handling a clip ID.
  3. With code for these two cases in place we can get metadata for these two test URLs, but the actual media links from the m3u8 manifests give 404.
  4. There doesn't appear to be any geo-blocking of these URLs in the UK.

dirkf avatar Oct 09 '22 04:10 dirkf

Both test cases fail on the original site (sport5), where an error message says "video unavailable". I guess it's because they are too old (from 2014). I can provide recent videos as tests if needed.

Ghost93 avatar Oct 09 '22 08:10 Ghost93