feedparser
Issues with the Media-RSS implementation
Hello, I noticed some issues with the Media-RSS implementation. Before trying to fix them, I would like to discuss them here.
media:group is ignored
According to the Media-RSS specification, the <media:group> tag groups several links or representations of the same media. However, my understanding is that feedparser ignores this tag and treats every <media:content> as a new media.
It allows grouping of media:content elements that are effectively the same content, yet different representations. For instance: the same song recorded in both the WAV and MP3 format. It's an optional element that must only be used for this purpose.
https://github.com/kurtmckee/feedparser/blob/d12d3bdd075bca71885ccb02e9b08ac04fcb8514/feedparser/namespaces/mediarss.py#L64-L66 https://github.com/kurtmckee/feedparser/blob/d12d3bdd075bca71885ccb02e9b08ac04fcb8514/feedparser/namespaces/mediarss.py#L119-L122
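To make the expected grouping concrete, here is a minimal sketch that uses only the standard library (not feedparser itself). The sample feed and the `group_media` helper are assumptions made up for illustration; the sketch only shows the shape a group-aware parser could return, with one list of representations per media.

```python
import xml.etree.ElementTree as ET

MEDIA_NS = {"media": "http://search.yahoo.com/mrss/"}

# Hypothetical feed: one song in two formats, plus a standalone image.
SAMPLE = """\
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>Song</title>
      <media:group>
        <media:content url="http://www.foo.com/song.wav" type="audio/wav"/>
        <media:content url="http://www.foo.com/song.mp3" type="audio/mpeg"/>
      </media:group>
      <media:content url="http://www.foo.com/cover.jpg" type="image/jpeg"/>
    </item>
  </channel>
</rss>
"""


def group_media(item):
    """Return a list of media, each being a list of representation dicts."""
    media = []
    # Each <media:group> is ONE media with several representations.
    for group in item.findall("media:group", MEDIA_NS):
        media.append(
            [dict(c.attrib) for c in group.findall("media:content", MEDIA_NS)]
        )
    # A <media:content> directly under the item is a media on its own.
    for content in item.findall("media:content", MEDIA_NS):
        media.append([dict(content.attrib)])
    return media


item = ET.fromstring(SAMPLE).find("channel/item")
grouped = group_media(item)
for representations in grouped:
    print([r["type"] for r in representations])
```

With this shape, the WAV and MP3 files stay attached to the same media instead of being reported as two unrelated ones.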
The description is set on the feed entry
The <media:description> tag belongs to the media, but feedparser updates the feed entry description.
https://github.com/kurtmckee/feedparser/blob/d12d3bdd075bca71885ccb02e9b08ac04fcb8514/feedparser/namespaces/mediarss.py#L91-L95
Some tags are missing
For instance, the <media:subtitle> tag is not handled by feedparser.
Attributes are ignored
When tags are handled, many of the attributes defined in the Media-RSS specification are ignored. For instance, <media:description> can be either plain text or HTML, but feedparser does not distinguish between the two.
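As an illustration of the last two points, the sketch below reads each <media:description> together with its `type` attribute and keeps it attached to its parent media. It uses only the standard library; the sample item and the `media_descriptions` helper are hypothetical, and the fallback to "plain" follows the default given in the Media-RSS specification.

```python
import xml.etree.ElementTree as ET

MEDIA_NS = {"media": "http://search.yahoo.com/mrss/"}

# Hypothetical item with one HTML description and one untyped description.
SAMPLE = """\
<item xmlns:media="http://search.yahoo.com/mrss/">
  <media:content url="http://www.foo.com/movie.mov">
    <media:description type="html">&lt;b&gt;Great&lt;/b&gt; video</media:description>
  </media:content>
  <media:content url="http://www.foo.com/trailer.mov">
    <media:description>Plain trailer text</media:description>
  </media:content>
</item>
"""


def media_descriptions(item):
    """Return one dict per media that carries a description."""
    result = []
    for content in item.findall("media:content", MEDIA_NS):
        desc = content.find("media:description", MEDIA_NS)
        if desc is None:
            continue
        result.append(
            {
                "url": content.get("url"),
                # The spec defaults the type to "plain" when it is omitted.
                "type": desc.get("type", "plain"),
                "value": (desc.text or "").strip(),
            }
        )
    return result


descriptions = media_descriptions(ET.fromstring(SAMPLE))
for d in descriptions:
    print(d["url"], d["type"], d["value"])
```

Here the description stays on the media it describes, and a consumer can decide whether to escape or render the value based on `type`.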
So...
I would like to tackle these issues, but there could be some backward-compatibility problems. How should I handle this? I believe Media-RSS is not widely used, and the simplest option for me is to break compatibility so that feedparser correctly respects the specification. What do you think?
Could you please give us a short description of what Media-RSS is for? A real use case might make it easier to understand.
Of course. Media-RSS is used to describe media, such as audio or video files, and their metadata (thumbnails, description, number of views or listens, rating, links to the media in different formats, etc.).
It is used in every YouTube feed (example) and in PeerTube feeds (example, though support should improve in an upcoming version).
I have the same issue. Did you solve it?
Actually, this would take some time to fix. I am willing to write a patch, but I would like to be sure that it will be merged in the end before I start.
@kurtmckee What do you think?
This is something we are very interested in as well, especially when it comes to children of media:content, such as media:title (i.e. associating, for example, image titles with the images themselves).
I have started work on a patch but the changes are breaking at this time (see example below).
Main changes:
1. media:group (not part of the example below) and media:content are now containers, as expected. media:group may contain media:content elements.
2. media:{x} now generates media_{x} keys instead of {x} keys. The keys previously known as media_{x} are now known as media_{x}_details (this is mainly to make tags distinguishable from attributes of the parent media:{x}).
3. media:title is no longer used as a fallback for a missing title (a consequence of 2. above; fixable, but probably violating expectations?).
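The renaming in the second point can be sketched as a small helper. This is not code from the patch: `rename_media_keys` is a hypothetical name, and restricting it to "credit" and "rating" is an assumption based on the example output, where only those keys change.

```python
def rename_media_keys(old, names=("credit", "rating")):
    """Apply the proposed renaming to a shallow copy of a parsed dict.

    For each name x: media_x becomes media_x_details, then x becomes media_x.
    """
    new = dict(old)
    for name in names:
        media_key = f"media_{name}"
        # Move the attribute dict(s) out of the way first ...
        if media_key in new:
            new[f"{media_key}_details"] = new.pop(media_key)
        # ... so the plain value can take over the media_ key.
        if name in new:
            new[media_key] = new.pop(name)
    return new


old = {
    "media_credit": [{"role": "artist", "content": "artist's name"}],
    "credit": "artist's name",
    "media_rating": {"content": "nonadult"},
    "rating": "nonadult",
}
new = rename_media_keys(old)
print(sorted(new))
```

The order of the two steps matters: moving media_{x} to media_{x}_details first avoids overwriting it when {x} is renamed.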
Any thoughts on these changes and how they affect the parsed data?
@azmeuk Is this in line with what you had in mind or were you planning on something different?
@kurtmckee Is this in line with the project as a whole?
Input file
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/"
xmlns:dcterms="http://purl.org/dc/terms/">
<channel>
<title>Music Videos 101</title>
<link>http://www.foo.com</link>
<description>Discussions of great videos</description>
<item>
<title>The latest video from an artist</title>
<link>http://www.foo.com/item1.htm</link>
<media:content url="http://www.foo.com/movie.mov" fileSize="12216320" type="video/quicktime" expression="full">
<media:player url="http://www.foo.com/player?id=1111" height="200" width="400" />
<media:hash algo="md5">dfdec888b72151965a34b4b59031290a</media:hash>
<media:credit role="producer">producer's name</media:credit>
<media:credit role="artist">artist's name</media:credit>
<media:category scheme="http://blah.com/scheme">
music/artistname/album/song
</media:category>
<media:text type="plain">
Oh, say, can you see, by the dawn's early light
</media:text>
<media:rating>nonadult</media:rating>
<dcterms:valid>
start=2002-10-13T09:00+01:00;
end=2002-10-17T17:00+01:00;
scheme=W3C-DTF
</dcterms:valid>
</media:content>
</item>
</channel>
</rss>
Parsed data WITHOUT changes
[
{
"title": "The latest video from an artist",
"title_detail": {
"type": "text/plain",
"language": null,
"base": "",
"value": "The latest video from an artist"
},
"links": [
{
"rel": "alternate",
"type": "text/html",
"href": "http://www.foo.com/item1.htm"
}
],
"link": "http://www.foo.com/item1.htm",
"media_content": [
{
"url": "http://www.foo.com/movie.mov",
"filesize": "12216320",
"type": "video/quicktime",
"expression": "full"
}
],
"media_player": {
"url": "http://www.foo.com/player?id=1111",
"height": "200",
"width": "400",
"content": ""
},
"media_hash": {
"algo": "md5"
},
"media_credit": [
{
"role": "producer",
"content": "producer's name"
},
{
"role": "artist",
"content": "artist's name"
}
],
"credit": "artist's name",
"tags": [
{
"term": "music/artistname/album/song",
"scheme": "http://blah.com/scheme",
"label": null
}
],
"media_text": {
"type": "plain"
},
"media_rating": {
"content": "nonadult"
},
"rating": "nonadult",
"validity": "start=2002-10-13T09:00+01:00;\n end=2002-10-17T17:00+01:00;\n scheme=W3C-DTF",
"validity_start": "2002-10-13T09:00+01:00",
"validity_start_parsed": [
2002,
10,
13,
8,
0,
0,
6,
286,
0
]
}
]
Parsed data WITH changes
[
{
"title": "The latest video from an artist",
"title_detail": {
"type": "text/plain",
"language": null,
"base": "",
"value": "The latest video from an artist"
},
"links": [
{
"rel": "alternate",
"type": "text/html",
"href": "http://www.foo.com/item1.htm"
}
],
"link": "http://www.foo.com/item1.htm",
"media_content": [
{
"url": "http://www.foo.com/movie.mov",
"filesize": "12216320",
"type": "video/quicktime",
"expression": "full",
"media_player": {
"url": "http://www.foo.com/player?id=1111",
"height": "200",
"width": "400",
"content": ""
},
"media_hash": {
"algo": "md5"
},
"media_credit_details": [
{
"role": "producer",
"content": "producer's name"
},
{
"role": "artist",
"content": "artist's name"
}
],
"media_credit": "artist's name",
"tags": [
{
"term": "music/artistname/album/song",
"scheme": "http://blah.com/scheme",
"label": null
}
],
"media_text": {
"type": "plain"
},
"media_rating_details": {
"content": "nonadult"
},
"media_rating": "nonadult",
"validity": "start=2002-10-13T09:00+01:00;\n end=2002-10-17T17:00+01:00;\n scheme=W3C-DTF",
"validity_start": "2002-10-13T09:00+01:00",
"validity_start_parsed": [
2002,
10,
13,
8,
0,
0,
6,
286,
0
]
}
]
}
]
Output diff
...
"media_content": [
{
"url": "http://www.foo.com/movie.mov",
"filesize": "12216320",
"type": "video/quicktime",
- "expression": "full"
- }
- ],
+ "expression": "full",
"media_player": {
"url": "http://www.foo.com/player?id=1111",
"height": "200",
...
- "media_credit": [
+ "media_credit_details": [
{
"role": "producer",
"content": "producer's name"
},
{
"role": "artist",
"content": "artist's name"
}
],
- "credit": "artist's name",
+ "media_credit": "artist's name",
...
"media_text": {
"type": "plain"
},
- "media_rating": {
+ "media_rating_details": {
"content": "nonadult"
},
- "rating": "nonadult",
+ "media_rating": "nonadult",
"validity": "start=2002-10-13T09:00+01:00;\n end=2002-10-17T17:00+01:00;\n scheme=W3C-DTF",
...
+ }
+]
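For anyone who wants to compare the two layouts on their own feeds, a diff like the one above can be produced with the standard library. This is only a sketch: `json_diff` is a hypothetical helper, and the two small dicts stand in for real parsed entries.

```python
import difflib
import json


def json_diff(before, after):
    """Return unified-diff lines between two JSON-serializable objects."""
    a = json.dumps(before, indent=2).splitlines()
    b = json.dumps(after, indent=2).splitlines()
    return list(
        difflib.unified_diff(
            a, b, fromfile="without changes", tofile="with changes", lineterm=""
        )
    )


# Stand-ins for a parsed entry before and after the proposed changes.
before = {"media_rating": {"content": "nonadult"}, "rating": "nonadult"}
after = {"media_rating_details": {"content": "nonadult"}, "media_rating": "nonadult"}

diff_lines = json_diff(before, after)
print("\n".join(diff_lines))
```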