Synchronized Narration
This is a first draft of a spec related to Synchronized Narrations, a JSON equivalent of EPUB's Media Overlays.
I feel that there's an opportunity to generalize the synchronization to any media type.
The spec mentions that we could extend it later by adding more media type-specific properties:
In case we decide to extend the structure to image and video, using `image` and `video` would be consistent with the latest work of the W3C CG.
But why not make it media-type agnostic instead? This could support all these use cases from the start:
- Small illustrations or sign-language videos explaining words or utterances in a text.
- Audio narration over a comic book.
- Subtitles over video-based publications.
Even text-on-text synchronization could open interesting possibilities:
- Synchronizing a publication and its translation, useful for:
  - Displaying the two versions side-by-side, to practice learning a language.
  - Displaying an accurate translation of a paragraph, when reading an ebook in a foreign language.
- Synchronizing a publication and a commentary, for example to display the notes side by side or in a margin.
  - I'm talking about "published author commentary", not user annotations. Think classical texts annotated with explanations for studies.
If we go down that road, "Synchronized Media" might be more accurate than "Narration".
I think for this we just need to:
- rename `text` to `source`, `master` or `primary`
- rename `audio` to `secondary` or something else
- rename `narration`
- add a way to specify the media type of the two resources, to know how to interpret the fragments
I'm also in favor of having full hrefs in the `narration` items, because this makes it possible to have several secondary resources for a single reading order item. This would be important for use cases like the illustrations or sign language videos.
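To make this more concrete, here's a rough, purely illustrative sketch of what a media-type agnostic structure with full hrefs could look like; the `sync`, `primary` and `secondary` names are placeholders, nothing is settled, and media types could be attached to each href (omitted here for brevity):

{
  "sync": [
    {
      "primary": "text/chapter1.html#par-1",
      "secondary": [
        "audio/chapter1.mp3#t=0,12.5",
        "video/signs/chapter1-par-1.mp4"
      ]
    },
    {
      "primary": "text/chapter1.html#par-2",
      "secondary": [
        "audio/chapter1.mp3#t=12.5,31"
      ]
    }
  ]
}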
"Synchronized Rendition" might be more accurate than "Narration"
"Sync Media" is probably a better choice than "Sync Rendition". The former is already the chosen term for the W3C draft which succeeds "Sync Narration", the latter has been in use since EPUB3 "Multiple Renditions" which has different semantics.
See:
https://w3c.github.io/sync-media-pub/sync-media.html#media-objects
https://w3c.github.io/epub-specs/epub33/multi-rend/
We've had a lot of discussions in the past about media overlays and we really need to dig back in there and read things before we merge any new document at this point.
The following documents are currently in our architecture repo:
- https://github.com/readium/architecture/blob/master/models/media-overlay/README.md
- https://github.com/readium/architecture/blob/master/models/media-overlay/syntax.md
There are also many issues that mention media overlays: https://github.com/readium/architecture/issues?q=is%3Aissue+media+overlay
But why not make it media-type agnostic instead? This could support all these use cases from the start
I've already made a similar proposal back in 2019, based on our RWPM Link Object, see: https://github.com/readium/architecture/issues/88
I believe we could find a middle ground between this proposal (where a Synchronized Media document is essentially based on our core model for RWPM) and something more specialized (similar to what we have in our architecture repo or this PR).
I always considered `alternate` to be an alternative rendition of a particular resource. The Synchronized Narration object seems more like an augmentation of the resource. But maybe I'm wrong on the semantics of `alternate`?
You're completely right that `alternate` and Synchronized Media/Media Overlays are very closely related to one another.
There's one major difference between the two of them:
- `alternate` in `readingOrder` or `resources` is limited to resource-level alternates
- Synchronized Media/Media Overlays operate at a fragment (or sub-resource) level
There are a few other places where we also work with fragments:
- guided navigation in Divina: https://readium.org/webpub-manifest/profiles/divina.html#4-guided-navigation
- table of contents: https://readium.org/webpub-manifest/#6-table-of-contents
- `pageList`, `loi`, `loa`, `lov` and `lot` collections in EPUB: https://readium.org/webpub-manifest/profiles/epub.html#3-collection-roles
One could argue that for a Divina with guided navigation, Synchronized Media would not be useful, as you could express the same type of information purely with `alternate`.
This is one of the reasons why we need to move beyond the EPUB point of view on this and think about a more generic approach that applies to all fragments across all media.
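To illustrate that argument with a purely hypothetical example, a guided navigation entry could already point to an audio fragment for the same panel through `alternate` (this is illustrative, not a settled proposal):

{
  "guided": [
    {
      "href": "http://example.org/page1.jpeg#xywh=0,0,300,200",
      "title": "Panel 1",
      "alternate": [
        {
          "href": "http://example.org/page1.mp3#t=11,25",
          "type": "audio/mpeg"
        }
      ]
    }
  ]
}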
There are several items to settle, which we can tackle in this order I guess:
- do we suppress the proposed `textRef` and `audioRef`, which have drawbacks described in previous comments?
- do we express a notion of "primary" (singular) and "secondary" (plural) resources, which opens the path to text-to-text mapping? if yes, how?
- do we use simplified link objects, with `href` and `type`, instead of `text` + `audio` properties, and `children` instead of sub-narration? (see the sketch after this list)
- in this case, what happens to structural semantics, i.e. the `role` property?
- do we move from `alternate` to a more specific property in the resource referencing the syncnarr structure?
- do we replace "sync narration" by "sync media"?
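To illustrate the third point above, here's a rough, purely hypothetical sketch of an entry built from simplified link objects and `children`; the `links` name and the `role` values are placeholders, not a proposal:

{
  "role": "paragraph",
  "children": [
    {
      "role": "sentence",
      "links": [
        { "href": "text/chapter1.html#sent-1", "type": "text/html" },
        { "href": "audio/chapter1.mp3#t=0,2.5", "type": "audio/mpeg" }
      ]
    },
    {
      "role": "sentence",
      "links": [
        { "href": "text/chapter1.html#sent-2", "type": "text/html" },
        { "href": "audio/chapter1.mp3#t=2.5,7.1", "type": "audio/mpeg" }
      ]
    }
  ]
}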
This issue might be relevant and shows there's a need for synchronization besides text-to-audio https://github.com/w3c/publishingcg/issues/20
Hi,
This issue might be relevant and shows there's a need for synchronization besides text-to-audio w3c/publishingcg#20
I'm the author of that issue and this discussion https://github.com/readium/webpub-manifest/discussions/74 about
Maybe we could make Synchronized Narration for comics and magazines like this:
{
"imageRef": "images/chapter1.jpeg",
"audioRef": "audio/chapter1.mp3",
"narration": [
{
"image": "#xywh=percent:5,5,15,15",
"audio": "#t=0.0,1.2"
},
{
"image": "#xywh=percent:20,20,25,25",
"audio": "#t=1.2,3.4"
},
{
"image": "#xywh=percent:5,45,30,30",
"audio": "#t=3.4,5.6"
}
]
}
Or should DiViNa's guided navigation be extended with an audio property?
Something like:
"guided": [
{
"href": "http://example.org/page1.jpeg",
"audio": "http://example.org/page1.mp3#t=0,11",
"title": "Page 1"
},
{
"href": "http://example.org/page1.jpeg#xywh=0,0,300,200",
"audio": "http://example.org/page1.mp3#t=11,25",
"title": "Panel 1"
},
{
"href": "http://example.org/page1.jpeg#xywh=300,200,310,200",
"audio": "http://example.org/page1.mp3#t=25,102",
"title": "Panel 2"
}
]
I don't like the name `audio`, but couldn't come up with something better.
I'm also a bit worried it is too verbose, generating larger-than-needed JSON files.
@m-abs Did you read this https://github.com/readium/webpub-manifest/pull/83#issuecomment-853111694? We are considering making the JSON media-type agnostic to also work with images. The draft spec is not up to date with this yet.
Sorry, I must have missed it last night.
- rename `text` to `source`, `master` or `primary`
- rename `audio` to `secondary` or something else
Could there be a use case where one would need more than just the two sources?
Maybe replace `text` and `audio` with `links`, an array of Link objects or `href` strings?
It could be mixing two texts as suggested and background music, or a comic book frame + speech bubble text + background music, or a way to implement #49.
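As a purely illustrative sketch of such a `links`-based item, mixing a comic book frame, its speech bubble text and background music (all property names and paths here are hypothetical):

{
  "links": [
    { "href": "images/page1.jpeg#xywh=percent:5,5,15,15", "type": "image/jpeg" },
    { "href": "text/page1.html#bubble-1", "type": "text/html" },
    { "href": "audio/background.mp3#t=0,12", "type": "audio/mpeg" }
  ]
}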
I'm also in favor of having full hrefs in the `narration` items, because this makes it possible to have several secondary resources for a single reading order item. This would be important for use cases like the illustrations or sign language videos.
I'm worried this could result in very large JSON files for large books with many sentences/paragraphs/etc.
(This is a problem we've had to deal with in our old app with our private format.)
@m-abs What was your approach regarding resources? A single mapping document for the whole publication or one per resource?
A single document for the whole publication.
We made a dump of the structure of DAISY 2.02 audiobooks (both with and without text) to our private JSON format in a single file.
This JSON contains a resource map from the local path/URI to the server path.
Some of these JSON files became very big and caused problems with our client app and for our users.
Could there be a use case where one would need more than just the two sources? Maybe replace `text` and `audio` with `links`, an array of Link objects or `href` strings?
I guess that would be fine, to have either a single link or a link array in `secondary`, as long as we have only a single primary/leading resource.
I'm worried this could result in very large JSON files for large books with many sentences/paragraphs/etc. (This is a problem we've had to deal with in our old app with our private format.)
By full HREFs I meant having the path of the resource relative to the self link, not necessarily a full URL. For example:
#xywh=percent:5,5,15,15 -> images/one.png#xywh=percent:5,5,15,15
Just for clarity's sake since I'm late to the party here... what is the sync file? We currently use .smil at my org for our ebooks. If we define some other format, I wonder if we could make some sort of easily swapped interface so that you could easily interpret other synchronization formats.
Also on a more technical note, and correct me if this is the wrong place to discuss this, but I wonder about the best way to resolve connected media such as audio, since packaging ebooks with potentially lots of audio files is not ideal: your packaged EPUB could end up being several GB of data if it's a longer book. I think it'd be advisable to design a system that allows us to resolve audio/media files stored outside the EPUB if so desired. It would also be useful in our use case to have the ability to stream audio instead of playing downloaded media. Just some thoughts for potentially making the implementation more flexible for these types of use cases.
One other thought I had... I got a basic implementation working about a year back where I could highlight the word that was playing etc. One issue I ran into though was how to keep the timecode check in sync with the playing audio without utilizing too much CPU. The few options I considered were:
- Just have a function get triggered, say, every 300 ms or so and check if the current word that's highlighted is still in the range of the media overlay timecode; if so do nothing, if not find the next one in the range. This solution is okay but can feel a bit laggy at times, especially if you are doing a word-by-word highlight. Obviously you can turn down the delay, but then you start eating more CPU.
- The second option I thought of was doing a postDelayed with a runnable, where the delay is always equal to the word duration, so you get called back when the current word should no longer be highlighted. This avoids using a lot of CPU since you are not adding any additional overhead that the message loop is not already incurring. The issue with this implementation is that there is a time drift that happens using this method because of the intrinsic delay of message handling and executing functions. After using this for about 30 seconds or maybe a minute you definitely notice the synchronization getting off. So ideally you could use the above method but have some sort of calculation to see how much internal timing error there is and adjust for that.
Any thoughts on the above notes?
Just for clarity's sake since I'm late to the party here... what is the sync file? We currently use .smil at my org for our ebooks. If we define some other format, I wonder if we could make some sort of easily swapped interface so that you could easily interpret other synchronization formats.
Right, this is the goal. As RWPM is an exchange format that can represent any type of publication in a Readium app, the Sync file would be a JSON representation of SMIL or other types of synchronization formats.
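For example, a SMIL `<par>` pairing a text fragment with an audio clip (`clipBegin`/`clipEnd`) could map to an entry like the ones in the current draft. This is only a sketch with made-up fragment identifiers, and the exact property names are still being discussed in this thread:

{
  "textRef": "text/chapter1.html",
  "audioRef": "audio/chapter1.mp3",
  "narration": [
    { "text": "#para-1", "audio": "#t=0,5.2" },
    { "text": "#para-2", "audio": "#t=5.2,12.8" }
  ]
}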
Also on a more technical note, and correct me if this is the wrong place to discuss this, but I wonder about the best way to resolve connected media such as audio, since packaging ebooks with potentially lots of audio files is not ideal: your packaged EPUB could end up being several GB of data if it's a longer book. I think it'd be advisable to design a system that allows us to resolve audio/media files stored outside the EPUB if so desired.
As far as I know, nothing prevents mixing local and remote resources in an RWPM, so the sync file could reference remote resources.
One other thought I had... I got a basic implementation working about a year back where I could highlight the word that was playing etc. One issue I ran into though was how to keep the timecode check in sync with the playing audio without utilizing too much CPU.
I'll start working on TTS next week and have more intel then, but:
- As far as I know, in SMIL there's no notion of word-by-word? It just highlights portions of the HTML with IDs. Is it something you want to add on top of the default Media Overlays rendering?
- I don't know for Kotlin, but on Swift you can get a callback when the TTS engine is about to speak a word. This is what you would use to highlight a word without relying on polling. EDIT: I found this similar API on Android: https://developer.android.com/reference/kotlin/android/speech/tts/UtteranceProgressListener#onRangeStart(kotlin.String,%20kotlin.Int,%20kotlin.Int,%20kotlin.Int)
@mickael-menu Fantastic stuff. Thanks for the prompt response.
Right, this is the goal. As RWPM is an exchange format that can represent any type of publication in a Readium app, the Sync file would be a JSON representation of SMIL or other types of synchronization formats.
This makes a lot more sense. Thanks for the clarification.
As far as I know, nothing prevents mixing local and remote resources in an RWPM, so the sync file could reference remote resources.
This is very good to know. I was actually not aware of that.
As far as I know, in SMIL there's no notion of word-by-word? It just highlights portions of the HTML with IDs. Is it something you want to add on top of the default Media Overlays rendering?
You are totally correct here. The SMIL spec just lets you define IDs that you'd like to highlight. In our case we have a proprietary parser that wraps every word in our ebooks with `<span>` elements.
I do wonder if there'd be a simple way to change up the overlay style from word to sentence to paragraph if our books contained the IDs for all three types and the timecode ranges. That'd be a feature I'd really like, but I've not spent enough time looking at the SMIL spec to see if that's currently supported.
I don't know for Kotlin, but on Swift you can get a callback when the TTS engine is about to speak a word. This is what you would use to highlight a word without relying on polling. EDIT: I found this similar API on Android: https://developer.android.com/reference/kotlin/android/speech/tts/UtteranceProgressListener#onRangeStart(kotlin.String,%20kotlin.Int,%20kotlin.Int,%20kotlin.Int)
This is a really interesting idea. I had not thought about that. I definitely see how this would be useful for TTS, but am not fully seeing how this would work with media overlays? It's too bad it doesn't take a timecode range or our jobs would be done. 😂
Really excited to see the TTS stuff.
Ha yes, I was focusing on TTS! Not sure you can do better in your case, unless you have the timestamp of each individual word in the audio file. Runtime word-by-word synchronization is out of scope for Readium though, in the context of synchronized narration.
I do have the timestamps of each word. 🤩 I'll send you an example section of one of our SMIL files as a reference in a bit. I'll dig one up.
Also just found the implementation on the JS side. Maybe we can glean some insight from it.
Also I know you are busy with other things so don't let me distract you. 😄 I'm kinda thinking out loud here.
Also see: https://github.com/readium/architecture/pull/181
Superseded by https://github.com/readium/webpub-manifest/pull/95 ?