Synchronized Narration
This is a first draft of a spec related to Synchronized Narrations, a JSON equivalent of EPUB's Media Overlays.
I feel that there's an opportunity to generalize the synchronization to any media type.
The spec mentions that we could extend it later by adding more media type-specific properties:
In case we decide to extend the structure to image and video, using `image` and `video` would be consistent with the latest work of the W3C CG.
But why not make it media-type agnostic instead? This could support all these use cases from the start:
- Small illustrations or sign-language videos explaining words or utterances in a text.
- Audio narration over a comic book.
- Subtitles over video-based publications.
Even text-on-text synchronization could open interesting possibilities:
- Synchronizing a publication and its translation, useful for:
  - Displaying the two versions side-by-side, to practice learning a language.
  - Displaying an accurate translation of a paragraph, when reading an ebook in a foreign language.
- Synchronizing a publication and a commentary, for example to display the notes side by side or in a margin.
  - I'm talking about "published author commentary", not user annotations. Think classical texts annotated with explanations for studies.
If we go down that road, "Synchronized Media" might be more accurate than "Narration".
I think for this we just need to:
- rename `text` to `source`, `master` or `primary`
- rename `audio` to `secondary` or something else
- rename `narration`
- add a way to specify the media type of the two resources, to know how to interpret the fragments
I'm also in favor of having full hrefs in the `narration` items, because this makes it possible to have several secondary resources for a single reading order item. This would be important for use cases like the illustrations or sign language videos.
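To make this more concrete, here's a rough, purely illustrative sketch of what a media-type agnostic structure with full hrefs could look like; the `sync`, `primary` and `secondary` names are placeholders, nothing is settled, and media types could be attached to each href (omitted here for brevity):

{
  "sync": [
    {
      "primary": "text/chapter1.html#par-1",
      "secondary": [
        "audio/chapter1.mp3#t=0,12.5",
        "video/signs/chapter1-par-1.mp4"
      ]
    },
    {
      "primary": "text/chapter1.html#par-2",
      "secondary": [
        "audio/chapter1.mp3#t=12.5,31"
      ]
    }
  ]
}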
"Synchronized Rendition" might be more accurate than "Narration"
"Sync Media" is probably a better choice than "Sync Rendition". The former is already the chosen term for the W3C draft which succeeds "Sync Narration", the latter has been in use since EPUB3 "Multiple Renditions" which has different semantics.
See:
https://w3c.github.io/sync-media-pub/sync-media.html#media-objects
https://w3c.github.io/epub-specs/epub33/multi-rend/
We've had a lot of discussions in the past about media overlays and we really need to dig back in there and read things before we merge any new document at this point.
The following documents are currently in our architecture repo:
- https://github.com/readium/architecture/blob/master/models/media-overlay/README.md
- https://github.com/readium/architecture/blob/master/models/media-overlay/syntax.md
There are also many issues that mention media overlays: https://github.com/readium/architecture/issues?q=is%3Aissue+media+overlay
But why not make it media-type agnostic instead? This could support all these use cases from the start
I've already made a similar proposal back in 2019, based on our RWPM Link Object, see: https://github.com/readium/architecture/issues/88
I believe we could find a middle ground between this proposal (where a Synchronized Media document is essentially based on our core model for RWPM) and something more specialized (similar to what we have in our architecture repo or this PR).
I always considered `alternate` to be an alternative rendition of a particular resource. The Synchronized Narration object seems more like an augmentation of the resource. But maybe I'm wrong on the semantics of `alternate`?
You're completely right that `alternate` and Synchronized Media/Media Overlays are very closely related to one another.
There's one major difference between the two of them:
- `alternate` in `readingOrder` or `resources` is limited to resource-level alternates
- Synchronized Media/Media Overlays operate at a fragment (or sub-resource) level
There are a few other places where we also work with fragments:
- guided navigation in Divina: https://readium.org/webpub-manifest/profiles/divina.html#4-guided-navigation
- table of contents: https://readium.org/webpub-manifest/#6-table-of-contents
- `pageList`, `loi`, `loa`, `lov` and `lot` collections in EPUB: https://readium.org/webpub-manifest/profiles/epub.html#3-collection-roles
One could argue that for a Divina with guided navigation, Synchronized Media would not be useful, as you could express the same type of information purely with `alternate`.
This is one of the reasons why we need to move beyond the EPUB point of view on this and think about a more generic approach that applies to all fragments across all media.
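To illustrate that argument with a purely hypothetical example, a guided navigation entry could already point to an audio fragment for the same panel through `alternate` (this is illustrative, not a settled proposal):

{
  "guided": [
    {
      "href": "http://example.org/page1.jpeg#xywh=0,0,300,200",
      "title": "Panel 1",
      "alternate": [
        {
          "href": "http://example.org/page1.mp3#t=11,25",
          "type": "audio/mpeg"
        }
      ]
    }
  ]
}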
There are several items to settle, which we can tackle in this order I guess:
- do we suppress the proposed `textRef` and `audioRef`, which have drawbacks described in previous comments?
- do we express a notion of "primary" (singular) and "secondary" (plural) resources, which opens the path to text-to-text mapping? if yes, how?
- do we use simplified link objects, with `href` and `type`, instead of `text` + `audio` properties, and `children` instead of sub-narration? (see the sketch after this list)
- in this case, what happens to structural semantics, i.e. the `role` property?
- do we move from `alternate` to a more specific property in the resource referencing the syncnarr structure?
- do we replace "sync narration" by "sync media"?
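To illustrate the third point above, here's a rough, purely hypothetical sketch of an entry built from simplified link objects and `children`; the `links` name and the `role` values are placeholders, not a proposal:

{
  "role": "paragraph",
  "children": [
    {
      "role": "sentence",
      "links": [
        { "href": "text/chapter1.html#sent-1", "type": "text/html" },
        { "href": "audio/chapter1.mp3#t=0,2.5", "type": "audio/mpeg" }
      ]
    },
    {
      "role": "sentence",
      "links": [
        { "href": "text/chapter1.html#sent-2", "type": "text/html" },
        { "href": "audio/chapter1.mp3#t=2.5,7.1", "type": "audio/mpeg" }
      ]
    }
  ]
}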
This issue might be relevant and shows there's a need for synchronization besides text-to-audio https://github.com/w3c/publishingcg/issues/20
Hi,
This issue might be relevant and shows there's a need for synchronization besides text-to-audio w3c/publishingcg#20
I'm the author of that issue and this discussion https://github.com/readium/webpub-manifest/discussions/74 about
Maybe we could make Synchronized Narration for comics and magazines like this:
{
"imageRef": "images/chapter1.jpeg",
"audioRef": "audio/chapter1.mp3",
"narration": [
{
"image": "#xywh=percent:5,5,15,15",
"audio": "#t=0.0,1.2"
},
{
"image": "#xywh=percent:20,20,25,25",
"audio": "#t=1.2,3.4"
},
{
"image": "#xywh=percent:5,45,30,30",
"audio": "#t=3.4,5.6"
}
]
}
Or should DiViNa's guided navigation be extended with an audio property?
Something like:
"guided": [
{
"href": "http://example.org/page1.jpeg",
"audio": "http://example.org/page1.mp3#t=0,11",
"title": "Page 1"
},
{
"href": "http://example.org/page1.jpeg#xywh=0,0,300,200",
"audio": "http://example.org/page1.mp3#t=11,25",
"title": "Panel 1"
},
{
"href": "http://example.org/page1.jpeg#xywh=300,200,310,200",
"audio": "http://example.org/page1.mp3#t=25,102",
"title": "Panel 2"
}
]
I don't like the name `audio`, but couldn't come up with something better.
I'm also a bit worried it is too verbose, generating larger-than-needed JSON files.
@m-abs Did you read this https://github.com/readium/webpub-manifest/pull/83#issuecomment-853111694? We are considering making the JSON media-type agnostic to also work with images. The draft spec is not up to date with this yet.
Sorry, I must have missed it last night.
- rename `text` to `source`, `master` or `primary`
- rename `audio` to `secondary` or something else
Could there be a use case where one would need more than just the two sources?
Maybe replace `text` and `audio` with `links`, an array of Link objects or `href` strings?
It could be mixing two texts as suggested and background music, or a comic book frame + speech bubble text + background music, or a way to implement #49.
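As a purely illustrative sketch of such a `links`-based item, mixing a comic book frame, its speech bubble text and background music (all property names and paths here are hypothetical):

{
  "links": [
    { "href": "images/page1.jpeg#xywh=percent:5,5,15,15", "type": "image/jpeg" },
    { "href": "text/page1.html#bubble-1", "type": "text/html" },
    { "href": "audio/background.mp3#t=0,12", "type": "audio/mpeg" }
  ]
}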
I'm also in favor of having full hrefs in the `narration` items, because this makes it possible to have several secondary resources for a single reading order item. This would be important for use cases like the illustrations or sign language videos.
I'm worried this could result in very large JSON files for large books with many sentences/paragraphs/etc.
(This is a problem we've had to deal with in our old app with our private format.)
@m-abs What was your approach regarding resources? A single mapping document for the whole publication or one per resource?
A single document for the whole publication.
We made a dump of the structure of DAISY 2.02 audiobooks (both with and without text) to our private JSON format in a single file.
This JSON contains a resource map from the local path/URI to the server path.
Some of these JSON files became very big and caused problems with our client app and for our users.
Could there be a use case where one would need more than just the two sources? Maybe replace `text` and `audio` with `links`, an array of Link objects or `href` strings?
I guess that would be fine, to have either a single link or a link array in `secondary`, as long as we have only a single primary/leading resource.
I'm worried this could result in very large JSON files for large books with many sentences/paragraphs/etc. (This is a problem we've had to deal with in our old app with our private format.)
By full HREFs I meant having the path of the resource relative to the self link, not necessarily a full URL. For example:
#xywh=percent:5,5,15,15 -> images/one.png#xywh=percent:5,5,15,15
Just for clarity's sake since I'm late to the party here... what is the sync file? We currently use .smil at my org for our ebooks. If we define some other format, I wonder if we could make some sort of easily swapped interface so that you could easily interpret other synchronization formats.
Also on a more technical note, and correct me if this is the wrong place to discuss this, but I wonder about the best way to resolve connected media such as audio, since packaging ebooks with potentially lots of audio files is not ideal: your packaged EPUB could end up being several GB of data if it's a longer book. I think it'd be advisable to design a system that allows us to resolve audio/media files stored outside the EPUB if so desired. It would also be useful in our use case to have the ability to stream audio instead of playing downloaded media. Just some thoughts for potentially making the implementation more flexible for these types of use cases.
One other thought I had... I got a basic implementation working about a year back where I could highlight the word that was playing etc. One issue I ran into though was how to keep the timecode check in sync with the playing audio without utilizing too much CPU. The few options I considered were:
- Just have a function get triggered, say, every 300 ms or so and check if the current word that's highlighted is still in the range of the media overlay timecode; if so do nothing, if not find the next one in the range. This solution is okay but can feel a bit laggy at times, especially if you are doing a word-by-word highlight. Obviously you can turn down the delay, but then you start eating more CPU.
- The second option I thought of was doing a postDelayed with a runnable, where the delay is always equal to the word duration, so you get called back when the current word should no longer be highlighted. This avoids using a lot of CPU since you are not adding any additional overhead that the message loop is not already incurring. The issue with this implementation is that there is a time drift that happens using this method because of the intrinsic delay of message handling and executing functions. After using this for about 30 seconds or maybe a minute you definitely notice the synchronization getting off. So ideally you could use the above method but have some sort of calculation to see how much internal timing error there is and adjust for that.
Any thoughts on the above notes?
Just for clarity's sake since I'm late to the party here... what is the sync file? We currently use .smil at my org for our ebooks. If we define some other format, I wonder if we could make some sort of easily swapped interface so that you could easily interpret other synchronization formats.
Right, this is the goal. As RWPM is an exchange format that can represent any type of publication in a Readium app, the Sync file would be a JSON representation of SMIL or other types of synchronization formats.
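For example, a SMIL `<par>` pairing a text fragment with an audio clip (`clipBegin`/`clipEnd`) could map to an entry like the ones in the current draft. This is only a sketch with made-up fragment identifiers, and the exact property names are still being discussed in this thread:

{
  "textRef": "text/chapter1.html",
  "audioRef": "audio/chapter1.mp3",
  "narration": [
    { "text": "#para-1", "audio": "#t=0,5.2" },
    { "text": "#para-2", "audio": "#t=5.2,12.8" }
  ]
}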
Also on a more technical note, and correct me if this is the wrong place to discuss this, but I wonder about the best way to resolve connected media such as audio, since packaging ebooks with potentially lots of audio files is not ideal: your packaged EPUB could end up being several GB of data if it's a longer book. I think it'd be advisable to design a system that allows us to resolve audio/media files stored outside the EPUB if so desired.
As far as I know, nothing prevents mixing local and remote resources in an RWPM, so the sync file could reference remote resources.
One other thought I had... I got a basic implementation working about a year back where I could highlight the word that was playing etc. One issue I ran into though was how to keep the timecode check in sync with the playing audio without utilizing too much CPU.
I'll start working on TTS next week and have more intel then, but:
- As far as I know, in SMIL there's no notion of word-by-word? It just highlights portions of the HTML with IDs. Is it something you want to add on top of the default Media Overlays rendering?
- I don't know for Kotlin, but on Swift you can get a callback when the TTS engine is about to speak a word. This is what you would use to highlight a word without relying on polling. EDIT: I found this similar API on Android: https://developer.android.com/reference/kotlin/android/speech/tts/UtteranceProgressListener#onRangeStart(kotlin.String,%20kotlin.Int,%20kotlin.Int,%20kotlin.Int)
@mickael-menu Fantastic stuff. Thanks for the prompt response.
Right, this is the goal. As RWPM is an exchange format that can represent any type of publication in a Readium app, the Sync file would be a JSON representation of SMIL or other types of synchronization formats.
This makes a lot more sense. Thanks for the clarification.
As far as I know, nothing prevents mixing local and remote resources in an RWPM, so the sync file could reference remote resources.
This is very good to know. I was actually not aware of that.
As far as I know, in SMIL there's no notion of word-by-word? It just highlights portions of the HTML with IDs. Is it something you want to add on top of the default Media Overlays rendering?
You are totally correct here. The SMIL spec just lets you define IDs that you'd like to highlight. In our case we have a proprietary parser that wraps every word in our ebooks with `<span>` elements.
I do wonder if there'd be a simple way to change up the overlay style from word to sentence to paragraph if our books contained the IDs for all three types and the timecode ranges. That'd be a feature I'd really like, but I've not spent enough time looking at the SMIL spec to see if that's currently supported.
I don't know for Kotlin, but on Swift you can get a callback when the TTS engine is about to speak a word. This is what you would use to highlight a word without relying on polling. EDIT: I found this similar API on Android: https://developer.android.com/reference/kotlin/android/speech/tts/UtteranceProgressListener#onRangeStart(kotlin.String,%20kotlin.Int,%20kotlin.Int,%20kotlin.Int)
This is a really interesting idea. I had not thought about that. I definitely see how this would be useful for TTS, but am not fully seeing how this would work with media overlays? It's too bad it doesn't take a timecode range or our jobs would be done. 😂
Really excited to see the TTS stuff.
Ha yes, I was focusing on TTS! Not sure you can do better in your case, unless you have the timestamp of each individual word in the audio file. Runtime word-by-word synchronization is out of scope for Readium though, in the context of synchronized narration.
I do have the timestamps of each word. 🤩 I'll send you an example section of one of our SMIL files as a reference in a bit. I'll dig one up.
Also just found the implementation on the JS side. Maybe we can glean some insight from it.
Also I know you are busy with other things so don't let me distract you. 😄 I'm kinda thinking out loud here.
Also see: https://github.com/readium/architecture/pull/181
Superseded by https://github.com/readium/webpub-manifest/pull/95 ?