architecture
architecture copied to clipboard
Expressing Locator ranges for highlights
In the context of building the Highlight Navigator API, we need a way to express a selection range with Locators
.
We can't use a single Locator (as the model is now) because:
- not all the fragment types can express a range,
- a range can span two documents (eg. FXL spreads) so we need two hrefs
Using two Locators
is not great because the selection text (LocatorText
) is duplicated and it's not obvious which one should be used.
A potential solution would be to have an additional LocatorRange
object that contains two Locators
with empty .text
and an additional "standalone" LocatorText
that represents the content of the range.
LocatorRange {
start: Locator
end: Locator
text: LocatorText?
}
The .text
is optional because not necessary when passing the Highlight
to the navigator, and for protected book it would be empty (to avoid bypassing the copy rights).
Any thoughts, alternative solutions?
The electron/desktop navigator currently "augments" the Locator data structure, not by extending it (strictly-speaking), but via composition:
LocatorExtended
composes Locator
and SelectionInfo
(as well as a few other useful properties). SelectionInfo
is a JSON-friendly serialization format for a DOM Range, i.e. a "start" and an "end" pair of positions within a document (possibly at the character / TextNode level).
The Locator
refers to the "start" position of the selection, in document order. In the current implementation, the resolution of the Locator
is limited by the usage of CSS Selectors (i.e. DOM Element), even though the selection itself affords character-level precision. I will harmonize this in a future revision, when the APIs stabilize.
A word about the "visual mapping" technique: right now, in order to "turn to the correct page / scroll to the corresponding offset" when a Locator
link is followed (e.g. bookmark, selection, TOC item, etc.), the navigator relies on "client rectangles" coordinates (which is more reliable than the bounding box approximation, or scroll offset, due to CSS box model intricacies). The "beginning" of a targeted DOM element can be located on one page (i.e. column), but its end might be on the next page, which may be out of view (e.g. clipped paragraph, spills beyond the visible viewport). With character-level positioning extracted from the SelectionInfo
, it is possible to use the same "client rectangles" coordinate system, in order to resolve the most accurate visual reading position. For example: the selected text might be a short fragment at the end of a large paragraph, so with CSS Selectors the resolved position and the user's expectation would likely be out-of-sync (bad, confusing UX).
Follow-up: because Locator
and SelectionInfo
do not correspond to the exact same document location, strictly-speaking there is no redundancy in the text
property. But there is obviously some degree of redundancy in the expression of the actual "reading location", due to the different resolutions.
I've seen the SelectionInfo
and RangeInfo
models, but they are really tied to HTML. We need to come up with a solution that would work for all publication formats in the API.
Ideally, we would have a way to express the RangeInfo
directly inside the Locator
, maybe through a custom fragment (eg. [cssSelector]/index/offset
).
Here is the relevant part in the doc where I mentioned this: https://docs.google.com/document/d/1W_kbSRve7c1ZtyYTZ-fbEOsFZPTqxN0IV7UdvH_h20U/edit#bookmark=id.jc195922qtyf
If we decide to implement a per-format custom model like SelectionInfo
, we need to specify it so that it's implemented the same way on each platform, otherwise the annotations won't be cross-platform.
In my opinion, Locator
already has everything needed, if we figure out a fragment format that contains the data from SelectionInfo
. But maybe I'm missing something, Daniel?
I agree with you :) I was just doing a recap / status report, in order to highlight the limitations of the current ad-hoc approach.
Regarding RangeInfo: although I am personally satisfied with the ad-hoc JSON-friendly serialisation format used in the TypeScript code of the R2 desktop SDK (navigator), I would like to hear more from EvidentPoint / Juan who have explored alternative DOMRange marshalling solutions in their GlueJS experiments. Perhaps there are certain pros/cons we should carefully examinate before committing to a particular solution.
Using two
Locators
is not great because the selection text (LocatorText
) is duplicated and it's not obvious which one should be used.
Some additional thoughts on that:
- text is not required in a Locator, which means that in use cases where we need two of them, we could simply avoid including text at all
- for large ranges, there are potentially issues that are more legal than technical (a large range could be seen as a way to extract text out of a publication)
The .text is optional because not necessary when passing the Highlight to the navigator, and for protected book it would be empty (to avoid bypassing the copy rights).
This raises a concern. We want Locators to be useful on their own, if they ever become orphans (publication is updated or no longer available). We need to find the right balance between having an empty text highlight and respecting the restrictions expressed by rights holders.
@HadrienGardeur
text is not required in a Locator, which means that in use cases where we need two of them, we could simply avoid including text at all
This sounds like the solution with the LocatorRange
object. Unless you mean that we don't pass the text selection at all, and just have a tuple (start: Locator, end: Locator)
? In which case, how does the host app get the selection text? We need it to present the highlights outside of the book, or in an annotation panel inside the book.
for large ranges, there are potentially issues that are more legal than technical (a large range could be seen as a way to extract text out of a publication)
Maybe a hard limit for the length of text.highlight
could be added. If we have the range and enough text context to be displayed in a list of highlights, then that's probably enough. Of course the highlight requires then the book to get its full content, and exporting the highlights in another form wouldn't work (eg. PDF export of the user notes).
And/or that could be part of the LCP app compliance tests that the user can't export or copy the text of a highlight extracted from an LCP protected book.
Fragments in the currently-proposed Locator model: https://github.com/readium/architecture/blob/master/locators/README.md#fragments
W3C web annotations selectors: https://www.w3.org/TR/annotation-model/#selectors
We know from implementation experience (currently in R2 SDK for desktop/Electron) that DOMRange can be reliably serialized into a text format / JSON-friendly data structure. This is a direct translation of DOMRange without the pitfalls of EPUB CFI (which can be seen as a useful interchange syntax, but not convenient / performant for the internals of R2 features). Basically, we cannot reference a DOMRange TextNode directly, so we encode the parent Element reference using a quasi-canonical / unique CSS Selector (as already used for Locators in R2 desktop/Electron), and we record the zero-based index of the child TextNode. Other than that, the data structure is exactly the same as the DOMRange object model (start, end, offset, etc.).
So, one option is to "flatten" this tuple notation into a single string so that it fits within the current Locator Fragment proposal (array of strings, reflecting different resolutions applicable to different media types). Personally, I would very much prefer to preserve the "exploded" representation of the range's start and end locations. Instead of having to guess the type of Locator Fragment using some kind of parser (e.g. regular expressions), we would ; like W3C annotations selectors ; identify the fragment types using a predefined vocabulary ... but then we might as well encode JSON property names directly, for example:
-
myLocator.locations.fragments.cfi = "..."
-
myLocator.locations.fragments.cssSelector = "..."
-
myLocator.locations.fragments.domRangePos = { cssSelector: "...", textNodeIndex: "...", offset: "..." }
... instead of:
-
myLocator.locations.fragments = [
-
{ type: "cfi", value: "..." }
-
{ type: "cssSelector", value: "..." }
-
{ type: "domRangePos", value: { cssSelector: "...", textNodeIndex: "...", offset: "..." } }
-
-
]
Thoughts?
Then, we indeed have the problem of myLocator.text[before|after|highlight]
, more specifically myLocator.text.highlight
which applies to a range, i.e. a tuple of two Locators, one of start, the other for end (if we chose to represent a DOMRange using two such Locators).
https://github.com/readium/architecture/blob/master/locators/README.md#the-locator-text-object
It seems that we got a consensus yesterday on the highlights model, but before we close this issue, I feel I must challenge one of its hypothesis: "a range can span two documents (eg. FXL spreads) so we need two hrefs".
The reasons why I'm challenging it is that if it was not a requirement, we could extend the expressivity of some fragment expressions and get ranges in a simpler way. Already, cfis, media fragments (especially time based, but also rectangles in a sense), html ids to a certain extent can represent a text range or an audiovisual segment. It would then suffice to extend css selection expressions and we would get ranges in a single Locator.
It is true that Acrobat allows selecting text across PDF pages (but for Acrobat these may not be different "resources". But I tried iBooks and it does not allow selecting text across resources. I tried Aldiko iOS and could not select text across CSS columns.
So, are we sure we are not creating a complex model with no real requirement?
I think for EPUB it might really not be needed, since content will rarely be split between two resources (maybe for some FXL layed out like a PDF with split paragraphs). But more importantly, I don't think we can make the native selection work properly (or at all) between two webviews or even two iframes. The only way to make this work would have to either build the selection from scratch with overlays or have a convoluted UX to merge two highlights.
For PDF we don't need it either since a single PDF document is one href
.
Personally I'd be more comfortable having a single Locator
object representing the full range, including the text
. But if we go down this road we still need a way to express the range in one single Locator
. Having a custom range fragment format would work, but we then lose the expressiveness of the other discrete location formats. Also, the Locations
object contains additional properties that are by nature for a discrete location: position
, progression
. In which case what does it represent? The beginning of the range, the center?
In this context, I suggest again my initial proposal which is to have Locator.locations
be an array of Locations
objects instead of a single one.
- If the array is empty, it's a Locator to the global resource (like currently having a null
locations
) - With one object, then it's a Locator to one discrete location
- With two objects, then it's a range for highlights
- With more objects, it represents a path that could be used for example to represent a drawing on a page
@mmenu-mantano I like this last proposal but in such a case, what to do with fragments which can already represent a range, like "t=1,10"? Do we state that the must not be used and that we consider only fragments which represent a single "point" (in space of time)?
And how do we justify that we don't take the same approach as Web Annotations selectors, in which a single fragment can represent a range?
Note that even a css id is mapped to a "segment", not a single point in the text. Same for xpointer, xywh, even page for pdf.
Yes usually range fragments can also express a discrete location. I think it makes sense to ignore ranges in a single Locations
object since the other properties (or other fragment formats) can't express a range.
Also, this problem is there too if we use a containing structure with two Locators
like the aforementioned LocatorRange
.
Maybe it doesn't answer all formats but if a fragment is mapping to a rectangle (eg. a PDF page) then the top-left (depending on reading progression I guess) is the discrete location in this case.
I'm not familiar with the Web Annotations selectors, but if they offer a better solution I'm all for it, all the current solutions feel awkward in the Locator
model.
If we ignore the case of a range expressed over two different resources, then there's no need to extend the Locator
model at all:
- in an audiobook, a range can be expressed using
t=10,45
- in a visual publication, a range can be expressed using a rectangle with
xywh=160,120,320,240
- in an EPUB, a range can be expressed using a CFI, an XPath or any other fragment capable of expressing a range on a tree structure
Also, the Locations object contains additional properties that are by nature for a discrete location: position, progression. In which case what does it represent? The beginning of the range, the center?
IMO this should be the beginning of the range, since that's where you jump to when you want to present a range.
I'm not familiar with the Web Annotations selectors, but if they offer a better solution I'm all for it, all the current solutions feel awkward in the Locator model.
The Web Annotation model is very flexible, IMO a little too much for its own good.
As I said in my previous comment, it defaults to a single "selector" (equivalent of our locations) using either: fragment, XPath, CSS or text.
If the range can't be easily defined that way, it uses a specific "range" selector. If we ported that to our model, it would look like that:
{
"href": "http://example.com/track6",
"type": "audio/ogg",
"title": "Chapter 5",
"locations": {
"progression": 0.607379,
"totalProgression": 0.50678,
"range": {
"start": {
"fragments": ["t=389.84"]
},
"end": {
"fragments": ["t=529.52"]
}
}
}
}
I think this is not necessary in our case and I'd much rather rely on a single location to express range.
That's fine with me, so we just need a range fragment for EPUB that we can use in R2 (CFI is out by consensus).
It looks like Daniel's RangeInfo
model fits the need, we just need a way to encode it as a fragment.
Do you have any opinion on Daniel's comment Hadrien? https://github.com/readium/architecture/issues/95#issuecomment-492747951
If we want to keep the semantics without changing too much the JSON model, we could also use the PDF fragment approach with a query string:
locations.fragment = "selector=div[3]&index=3&offset=25"
EPUB CFI is indeed fit for purpose when it comes to expressing DOM position and range, as a standardized interchange format across systems.
However, in my opinion (and based on implementation experience) generating and parsing CFI references is error-prone on the border-cases, and it is also woefully inneficient.
Right now in the R2 desktop/Electron implementation an ad-hoc serialisation format is used to reliably and cheaply marshal DOMRange objects. Basically, a CSS Selector is combined with a child index, and a character offset (this 3-tuple is used for both the start and end positions). This is as close as it can possibly get to a native DOMRange, resulting in extremely efficient processing.
It would be a shame to flatten this simple JSON-compatible data structure into a string "fragment" inside the Locator payload. I would much prefer transporting this information across the reading system layers (typically: app <--> navigator) in its raw form, rather than unnecessarily massaging it just to fit into a prescribed Locator format (which would require awkward escaping rules in order to cater for the fragment's own syntax, delimiters etc.).
That is not to say that there is no value in the flattened "fragments", which are indeed naturally amenable to URI encoding (e.g. like the elegant Media Fragment t=0,99 for time ranges), and therefore desirable for interchange purposes.
We can certainly produce interoperable position/range references a-la W3C selectors, whenever necessary (for example when such Locator payloads are exported into non-R2 systems, like an external annotation service). But for the subsystems involved in R2 reading systems (navigator, etc.) I believe that we should stay as close as possible to native web browser engine constructs like DOMRange and CSS Selector, which yield the best performance and readability (have you tried debugging CFI?). As for XPath, I guess you mean XPointer scheme which would offer similar capabilities to CFI, but also similar headaches.
Thoughts?
Do you have any opinion on Daniel's comment Hadrien? https://github.com/readium/architecture/issues/95#issuecomment-492747951
I think it's slightly better than the Web Annotation approach, but I'm not a big fan of having a controlled vocabulary for fragments. IMO fragments should be registered alongside a media type (which is usually the case) and we should be able to express potentially any fragment, not just a list of well-known fragments.
Keep in mind that I'm mostly discussing the core model and its JSON serialization, what Daniel suggested IMO makes perfect sense in the context of a specific implementation for example.
@danielweck Are you mentioning this for your internal Locator
model, or for the JSON serialization that will be transferred between platforms and stored in the host app database?
I think we should be designing the format of the fragment that will be stored in the Locator JSON here (https://github.com/readium/architecture/tree/master/locators). Each platform can then implement helpers around the Locator
model to have the locator expressed in a more convenient way (eg. as a DOMRange).
JSON serialization that will be transferred between platforms and stored in the host app database
I do not understand the nuance of this statement. But let's take platform "desktop / Electron" as an example (bearing in mind that the same reasoning applies to any coherent reading system platform, like an iOS or Android app, or even a web app involving its own client/server code). The key subsystems that exchange Locator data back-and-forth (as per the specification we are discussing) are the navigator and the host app. For example: the app receives information about the reading locations (then stores bookmarks either as native Locator + additional info, or as totally bespoke database format), and the app can subsequently instruct the navigator component to restore such reading locations by passing-in the standardized Locator payload. Same with selections/highlights, search module, etc. That is la "raison d'être" of such normalization effort.
The Locator object (herein discussed) transports this information across the navigator / app boundary in a standardized fashion, not only syntax-wise (typically in a serialised JSON form, due to technical constraints), but also in terms of the processing model (e.g. fragments that express different resolutions, precedence order). This works great for things like progression, position, CSS Selectors, time Media Fragments, which are quite consensual.
But here we are now discussing the more complex use-case of precise character-level document range, for which CFI was a semi-successful attempt to offer a reliable consistent model, and XPointer scheme never took off either.
In a platform-specific implementation context (e.g. the iOS, Android, Electron/desktop apps), why would a navigator ever bother creating and consuming CFI or other supplemental fragment types, when they have no concrete use for them? This is why in desktop/Electron we currently "extend" the Locator definition by means of composition (additional proprietary / ad-hoc data structure) in order to fill the gap where standard Locator is not satisfactory.
Now, for "interchange" purposes across different system implementations ; e.g. a streamer server and multiple platform-specific clients ; the Locator payload obviously need to be standardized, just like any other part of the architecture. What I am suggesting is that in the real world, implementations will not bear the burden (performance impact, debuggability, readibility, etc.) of creating processors and executing their code (for example, producing CFI fragments when the receiving end is known to ignore them completely) unless there is a need for generating and consuming the proposed multi-resolution/granularity, "flattened" Locator fragment syntax ... thereby favouring instead the extended / proprietary data structure.
Surely, we want to mitigate this? I can see the cons of incorporating a totally bespoke DOMRange structured serialisation format in the Locator model (which would be at odds with the dominant "linearized string" fragment approach), but I also want to make sure we examine the pros too. For me, the benefits in terms of performance, testability, readability etc. are not insignificant. I would hate to continue to use a totally different system alongside standard Locators (in the Electron/desktop implementation), when there is in fact an opportunity for us all to leverage the same structured syntax for selection/annotation ranges.
Do you understand my viewpoint? (sorry for the long-winded explanation)
@danielweck I think that we're having two separate discussions here:
- changing the fragment into an object (instead of a string currently)
- adding support for DOMRanges
IMO changing how fragments are represented has a lot of downsides and so far, only DOMRanges would benefit from this.
DOMRanges can already be added to a Locator
right now using a string and flattening the representation (selector=div[3]&index=3&offset=25
) or simply by using three strings (["selector=div[3]", "index=3", "offset=25"]
).
As an alternative, we can also define an extensibility model for locations
which would allow something like:
{
"href": "http://example.com/chapter1",
"type": "text/html",
"title": "Chapter 1",
"locations": {
"position": 4,
"progression": 0.03401,
"totalProgression": 0.01349,
"domrange": {
"selector": "div[3]",
"index": 3,
"offset": 25
}
}
}
[...] why would a navigator ever bother creating and consuming CFI or other supplemental fragment types, when they have no concrete use for them?
I'm not buying the argument that CFI is completely going away. It's used in EPUB.js and Vivliostyle Viewer, which both support RWPM. It's also the core fragment being used in Colibrio and remains an important focus for Evident Point.
I don't think it's going to be that uncommon for organizations to run a mix of Web Apps that are not primarily developed by Readium (yet compatible with RWPM) with mobile/desktop apps that are built by Readium.
Keeping the fragments as strings is really important in my opinion:
- A fragment identifier is supposed to be a string of characters
- We can use them with URI using #
- Having a JSON array with multiple possible types makes it much more complicated to parse on a more rigid language such as Swift
So adding a (standardized across R2 platforms) Locator extension for the DOM range could really be a solution. Especially since there's no standard fragment format to represent a DOM range and so this is likely a private Readium implementation.
But keeping everything in fragments
would make it easier in the JS layer to fallback on various fragment formats when available, to find the target location. We might not be interested in using CFI as our default implementation, but someone might want to add support to read annotations exported from other SDKs.
Note that on mobile we will most likely not forward the full JSON locator to the JS layer, since it will only be interested in the fragments and/or the DOM range.
I would hate to continue to use a totally different system alongside standard Locators (in the Electron/desktop implementation), when there is in fact an opportunity for us all to leverage the same structured syntax for selection/annotation ranges.
Definitely, especially to add additional data that should be encoded into the Locator
. However, wrapping a JSON locator into an object that adds helper to read the various data, and maybe build an actual DOMRange
in a provided document, might be helpful thing to answer the per-platform requirements. This is akin to the native Locator
model that we have on Swift/Kotlin. The line might be blurrier on a JS platform I guess.
We really want to close this issue tomorrow, do you have any additional comment @danielweck ?
I do actually :)
In your example:
{
"href": "http://example.com/chapter1",
"type": "text/html",
"title": "Chapter 1",
"locations": {
"position": 4,
"progression": 0.03401,
"totalProgression": 0.01349,
"domrange": {
"selector": "div[3]",
"index": 3,
"offset": 25
}
}
}
...I think we need to distinguish start
and end
:
{
"href": "http://example.com/chapter1",
"type": "text/html",
"title": "Chapter 1",
"locations": {
"position": 4,
"progression": 0.03401,
"totalProgression": 0.01349,
"domrange": {
"start": {
"selector": "div[3]",
"index": 3,
"offset": 25
},
"end": {
"selector": "div[4]",
"index": 2,
"offset": 11
}
}
}
}
Also, as the proposed locations.domrange
JSON property is a "first class citizen" on par with locations.totalProgression
/ locations.progression
/ locations.position
, does this mean that DOMRange information does not need to be serialized (and flattened) into the locations.fragments
array of string values? (see https://github.com/readium/architecture/blob/master/locators/README.md#fragments )
I filed a separate issue to express a more general concern about the concept of Locator "fragments": https://github.com/readium/architecture/issues/98
Also, as the proposed locations.domrange JSON property is a "first class citizen" on par with locations.totalProgression / locations.progression / locations.position, does this mean that DOMRange information does not need to be serialized (and flattened) into the locations.fragments array of string values? (see https://github.com/readium/architecture/blob/master/locators/README.md#fragments )
Actually, this is not my personal preference nor the only solution that I've listed in my previous comment.
Overall, I think that the current approach of an array of strings is perfectly fine and works for what you've requested.
That said, we could use an extensibility model for both the top level of a Locator and for locations
. The example with domrange
was only meant to illustrate how this could work if we had such extensibility available.
Overall, I'm not in favor of extending our current model with anything too specialized, I think the mix of position
, progression
, overallProgression
and fragments
is good enough:
- it's as media-generic as we can be
- it provides both context and the ability to express very specific locations
- there's built-in extensibility through fragments (that can be defined per media type)
We had something more specialized elements before (with dedicated elements for CFI and CSS Selectors among other things) and I think it would be a step back to go back in that direction.
Progress update based on our latest conference call: https://github.com/readium/architecture/issues/99
Here let's discuss the specifics of domRange
.
Full example:
{
"href": "http://example.com/chapter1",
"type": "text/html",
"title": "Chapter 1",
"locations": {
"progression": 0.8,
"fragments": [
"#paragraph-id",
"t=15"
],
"cssSelector": "body.rootClass div:nth-child(2) p#paragraph-id",
"partialCfi": "/4/2/8/6[paragraph-id]",
"domRange": {
"start": {
"cssSelector": "body.rootClass div:nth-child(2) p#paragraph-id",
"textNodeIndex": 0,
"offset": 11
},
"end": {
"cssSelector": "body.rootClass div:nth-child(2) p#paragraph-id",
"textNodeIndex": 0,
"offset": 22
}
}
}
}
Partial example (so we can focus on property naming, and meaning):
"domRange": {
"start": {
"cssSelector": "body.rootClass div:nth-child(2) p#paragraph-id",
"textNodeIndex": 0,
"offset": 11
},
"end": {
"cssSelector": "body.rootClass div:nth-child(2) p#paragraph-id",
"textNodeIndex": 0,
"offset": 22
}
}
locator.locations.domRange.start
and locator.locations.domRange.end
have the same object structure, which is made of 3 mandatory fields: cssSelector
, textNodeIndex
, textOffset
.
locator.locations.domRange.start
is mandatory and can be used on its own (i.e. collapsed range with same begin/end pointer). locator.locations.domRange.end
is optional.
-
cssSelector
(string): always references a DOM element, but if the original DOMRange start/end references a text node instead, thetextNodeIndex
field (described below) is used to complement the CSS Selector. -
textNodeIndex
(-1
, or integer[0, n]
): when-1
, the abovecssSelector
corresponds to the DOM element of the original DOMRange start/end. Otherwise,[0, n]
represents the zero-based index of the text node child (of the DOM element referenced by the abovecssSelector
). -
offset
(integer[0, n]
): if the original DOMRange start/end references a DOM element, then this zero-based index represents the position of a child text node of that element. If however the original DOMRange start/end references a text node, this represents the position of a character within that text node.
Note that in [0, n]
, n
is indeed included even though it represents the total number of children. That is because index overflow is allowed, and it signifies a pointer after the last item, but before the next sibling element starts (just like XPointer and CFI edge-cases ... literally "edge" cases).
Also note that sometimes, mobile iOS/Android selection UX always normalize the original DOMRange start/end so that it never references DOM elements (always text nodes). This normalization step can also be explicitly added in desktop implementations (I am working on that, actually, based on the work of Apache annotator).
See TypeScript prototype implementation which inspired this proposal: https://github.com/readium/r2-navigator-js/blob/7a2ddc8287659d6d985eb930b776e108891a3510/src/electron/common/selection.ts#L14-L33