
Adopt the HTML syntax in addition to the XML syntax for content documents

Open bduga opened this issue 7 months ago • 36 comments

Section

No response

Describe the problem

EPUB 3.3 and earlier require the use of the XML serialization of HTML (AKA the XML syntax, formerly known as XHTML). HTML no longer provides new feature drops to the XML syntax and explicitly warns against its use. There are already some issues with scripting and XML, and some potential limitations with upcoming extended accessibility techniques. In addition, most off-the-shelf tools produce the HTML syntax, and most developers are familiar with it. To roughly quote a working group member, "Finding out EPUB is based on XHTML is like finding out Social Security runs on COBOL".

Describe the fix or new feature you propose

Add the HTML syntax as an option for content creators, and define requirements around its support for Reading Systems, distributors, aggregators, etc. while explicitly retaining the XML syntax as a valid format and mandating its continued support.

bduga avatar Apr 24 '25 20:04 bduga

XML was never intended as a passing trend; it was designed with universal, long-term principles in mind.

Like a Rolls-Royce, it was built for enduring reliability — structured, neutrally extensible, and deeply internationalized.

Unlike HTML, where any input yields a parseable document but future parsing results are not guaranteed to remain the same, XML enforces strict well-formedness and guarantees that well-formed documents will always produce the same structure — ensuring the stability and trust essential for the long-term preservation, authenticity, and interoperability of digital publications.
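
As a rough illustration of the contrast drawn here, the following Python sketch feeds the same malformed markup to a strict XML parser and to the lenient tokenizer in Python's standard library. Note that `html.parser` is only a tokenizer and does not implement the tree-construction algorithm of the HTML Standard, so it stands in for an HTML parser only loosely. The sample string and the helper names are made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: tags are opened but never closed.
MALFORMED = "<p>Unclosed <b>bold<p>Next paragraph"


def xml_is_well_formed(markup: str) -> bool:
    """Return True only if the markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Record every start tag the lenient tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def html_start_tags(markup: str) -> list:
    """Tokenize the markup leniently and return the start tags seen."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags
```

With these helpers, `xml_is_well_formed(MALFORMED)` is false because the XML parser rejects the input outright, while `html_start_tags(MALFORMED)` still reports the `p`, `b`, and `p` start tags.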

Some may argue that, as a surviving member of the XML Working Group, my views are outdated. However, the strict requirement of well-formedness in XML — and the guarantee that parsing results will remain consistent for all time — are not relics of the past. They are vital strengths that safeguard the long-term viability of digital publications, which are invaluable assets to publishers.

Stability, predictability, and fidelity across decades are not luxuries for publications — they are necessities.

murata2makoto avatar Apr 26 '25 21:04 murata2makoto

Unlike HTML, where any input yields a parseable document but future parsing results are not guaranteed to remain the same,

I do not think that is a fair statement. The principle of backward compatibility is a strong requirement for browsers, as well as for the HTML standard, too. However, the Web community pretty much voted with their feet at the time, with a large percentage of websites that were not well-formed. Browsers had to choose between refusing to display such pages or coming up with something feasible for erroneous pages as well. HTML(5) was born of the necessity to make the results of HTML parsing identical across browsers and, in fact, the HTML5 standard is pretty much a reverse engineering text for the browser implementation at the time. And, frankly, seeing the Web pages out there, the results are impressive (I am not talking about differences in CSS implementations, availability or not of various APIs... those are orthogonal to markup).

I do not think HTML is less predictable, or prone to change in the future, than XML, just more difficult to formalize (the reverse engineering style of the HTML standard makes it very difficult to read indeed) and maybe more difficult to check. But those problems have been solved, and the gain is that producers of Web pages are a bit less constrained than using XHTML.

iherman avatar Apr 27 '25 09:04 iherman

We should try to avoid turning this into a discussion of the merits of each syntax. Each has their own strengths and weaknesses and publishers will decide for themselves which makes the most sense to produce for both their near- and long-term objectives.

By adding support we're not taking a side in that debate but only acknowledging that there's an eventuality where html processing may be the only way to get certain features to work.

We can prepare for that now by moving to support the html syntax or we can wait until our hand is forced to figure out what to do.

mattgarrish avatar Apr 27 '25 12:04 mattgarrish

Each has their own strengths and weaknesses and publishers will decide for themselves which makes the most sense to produce for both their near- and long-term objectives.

But this issue is not formulated as such. It is biased.

The Journal Article Tag Suite (JATS) is an XML dialect used for representing scientific literature published online. STS, derived from JATS, is an XML dialect designed for representing ISO standards. The Question and Test Interoperability specification (QTI) is an XML dialect used for representing assessment content and results. These three XML-based languages are widely adopted by those who prioritize information assets. Similarly, people who use EPUB for commercial publications do not treat HTML as the representation format, even though EPUB readers may use HTML for rendering purposes.

murata2makoto avatar Apr 29 '25 15:04 murata2makoto

Each has their own strengths and weaknesses and publishers will decide for themselves which makes the most sense to produce for both their near- and long-term objectives.

But this issue is not formulated as such. It is biased.

I fail to see how adding a choice can be biased. Restricting to a single choice among two definitely shows bias, but allowing all (both) options seems the exact opposite. While the bug does list some reasons authors may prefer the HTML syntax, that is simply to explain why the choice needs to be added. Sometimes one syntax will be the better choice, sometimes the other syntax will be the better choice.

bduga avatar Apr 29 '25 15:04 bduga

But this issue is not formulated as such. It is biased.

Even if that were true, no one is going to decide whether to use the xml or html syntax based on this issue's problem description.

The EPUB 3 specification itself isn't going to make claims about which syntax to use. Absent a critical flaw that would make xhtml unworkable, which obviously doesn't exist right now, it's not our place to direct publishers to one or the other syntax.

Allowing the html syntax will be a neutral addition if we ultimately go this direction.

mattgarrish avatar Apr 29 '25 16:04 mattgarrish

And to further clarify, EPUB 3.3 already allows HTML therefore any constraints placed generally on HTML (not on the syntax) also already apply to EPUB documents. Whether tools exist to fully identify deviations from the existing standards is another question not related to adopting the HTML syntax. It is also worth pointing out that, while epubcheck is a useful tool with a close relationship to this WG, it is not actually a product of this working group. Questions or issues with epubcheck should be taken up with the maintainers of that tool.

bduga avatar Apr 29 '25 19:04 bduga

To be clear, I am not opposed to adding the HTML syntax for those who do not prioritize information assets, yet still bother to use EPUB for some reason.

murata2makoto avatar Apr 29 '25 23:04 murata2makoto

I am not opposed to adding the HTML syntax for those who do not prioritize information assets

To answer that seriously, it has always been a knock against epub that we cater to the needs of big publishers with xhtml over everyone else. The html syntax has the ability to make life a little easier for self-publishers who don't have the same goals for long-term preservation.

It's speculative whether adding the html syntax will broaden the use of epub beyond traditional publishing, but we'll never answer the question of whether there's a base of web/html users who might find epub useful until we do.

mattgarrish avatar Apr 30 '25 13:04 mattgarrish

(Admin note: there is a very relevant discussion on the WG's mailing list, started by Eric Hellman (@eshellman) on Fri, 9 May 2025 10:07:54 -0400, https://lists.w3.org/Archives/Public/public-pm-wg/2025May/0008.html. For an easier reference and having all discussions in one place, the thread has been copied here.)

Start of the discussion by @eshellman

I'd like to comment on allowing html syntax in EPUB 3.4.

While I understand the technical reasons behind the change, and agree with them for the most part, allowing html syntax in EPUB 3.* would be a terrible marketing decision. Because of this, Project Gutenberg would not implement 3.4 and I would counsel other organizations I advise not to implement any support for it. We have touted our implementation of EPUB3 for over 75,000 titles despite its inconsistent implementation in reading systems. When something doesn't work it causes support issues for us. We continue to produce EPUB2 files because certain strongly desired functionalities don't work in systems that claim to support EPUB3. (I'm looking at you, ADE.) It's clear that there will be reading systems that just won't work with HTML syntax, and users of those systems will have no way to know if the files they acquire will work with the systems they use. Even if we were to produce EPUB 3.4 files with XML syntax, we would struggle to communicate that to a user who has experienced failures with other EPUB3 files. Those failures would be black marks against the EPUB label or "brand".

Has anyone articulated a benefit to end users for this change?

By contrast, a distinguishing label like "EPUB+" for even this technically modest change would encourage adoption by reading systems developers, and by their customers. For distributors like us, it would allow us to easily communicate a modernization step without a lot of work on the backend.

Answers in the thread

The answers may be followed on the mailing list archives, and have been reproduced here for an easier reference:

Matt, @mattgarrish Fri, 9 May 2025 10:35:01 -0400

By contrast, a distinguishing label like "EPUB+" for even this technically modest change would encourage adoption by reading systems developers, and by their customers.

Could you clarify what you mean by this? EPUB 3 is a continuous line with all files being identified by version="3.0" in the package document. There would be no reliable way to differentiate an "EPUB 3.3" file from an "EPUB 3.4" if they both use the XML syntax.

Are you asking for a version number change to make an "EPUB+", which would put this out of scope for this revision, or are you just wanting us to find a way to differentiate EPUB 3 files with the HTML syntax from EPUB 3 files with the XML syntax without changing the version? (I think the latter could be done, for example, using a dc:format tag with a required identifier.)

Eric, @eshellman Fri, 9 May 2025 11:58:08 -0400

I'm asking for more consideration of end users who would have no way of knowing whether EPUB3 files will work on the reading systems they use. These are not people who know what XML syntax is!

I would love to have an EPUBish format that packaged HTML files, but don't give it a label that creates angry end users!

Matt, @mattgarrish Fri, 9 May 2025 13:16:28 -0400

Sure, that’s a fair point. I’d prefer to hear everyone’s opinions on this so I don’t think it’s productive to argue for or against any of the feedback we get. I was just curious from that comment if there was a technical solution within EPUB 3 that you envisaged so that users could be alerted to the type of EPUB 3 publication that they were getting or if you are arguing for a brand new format of EPUB.

Eric, @eshellman Fri, 9 May 2025 13:28:51 -0400

More the former rather than the latter, but mostly I'm arguing for a new label. Sometimes an author comes up with a great title first, and the story then just writes itself.

Laurent, @llemeurfr, Fri, 9 May 2025 19:44:19 +0200

I understand what Eric means. Without a strong flag in the "pure HTML" EPUBs (especially if not well-formed), some (many?) reading systems will open such a publication but will display it badly.

A good question is: Do EPUB reading systems in the wild test the OPF package/manifest/item/@media-type before accepting a publication? If yes, the presence of "text/html" will be a strong warning that there is something special here. If not, I see no possible warning as long as we keep package/@version = a static "3.0".
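
A minimal sketch of the kind of check described above, in Python, assuming that HTML-syntax content documents would be listed in the manifest with the `text/html` media type (no specification text defines this yet). The function name and the sample package document are hypothetical.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"


def html_syntax_items(opf_xml: str) -> list:
    """Return the hrefs of manifest items declared as text/html."""
    root = ET.fromstring(opf_xml)
    items = root.findall(f"{{{OPF_NS}}}manifest/{{{OPF_NS}}}item")
    return [
        item.get("href", "")
        for item in items
        if item.get("media-type") == "text/html"
    ]


# Hypothetical package document mixing both syntaxes.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="c1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
    <item id="c2" href="chapter2.html" media-type="text/html"/>
  </manifest>
</package>"""
```

A reading system that runs such a check before opening a publication could refuse the file or warn the user when the returned list is not empty.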

Matt, @mattgarrish Fri, 9 May 2025 19:13:42 -0400

If not, I see no possible warning as long as we keep package/@version = a static "3.0".

For reading systems with no awareness of the changes and for users that have obtained the publication without it being identified as not your regular EPUB 3, sure. A new version is the only universal way that a reading system will know it can’t handle what it’s getting.

But we’re not necessarily out of options for flagging the content for distributors or reading systems simply because the version has to stay the same. Like I mentioned earlier, we could require dc:format with a designator for html-containing publications. Epubcheck would easily be able to check the media types and ensure the flag is set, for example.
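
To make the idea concrete, here is a hedged Python sketch of a validator-style rule: if any manifest item uses the `text/html` media type, a `dc:format` element carrying a designator must be present. The designator value `html-syntax` is invented for this example, and nothing here reflects how epubcheck is actually implemented.

```python
import xml.etree.ElementTree as ET

NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}
HTML_FLAG = "html-syntax"  # hypothetical designator, not defined by any spec


def missing_html_flag(opf_xml: str) -> bool:
    """Return True when HTML-syntax items exist but the flag is absent."""
    root = ET.fromstring(opf_xml)
    uses_html = any(
        item.get("media-type") == "text/html"
        for item in root.findall("opf:manifest/opf:item", NS)
    )
    flagged = any(
        (fmt.text or "").strip() == HTML_FLAG
        for fmt in root.findall("opf:metadata/dc:format", NS)
    )
    return uses_html and not flagged


def sample_opf(with_flag: bool) -> str:
    """Build a tiny, hypothetical package document for demonstration."""
    flag = f"<dc:format>{HTML_FLAG}</dc:format>" if with_flag else ""
    return (
        '<package xmlns="http://www.idpf.org/2007/opf" '
        'xmlns:dc="http://purl.org/dc/elements/1.1/" version="3.0">'
        f"<metadata>{flag}</metadata>"
        '<manifest><item id="c1" href="c1.html" media-type="text/html"/></manifest>'
        "</package>"
    )
```

Here `missing_html_flag(sample_opf(False))` reports a problem, while `missing_html_flag(sample_opf(True))` does not.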

We could also add a new attribute to the package element to flag these publications. That might be a bit more controversial approach, but it seems unlikely that a new attribute will cause any issue processing package documents (we had similar worries about adding the collection element and it went ignored by everyone with no use for it, which was kind of everyone in the end sadly).

Anyway, that’s just a couple of quick ideas that came to mind. There may be other ways of handling it, too, like in the container.xml file. But this is useful to consider more if we go ahead with the change.

Ivan, @iherman, Sat, 10 May 2025 08:48:39 +0200

Looking at the thread, isn't it correct that the same set of arguments may be valid for any technical additions to EPUB 3.3? Say, if we reinforce the usage of WebVTT, add new features for Webtoons, or add heic as a valid image format? We are concentrating on HTML here, but it is not in a unique situation (though probably the most prominent). In other words, any technical addition to EPUB 3.3 might end up at the same place.

Ideally, we would communicate the changes with the official version number, but that we cannot. Calling this EPUB 4 might be disastrous as well, insofar as no publishers would even look at it (that was the discussion we had on the version numbering which led to the restriction in the charter). That is a dead end.

I jump on the idea of Matt about the usage of dc:format (or equivalent) in the package document, which I like. dc:format[1] is fairly open, in the sense that it can also be a string. What if we define some sort of mini-syntax for that value, which lists some "features" with pre-defined names? Authors MAY list the extra features they use in a publication; reading systems may look at this, and they may warn the reader if the publication relies on a feature they do not implement. Such a property may also be useful for older, but less implemented features of EPUB 3.3 (e.g., whether javascript may be used in a publication or not).
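
One possible shape for such a mini-syntax, sketched in Python purely for discussion: the `features:` prefix and the feature names (`html-syntax`, `webvtt`) are invented here and are not part of any specification.

```python
FEATURE_PREFIX = "features:"  # hypothetical prefix for the mini-syntax


def parse_features(dc_format_value: str) -> set:
    """Parse a value such as 'features: html-syntax webvtt' into a set of names.

    Values that do not start with the prefix yield an empty set.
    """
    value = dc_format_value.strip()
    if not value.startswith(FEATURE_PREFIX):
        return set()
    return set(value[len(FEATURE_PREFIX):].split())


def unsupported_features(dc_format_value: str, supported) -> set:
    """Return the listed features that a reading system does not implement."""
    return parse_features(dc_format_value) - set(supported)
```

A reading system that implements only `webvtt` would get `{"html-syntax"}` back for the value `features: html-syntax webvtt` and could warn the reader before opening the publication.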

[1] https://www.dublincore.org/resources/userguide/publishing_metadata/#dc:format

Matt, @mattgarrish Sat, 10 May 2025 07:23:48 -0400

I’d argue that adding html is much more fundamental than the changes you’ve cited, which is why it may need special attention. There’s no graceful degradation for an html epub except to reproduce everything in html in xhtml, which defeats the purpose of using the other syntax. We don’t want epub to become the carnival show of content where you get to step right up and see if anything displays at all after giving away your money.

It strikes me as a kind of reversal of previous discussions we’ve had about profiling reading systems so that users can figure out the capabilities. But that’s shifting a technical burden onto users that they don’t want to know or care about. The vast majority of users have no idea what an epub is under the hood and probably little more knowledge of what their reading system supports. Telling them they’re about to download an “EPUB 3+” will probably grab more attention, as Eric says, than listing what the publication needs for support when you go to purchase, or having to warn about all the features that probably won’t work before they can fully experience the book.

But I promised not to argue, and this is an issue we probably shouldn’t try to solve at this stage, and not by email. It would be good if we can get this logged in the issue tracker so it stays on the radar.

Laurent, @llemeurfr, Sat, 10 May 2025 14:32:40 +0200

This discussion helps define questions to ask reading system developers. Reading systems are certainly the hardest part of the chain for managing evolutions because there are so many of them and they have historically not been the part the IDPF was considering most (RS guidelines are a new thing).

The real issue is about « old » RS, badly written and quasi unmaintained, which won’t look at any flag we may specify (mime type or dc:format). And the issue is not only for html support, but for any evolution where graceful degradation is not historically supported by all (?) RS.

Either we consider that some of them are obsolete and cannot claim to support EPUB3 anymore, or EPUB3 is frozen, apart from cosmetic additions.

Ivan, @iherman, Sun, 11 May 2025 09:33:47 +0200

I’d argue that adding html is much more fundamental than the changes you’ve cited, which is why it may need special attention. There’s no graceful degradation for an html epub except to reproduce everything in html in xhtml, which defeats the purpose of using the other syntax. We don’t want epub to become the carnival show of content where you get to step right up and see if anything displays at all after giving away your money.

I would still argue that, to take one example, using heic as an image format is very close. There is no graceful degradation for that either, unless the author provides several image formats for the same image, which defeats the purpose, as you say. Yes, HTML is more "dramatic" as a change, but with the same set of arguments, we shouldn't have accepted the WebP format in EPUB 3.3… (I am not arguing for HEIC, just using it as an example).

Dale, @dalerrogers, Sun, 11 May 2025 19:00:40 +0000

I see my role here as advocating for content creators, especially the next generation. As a web designer, I went to the W3C to understand how to prepare my documents.

No one is learning XHTML. It is a discontinued language. HTML is living and being expanded. It is becoming more accessible over time. EPUB should reflect that change. I know that browsers can handle both versions of markup. I’m hoping that EPUB reading systems will be able to stay current, and future proof, as well.

Matt, @mattgarrish Sun, 11 May 2025 18:20:21 -0400

using heic as an image format is very close. There is no graceful degradation for that either

Yes, but this is also why foreign resources are rarely used in EPUBs and why we rarely turn them into core media types. The ones we have, like webp, have been part of the browser cores for some time before we added them, and even then I’d be surprised if publishers use them a lot. (Plus what has it been, like three added in fifteen years?)

But your epub won’t generally crash a reading system if it has a new image type, unlike what html will do to reading systems built around processing xml documents. There are other things you can do for images than provide another format, too, like add alt text. Users are still going to complain and want their money back, but at least something is there. That’s what I meant by graceful degradation. Users will likely get some functionality (or limited functionality) in these other cases.

And it’s not like running the latest reading system is a choice. Depending on your device and operating system, you may not have access to updated versions. It’s also unrealistic to expect users to switch reading systems because we changed the format. In a retail world, you can’t be cavalier about breaking content.

It’s not an argument against html itself, but only why I think this change has greater ramifications than any of the CMT changes we’ve made to date, and why I’d prioritize identifying html epubs over worrying about the minor features that are going to break. I’d expect the big publishers are testing their content and not going to go near any features that don’t consistently work. We don’t have to worry about them all.

iherman avatar May 12 '25 10:05 iherman

To follow up on my last comment in the email thread, I'd like to revive a proposal made a long time ago, that I almost see in Matt's email from Sat 10th: specifying EPUB profiles and have them appear to users before they buy an ebook.

If we look at the issue from the side of reading systems, there are two main classes:

  • Reading systems that support EPUB 2 and a limited set of EPUB 3 features (some of them do not support EPUB 3 FXL); they are usually not based on browser engines, and their development is almost frozen.
  • Reading systems that evolve with the EPUB 3 spec; they are based on modern browser engines and have active developers.

Let's imagine that publications are tagged with a profile (e.g. "EPUB basic" vs "EPUB plus") conveyed both in ONIX and in the EPUB package itself (dc:format?). Let's also imagine that reading systems market themselves as supporting one profile or the other, and that retailers do the same for the publications they provide (based on the ONIX information). We would then get what we need: EPUB 3 keeps evolving progressively, and users get an indication before they acquire an ebook.

llemeurfr avatar May 12 '25 11:05 llemeurfr

Trying to detect an EPUB 3.3 file on ingestion to conditionally deal with new markup has proven to be challenging. Anything we can (successfully) do to enable detection of a specific EPUB version number, while maintaining "3.0" in the OPF, and not requiring complete parsing of the entire file, would be a significant win.
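
For context, a hedged Python sketch of what ingestion can already learn without parsing content documents: it reads only `META-INF/container.xml` and the package document from the ZIP container. The function names and the sample archive are hypothetical, and the sample is not a complete, valid EPUB.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"
OPF_NS = "http://www.idpf.org/2007/opf"


def peek_package(epub_file):
    """Return (package version, manifest media types) from two small files."""
    with zipfile.ZipFile(epub_file) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find(
            f"{{{CONTAINER_NS}}}rootfiles/{{{CONTAINER_NS}}}rootfile"
        )
        if rootfile is None:
            raise ValueError("container.xml lists no rootfile")
        opf = ET.fromstring(zf.read(rootfile.get("full-path")))
    version = opf.get("version", "")
    media_types = {
        item.get("media-type", "") for item in opf.iter(f"{{{OPF_NS}}}item")
    }
    return version, media_types


def build_sample_epub():
    """Build an in-memory archive for demonstration (not a valid EPUB)."""
    container = (
        f'<container version="1.0" xmlns="{CONTAINER_NS}"><rootfiles>'
        '<rootfile full-path="EPUB/package.opf" '
        'media-type="application/oebps-package+xml"/>'
        "</rootfiles></container>"
    )
    opf = (
        f'<package xmlns="{OPF_NS}" version="3.0"><manifest>'
        '<item id="c1" href="c1.xhtml" media-type="application/xhtml+xml"/>'
        '<item id="c2" href="c2.html" media-type="text/html"/>'
        "</manifest></package>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("META-INF/container.xml", container)
        zf.writestr("EPUB/package.opf", opf)
    buf.seek(0)
    return buf
```

The version value stays "3.0" for every EPUB 3 revision, which is exactly the limitation described above. The media types in the manifest are the only signal this approach gets without opening the content documents.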

rickj avatar May 12 '25 13:05 rickj

In general, nothing gets adopted unless users ask for it.

We can learn from the history of adoption and non-adoption of EPUB3. 14 years after its launch, EPUB2 is still hanging around! As the person largely responsible for the implementation of EPUB3 at Project Gutenberg (75,000+ titles), I can confidently say that the prime motivation for doing so was the accessibility capability of EPUB3. Nobody asked for any of the other features of EPUB3.

Who is going to ask for HTML in EPUB3? And more importantly, how are they going to ask for it?? That's why I suggest to give the HTML-in-EPUB format a name that people can ask for, otherwise adoption will not occur for 14 years.

eshellman avatar May 12 '25 13:05 eshellman

In general, nothing gets adopted unless users ask for it.

I am not sure I agree with that. Users typically don't ask for internal format changes, but EPUB has historically gone through several (HTML -> HTML with CSS -> modularization -> XHTML -> XML syntax of HTML). One of the first things I did on ebooks back in the '90s was to convert Gutenberg volumes to HTML (back then they were all plain text). The resulting tag soup HTML with no styling is very different from what is required today, but for an end user there wasn't much difference visually. Obviously it was less accessible, but the accessibility features I used (font size changes and brightness) are fairly similar, and the Dracula I read back then is pretty much indistinguishable from the one I can read today, aside from physical display improvements. I am not sure what feature I would have requested that would move us on from that initial implementation.

Who is going to ask for HTML in EPUB3? And more importantly, how are they going to ask for it?? That's why I suggest to give the HTML-in-EPUB format a name that people can ask for, otherwise adoption will not occur for 14 years.

I think the answer is not going to be end users. Of course, those aren't the only users of the spec, so if we broaden the definition of user, then the likely answer will be (some) content creators, perhaps with some reading system developers. Use of XML already limits the use of some JS libraries, so requiring it can be problematic for people pushing boundaries. It is likely not a huge issue for medium to large publishers, conversion houses, or even larger non profits, but there are certainly indie authors out there who discover that their existing knowledge or tooling is not compatible with EPUB creation. And while we can certainly wait to make this change (the XML syntax won't disappear tomorrow), the longer we wait, the more entrenched XML becomes - time will not make this transition easier.

Speaking as an ex-RS implementor, I don't actually care about the HTML syntax, and find the entire transition to be busy work with no immediate gain - I would still show the same books the same way. In fact, many RS implementations already display the content as if it were the HTML syntax, but likely depend on it being well-formed XML at some point. But if that was the criterion we used, we would still be on unstyled HTML 1, which also doesn't sound great to me. At some point we will need to move on.

bduga avatar May 12 '25 20:05 bduga

I see my role in this discussion as advocating for the content creator. I realize the burden it puts on existing systems and groups. I also know that EPUB needs to evolve just as HTML and CSS are evolving. They are becoming more accessible. They are becoming more responsive. They are becoming more semantic. I've always seen the role of the W3C as defining the way things should work for the good of advancing the technology.

New designers will not come to the table with skills in XHTML. They will come to the table with skills in HTML [5]. It makes me wonder what this discussion is about. If RSs already parse HTML, then are we talking about the limitations of the validators? Coders look to the standard to understand how to create an EPUB document so that it validates. Where is the bottleneck in moving forward?

dalerrogers avatar May 13 '25 00:05 dalerrogers

In general, nothing gets adopted unless users ask for it.

Hi @eshellman, from our experience, users tend to ask for an ebook that causes them no headache with the reading system they have at their disposal. EPUB2 has been hanging around because ebook distributors and reading systems in the wild were not ready for something better, and I agree that no user has ever asked for EPUB3. No user will ask for some "EPUB+"; most don't know what EPUB is. This HTML move is an evolution pushed by technical constraints. And also a few authors (we have requests from Brazil).

llemeurfr avatar May 13 '25 08:05 llemeurfr

users tend to ask for an ebook that causes them no headache with the reading system they have at their disposal

And this is the problem with trying to shoehorn the html syntax into epub 3 15 years later. We can do it, of course, but when will any vendor (or publisher) feel safe taking in and selling EPUBs that are going to fail in various ways on existing reading systems? Have we really compressed the timeline for adoption over doing an EPUB 4?

We're adding another "feature" that authors are going to learn the hard way doesn't work universally. It's a bad look for the format.

mattgarrish avatar May 13 '25 11:05 mattgarrish

I was the sole opponent of EPUB 3.1, which saw no adoption whatsoever. I made the effort to oppose it strongly because I believed it might be adopted and cause real problems. This time, I do not strongly oppose the addition of HTML syntax, as in my view, no existing EPUB users make use of it.

murata2makoto avatar May 13 '25 23:05 murata2makoto

@eshellman and all: We will discuss this HTML issue in the WG call tomorrow (15 May), so please join the call if it is convenient for you.

shiestyle avatar May 14 '25 13:05 shiestyle

This was discussed during the pmwg meeting on 15 May 2025.


HTML in EPUB - w3c/epub-specs#2715

wendyreid: We were discussing adding HTML to EPUB online and via email, let's continue the discussion here

duga: I don't know about adding a flag about HTML in the file
… I'm not sure what this would do, it is something we've tried before

George: with Rick's suggestion about HTML vs XHTML, adding a flag in the OPF would avoid that

Charles: It would be nice from a consumer point of view — you'd like to know if a book would work on your reading system

wendyreid: it would be nice from a vendor perspective to differentiate books, like fxl vs reflowable,
… it might be hard to communicate to a consumer the difference between XHTML and HTML
… we could say something like "might not work on some reading systems"

duga: agree it might be difficult to exchange, and a flag would be redundant with the mime types in the manifest
… publishers may not be accurate about the flag, they haven't been accurate with similar things before

CharlesL: The cost of us adding the flag is negligible
… a reading system could then make a patch that could utilize it to tell users if a book would work on their system

<gluejar> "quick patch" = ~3 years, I think

CharlesL: plus a flag could be used to get data about how many publishers are using HTML

gautierchomel: It's not that the publisher will use this information,
… what about a file with mixed HTML and XHTML content?

wendyreid: publishing moves slowly, if we release this change tomorrow, we won't see HTML epubs in the pipeline
… there will be some people who try it right away, some people are already doing this internally
… it will take time for us to see this, but that's good
… we will see people and platforms experiment with it
… what does HTML offer that XHTML doesn't?
… this change is more about longevity

mgarrish: We don't need to get into the solutions right now
… it is important to find out what the issues are across the board
… vendors who don't want HTML will simply not accept the content
… vendors will be the barrier, this will take a long time
… we have problems that already exist, I'd like to go back and fix those
… it may take time, 10 + years, before the tools and all people adapt for HTML

eric_hellman: I maintain the software that turns HTML files into EPUBs at Project Gutenberg
… we regenerate our files every month
… we are able to do this for a lot of files without much intervention
… we have changed our preferred format from XHTML to HTML5
… my software deprecates that to EPUB3 and EPUB2
… we give our books away for free and have no control over the platforms
… there is a demand for EPUB2; it works better
… the difficulty is, we have millions of users and thousands of different reading devices
… we are conservative about what we allow
… we don't want HTML5 in the EPUBs because we will get problem reports
… we have no staff to deal with this, so we strip out new things when we make the EPUB files
… we control the distribution and production, but no control over the reading platforms
… things that should work across devices but don't, are a source of problems
… for instance ADE doesn't respect CSS pagebreak in EPUB 3, but did for EPUB2
… we have to watch out for any potential problems
… if people are getting errors with files using HTML in the EPUB, we will stop using it
… I would love to be able to put HTML5 in an EPUB
… my concern is the rate at which reading systems will be able to render the HTML files
… a lot of people experienced with the XHTML files made mistakes producing HTML5 files
… I fear that a lot of y'all may underestimate the failures that occur when you try to render HTML5 with something used to seeing HTML4
… I was surprised by the failure modes we found looking over our 75k+ files and all that the reading systems do
… this group should think about how to promote adoption of new features
… especially features that could ruin the experience for the reader
… just having a flag will not be enough
… I'd like to see an HTML5 only format that could be the bright new thing
… with appropriate branding
… and some level of javascript, sound rendering
… a new brand could be promoted by the reading systems, "we can now do this"
… not doing the branding has led to slow adoption of the features in EPUB3
… if you really want to have new features adopted, you need to give them some gold stars they can use to sell their reading systems

wendyreid: This is incredibly valuable information, we've been talking about EPUB3 adoption
… what we've found is that EPUB3 is well adopted
… but we don't have any examples except ADE
… because ADE is no longer being developed
… you have lots of information that would be really helpful for us
… can that be shared? It would answer questions for us

CharlesL: Thinking 10 years down the line, if reading systems by then only support HTML, they can use a flag to identify XHTML files that won't work anymore

DaleRogers: If I'm creating an ebook, and I want it to be in KDP, I need to go to their docs to find out what they support
… as a content creator I always have to start with the output and design my files to be compatible with the output
… as part of this group I put on a different hat
… I hear what people are saying about some reading systems being up to speed and some not
… do we always have to check what standard a reading system supports?

ivan: CSS features update almost once a week, we expect reading systems to follow this
… some do and some don't, and then the rendering can go wrong
… particularly a problem for accessibility
… HTML5 is a similar problem
… in EPUB3.3 we worked on a testing suite for the purpose of the W3C process
… this suite was used to test reading systems
… the results are publicly available
… if we put more care into this testing suite it could become useful beyond the W3C process
… rendering systems may or may not choose to share the results of the testing suite

shiestyle: As a publisher we do not provide HTML based EPUB
… since we have to provide encrypted epubs to some vendors we can control versions
… it might be safer for us to expand EPUB generation to HTML for our complicated Japanese titles

duga: I develop reading systems, I would think it was bullsh*t to have to go through every line of code
… where I made decisions about whether a file is renderable
… and I'll be right where I started rendering Ebooks after all that work
… we let other people determine our formats
… and they don't care about XHTML
… at some point, someone will delete the XHTML syntax
… and we will be left with a lot of content built on XHTML

wendyreid: a special challenge with EPUB is that we are reliant on standards built with a different industry than ours
… that industry moves a lot faster than ours
… there is an argument we could make that we need to fence off a version that addresses our specific needs
… we are trapped, when tools and browsers evolve, we won't be able to respond
… lots of platforms are on the web or use browser engines
… it is possible that the XHTML spec could be deleted and EPUBs will break
… this is a scary change, but important for the long term usability of the format
… testing is a good thing to raise
… I have a test book with HTML files, I will upload it so reading systems can use it
… we also have the survey and we can use that to find challenges we aren't seeing

<wendyreid> See: survey text proposal.

GeorgeL: it is clear to me from surveys that people who are complaining are using EPUB2,
… it is time for people to update their systems to EPUB3
… and get them to use something better than ADE, which is bad for accessibility

AveneeshSingh: EPUB is a business format
… release 3.4 with traditional tech, and 3.4+ with HTML and see how it is adopted
… the browser people aren't listening to us, we need to take a step and allow the world to evolve

wendyreid: we could talk about this forever, we don't have the information we need
… we don't know what the challenges and the issues will be

<wendyreid> https://github.com/w3c/pm-wg/pull/21/files

wendyreid: we can't move forward until we know more
… everyone should read over the survey and get feedback from the industry
… until we have that information we will keep circling on the topic


iherman avatar May 15 '25 14:05 iherman

I have the impression that there is a consensus on the fact that, eventually, we will be forced to allow non-XML HTML content documents, and that the later we do that the more difficult it might become. I.e., in my view, we should do it now.

I also have the impression of an emerging consensus that we also have to find a way to "announce" all this to the community. We discussed OPF metadata, EPUB 4, Profiles, etc. All these are fundamentally the same: a suitable way for a RS to say that it accepts a book with a particular feature, and a way for a publication to announce upfront (possibly without scanning the book) that it relies on a feature. We also have to communicate with epubcheck whether non-XML HTML is acceptable or not in a specific context.
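As a rough illustration of the "without scanning the book" part, a tool could look only at the package document manifest. The Python sketch below assumes that HTML-syntax content documents would be declared with the `text/html` media type; that is an assumption made for the example, not something the WG has decided, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

# Assumption for this sketch: HTML-syntax documents are declared as text/html.
HTML_SYNTAX_TYPE = "text/html"
XML_SYNTAX_TYPE = "application/xhtml+xml"


def content_document_types(opf_xml: str) -> dict:
    """Map each manifest href to its media type, for content documents only."""
    root = ET.fromstring(opf_xml)
    result = {}
    for item in root.findall("opf:manifest/opf:item", OPF_NS):
        media_type = item.get("media-type")
        if media_type in (HTML_SYNTAX_TYPE, XML_SYNTAX_TYPE):
            result[item.get("href")] = media_type
    return result


def uses_html_syntax(opf_xml: str) -> bool:
    """True if at least one manifest item is declared in the HTML syntax."""
    return HTML_SYNTAX_TYPE in content_document_types(opf_xml).values()


# Illustrative package document, trimmed to the manifest.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="c1" href="chapter1.html" media-type="text/html"/>
    <item id="css" href="style.css" media-type="text/css"/>
  </manifest>
</package>"""

print(uses_html_syntax(SAMPLE_OPF))
```

A check of this kind only tells a distributor or RS what the publication declares; it says nothing about whether the declared files actually conform to either syntax.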

We have a somewhat similar problem with every new feature that is added to EPUB 3.3, albeit non-XML HTML is clearly the most complex change because there is no real fall-back. Adding Digital Comics features, reinforcing WebVTT usage possibly replacing, eventually, SMIL, adding standard annotations, adding rules on dark mode, etc.: they all bear problems in practice.

I go back to the "profiling" idea of @llemeurfr but I propose to reformulate it slightly to make it more palatable to the community at large. What if we define, beyond EPUB 3.4, an "intermediate" version called EPUB 3.3.1? The various features would be distributed over 3.3.1 and 3.4 along the line of complexity and difficulty to cope with. As a first approximation, I would say:

  • EPUB 3.3.1 : EPUB 3.3 plus DC features, WebVTT, and dark mode rules
  • EPUB 3.4 : EPUB 3.3.1 plus standard annotations and non-XML HTML

This may be a layering that the community would understand. All these are EPUB 3 (with version attribute unchanged), they are all backward compatible, but easier to communicate with. We could of course say we define 3.4 and 3.5 instead, but 3.3.1 would really concentrate on relatively minor additions (compared to EPUB 3.4) and using a third level version number would reinforce that.

I am not sure what it means editorially, i.e., whether we would publish a separate recommendation for EPUB 3.3.1 and one for EPUB 3.4; I would prefer not to do that. We can publish a document with a modified title and make the differentiation within the text. But that is a discussion for later. (Somewhat reminiscent of the conformance levels of WCAG.)

iherman avatar May 21 '25 11:05 iherman

Laurent's idea (profiling) is the thing that "the later we do that the more difficult it might become." In my view, we should start now. And Ivan seems to have missed the point that labels like "EPUB Plus" and "EPUB Basic" are good marketing both to developers of reading systems and to end users. Good labels encourage adoption. Dot-dot releases, by contrast, scream "don't need to pay attention" to users, but also communicate improvements in solidity and stability. That usually means no new features.

eshellman avatar May 21 '25 14:05 eshellman

What I find interesting is that we keep hearing that reading systems aren't actually processing xhtml content documents using an xml parser but (many, most, all?) are already using an html parser in their web view.

If that's the case, then do the warnings about xml syntax support matter -- what breaks in the web view if the content is already treated as the html syntax? It sounds like what breaks are primarily the infrastructure around ingestion, distribution, and loading into reading systems.

I still think profiling of reading systems is a dead end because reading systems probably aren't likely to be aware of what they don't support and who is going to announce all their incompatibilities to users if they are? It's like when we were discussing having publishers report non-conformance to accessibility standards. They're just not going to say anything if they don't comply. Users might notice a format name change but that's as far as I'd assume any technical knowledge on their part goes.

But I'm still waiting to hear what other feedback we get before getting too far into the weeds of how we do this.

mattgarrish avatar May 22 '25 12:05 mattgarrish

In Thorium, EPUB3 XHTML5 documents are handled "natively" by Chromium webviews from XML source (not from HTML5 markup), as dictated by the combination of .xhtml file extension and HTTP content-type header (application/xhtml+xml). By "handled" I mean parsed to an internal DOM representation which is still tainted by XML / non-HTML traits such as namespaces. By "handled" I also mean the rendering stages which involve CSS selector matching (again, XML namespaces are processed specifically, for example the epub:type attribute). Thorium also parses EPUB3 XHTML5 resources before feeding them into webviews, in order to inject reading-system styles and scripts. Similarly, Thorium parses the EPUB3 XHTML5 Navigation Document outside of a webview rendering context altogether (i.e. just to extract the navigation structures that will then be displayed in the application GUI).

Consequently, enabling support for HTML in Thorium will require identifying all the code paths where XML is currently expected, such as where XML parsers are used, or where some Javascript or CSS logic is tailored to process XML.
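To give a feel for what one such code path looks like, here is a minimal Python sketch (not Thorium code; the function and class names are hypothetical) of a pre-processing step that has to pick a parser from the declared media type. The `text/html` branch is an assumption about how HTML-syntax documents would be declared, and Python's `html.parser` is only a tolerant tokenizer, not an implementation of the WHATWG tree-construction rules.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XHTML_NS = "http://www.w3.org/1999/xhtml"


class TitleExtractor(HTMLParser):
    """Collects the text inside <title> using the tolerant stdlib tokenizer."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(source: str, media_type: str) -> str:
    """Return the document title, choosing the parser by media type."""
    if media_type == "application/xhtml+xml":
        # XML path: namespace-aware, raises ET.ParseError if not well-formed.
        root = ET.fromstring(source)
        node = root.find(f".//{{{XHTML_NS}}}title")
        return (node.text or "") if node is not None else ""
    if media_type == "text/html":
        # HTML path (assumed media type): no namespaces, no well-formedness check.
        parser = TitleExtractor()
        parser.feed(source)
        parser.close()
        return parser.title
    raise ValueError(f"unsupported media type: {media_type}")
```

Every place that currently takes only the first branch (style and script injection, navigation extraction, and so on) is a place that would need the second one.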

danielweck avatar May 23 '25 10:05 danielweck

I have seen the failure of strict/frame/transitional of HTML as well as that of strict/transitional of OOXML. I am thus skeptical about profiling.

The rest of this comment was written by Google Gemini.

That's a perfectly valid and insightful skepticism. Having witnessed the struggles of HTML's strict/transitional/frameset DTDs and the effective non-adoption of OOXML's Strict profile, it's natural to question the practical efficacy of profiling in document standards. Your observation is astute: while theoretically appealing, actual adoption often falls short.


Why Profiling Often Fails in Practice

Your skepticism stems from real-world outcomes that highlight key challenges in standardizing with multiple profiles:

  1. The Overwhelming Force of Backward Compatibility: This is arguably the biggest factor. When a dominant technology or software vendor (like Microsoft with Office, or browsers with HTML) has a massive installed base of existing content, ensuring backward compatibility becomes paramount. Creating a "strict" profile that breaks this compatibility, even for theoretical purity, is a hard sell. Users and developers will almost always gravitate toward what "just works" with their existing files and workflows.
  2. Vendor Inertia and Default Behavior: The default settings of popular software heavily influence adoption. If a major software vendor continues to output the "transitional" or less strict profile by default, as seen with Office and OOXML, then the "strict" profile, no matter how well-designed, remains a niche or academic exercise.
  3. Complexity for Developers and Users: Requiring developers to explicitly choose and manage different profiles adds complexity. It means more testing, more conditional logic, and potentially more fragmentation in the ecosystem. For end-users, the distinction is often meaningless; they just want their document to open and look correct.
  4. "Good Enough" Often Wins: In many practical applications, the benefits of strict compliance (e.g., perfect future proofing, absolute semantic clarity) are outweighed by the immediate need for functionality and broad compatibility. The "transitional" or less constrained version often proves "good enough" for the vast majority of use cases.
  5. Ambiguity of Purpose: Sometimes, the reason for profiling isn't clear enough, or it tries to serve too many masters. Is it for migration? For specific niche applications? For future-proofing? If the value proposition for the stricter profile isn't overwhelmingly clear and immediately beneficial to a large user base, it struggles.

When Profiling Does Make Sense (Like PDF/A)

Despite these failures, PDF/A's success in its specific niche offers a counterpoint, suggesting that profiling isn't inherently flawed, but its applicability is limited to certain contexts:

  • Specific, High-Stakes Use Cases: PDF/A thrives because it serves a very clear, non-negotiable need: long-term archival and legal preservation. The trade-offs (loss of interactivity, larger file sizes) are acceptable because the primary goal (guaranteed future rendering and authenticity) is so critical in fields like government, law, and finance.
  • External Mandates/Regulations: Often, the adoption of a strict profile like PDF/A is driven by legal or regulatory requirements. When there's a compliance mandate, organizations are compelled to adopt the stricter standard, even if it adds friction.
  • Controlled Environments: PDF/A is often used in more controlled environments (e.g., content management systems, digital archives) where the documents are processed, validated, and stored according to strict protocols, rather than being informally shared among general users.

Rethinking Profiling: Lessons Learned

Your skepticism is well-founded and provides valuable lessons for future standard design:

  • Prioritize a Single, Robust Core: Instead of fragmenting a standard into multiple, often overlapping profiles, it's often better to focus on defining a single, robust core that meets the vast majority of practical needs. This core should be flexible enough to evolve but strict enough to ensure interoperability.
  • Extensions Over Profiles: If specific, niche functionality is required that would break the core's simplicity, consider defining these as extensions rather than separate profiles. Extensions can be opted into as needed without fragmenting the fundamental standard.
  • Real-World Utility: Any proposed "strict" or specialized profile must demonstrate a clear, compelling, and widespread practical benefit that outweighs the adoption hurdles. If the benefit is primarily theoretical or only applies to a tiny fraction of users, it's unlikely to succeed.
  • Evolutionary Paths, Not Parallel Universes: Standards should offer clear evolutionary paths that encourage migration to newer, better practices, rather than creating parallel "universes" of documents that are technically distinct but semantically identical. HTML5, by consolidating DTDs and focusing on practical parsing rules, learned this lesson well.

Your observation is a powerful reminder that in the world of technical standards, the "best" or most "pure" solution on paper doesn't always win. Usability, backward compatibility, and practical adoption by dominant players often dictate the real-world standard.

P.S. This text was generated through discussions with an AI assistant to help me organize my thoughts. Particularly, as a non-native speaker, the AI's assistance has been invaluable in enabling me to express complex concepts and nuances more clearly and effectively in English. Given my extensive involvement in debates like "P is a P is a P" and EPUB 3.1 and the maintenance of OOXML as the convenor, I believe my experience with the practical challenges of profiling is particularly relevant here. The points and direction presented reflect my own long-held observations and experiences. I welcome all your comments and look forward to a lively discussion.

murata2makoto avatar May 25 '25 21:05 murata2makoto

As one of the developers of Sigil, a free, open source epub editor, I think that fracturing epub3 into two groups: those that support xhtml parsing rules and serializations, vs those that only work on html, is probably the worst decision this group could ever make.

It will kill any further adoption of epub3 and provide yet again more ammunition for publishers to move away from the epub format. Both publishers and authors want one thing, and one thing alone, a single very very stable format that reaches the biggest installed base of e-readers.

Have you actually read the whatwg parsing rules for modern living html? It is a twisted nightmare of special cases, adoption algorithms, corner cases, and spaghetti logic.

That is why the only people who implement them are browser engine developers. The remainder of the epub production industry uses xml parsers for their simplicity, accuracy, speed, and portability.

And as for content development, at Sigil we encourage authors and epub developers to write their book in Word (with consistent style rules) or InDesign or LibreOffice, and save it as html. Then we use a gumbo based html parser to parse that soup into the dom tree but then serialize it back into xhtml. All of this can happen on the fly as the html file is being imported. That way, inside Sigil, normal xml based tools and regex search and replace tools can be used, javascript added, images worked for size or wrapped in svg, etc. Then we encourage them to make their epubs as backwards compatible with epub2 as possible, by auto creating an NCX and opf guide out of the nav, using intelligent css rules designed to handle fallback, etc. so that it can still be read and enjoyed by older epub e-readers but still look good on nice epub3 readers like Thorium.

Please do not allow the epub3 spec to fracture any more by allowing a division of readers that do or do not handle html. It will kill any further adoption of epub3. If you feel you must go that way, push it completely out of the epub3 spec and create an epub4 spec. Then sit back and watch people ignore your epub4 spec because once again you would have ignored backwards compatibility with epub2 and 3 for no truly good reason.

kevinhendricks avatar Jun 04 '25 02:06 kevinhendricks

@kevinhendricks, I understand the arguments, and they will be discussed further on the issue list, I am sure. At this moment, I have only two comments, just to make the argumentations precise on all sides.

First:

[…] I think that fracturing epub3 into two groups: those that support xhtml parsing rules and serializations, vs those that only work on html, […]

The WG does not intend to allow reading systems "that only work on html". The direction the WG is taking is to allow authors to produce content documents in HTML that is not necessarily serialized in XML. In other words, if an author decides to publish in XHTML, the RS must be able to render it just as it does today. There is no intention to fracture into two groups.

Also:

Have you actually read the whatwg parsing rules for modern living html? It is a twisted nightmare of special cases, adoption algorithms, corner cases, and spaghetti logic.

I am not in the business to defend or to criticize the living HTML parsing rules. However, consider what the EPUB 3.3 specification says (in §6.1.2):

MUST be an [html] document that conforms to the XML syntax.

With the normative references leading to the official HTML specification (both for the rendering and for the XML syntax). In other words, conformant EPUB 3.3 readers are already required to use the aforementioned parsing rules. (This has been the case at least since EPUB 3.01, see §2.1 of EPUB 3.01.) The current spec just restricts the syntax of the content document to XML; the current proposal is to lift this restriction.

(Yes, the term XHTML in the EPUB spec text is actually a misnomer.)

iherman avatar Jun 04 '25 07:06 iherman

As soon as you allow html code to be used in what is called an epub3, you have fractured the e-reader installed base for that format. How can you not see that?

In addition you will have also hugely disrupted the use of xml parsers used in pre-production, search and replace, editing, etc. There is no stability, no backwards compatibility, etc. Actual book text content can still be generated by authors in word, or libreoffice (xml) or html and converted automatically. So "authoring content" is in no way limited by keeping xhtml parsing rules.

And please understand the Chromium engine used for current e-reading devices is set to currently accept xhtml serialization and content is "served" that way even if locally delivered. It is parsed that way and xhtml parsing errors are strictly detected and even shown at the top of the file when displayed. So internal changes to how content is delivered would be needed in established e-reading systems. And no matter the e-reader, updates to the engine used in the software means e-reading devices would need firmware updates, something quite error prone and costly to do.

If you want to play around with allowing html as a final format for epub content, please do NOT add that to epub3.x spec. It is an e-reader and user fracturing change that belongs in a new spec (that will as I said be sadly ignored for long long periods of time). Trying to sneak it in as "allowed" but not "forced" changes nothing in this argument. It does not induce broader acceptance, nor stability, just the opposite.

kevinhendricks avatar Jun 04 '25 13:06 kevinhendricks

I suggest that before adoption, we produce some EPUB3 files with HTML content paired with the same files produced with XML. Then test them side-by-side on a variety of unmodified reading systems. If this has been done, what are the results? If no tests have been done, it seems to me irresponsible to move forward with the change at this time.
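For anyone who wants to try this, a paired set of minimal test files can be generated with a short script. The Python sketch below is only a starting point under stated assumptions: the file names, the placeholder identifier, and the `text/html` media type for the HTML variant are all made up for the example, and the navigation document is kept in the XML syntax in both variants.

```python
import io
import zipfile

CONTAINER = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="package.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

NAV = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<head><title>Contents</title></head>
<body><nav epub:type="toc"><ol><li><a href="{href}">Chapter 1</a></li></ol></nav></body>
</html>"""

OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="uid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="uid">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
    <dc:title>Syntax test</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">2025-06-04T00:00:00Z</meta>
  </metadata>
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="c1" href="{href}" media-type="{media_type}"/>
  </manifest>
  <spine><itemref idref="c1"/></spine>
</package>"""

# Same content, two serializations.
XHTML_DOC = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Chapter 1</title></head>
<body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"""

HTML_DOC = """<!DOCTYPE html>
<html><head><title>Chapter 1</title></head>
<body><p>First paragraph.<p>Second paragraph.</body></html>"""


def build_epub(href: str, media_type: str, content: str) -> bytes:
    """Assemble a one-chapter EPUB in memory and return the archive bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry must be first and stored without compression.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER)
        zf.writestr("package.opf", OPF.format(href=href, media_type=media_type))
        zf.writestr("nav.xhtml", NAV.format(href=href))
        zf.writestr(href, content)
    return buf.getvalue()


# Assumption: the HTML variant is declared as text/html.
PAIR = {
    "test_xml.epub": build_epub("chapter1.xhtml", "application/xhtml+xml", XHTML_DOC),
    "test_html.epub": build_epub("chapter1.html", "text/html", HTML_DOC),
}
```

Writing each value of `PAIR` to disk (for example with `pathlib.Path(name).write_bytes(data)`) gives two files that differ only in the syntax and declared media type of the chapter, which is what a side-by-side test on unmodified reading systems needs.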

eshellman avatar Jun 04 '25 16:06 eshellman

alice_html.epub.zip

I have uploaded a zip archive of the modified version of Alice through the Looking Glass (public domain) epub. It is modified as follows:

  1. Stripped down to just chapter 1 and chapter 2
  2. All ending </p> tags have been removed from the files, which is normal for html but will result in a not well formed error under xhtml
  3. One spot in each chapter overlaps two inline tags <b><i></b></i> and one for <em><i></em></i> which again is perfectly legal and normal in html but not allowed in xhtml.
  4. the existing xml header has been removed
  5. simple html tag (no additional namespaces added)
  6. and the content.opf manifest lists the files under the media type "application/html"

And every modification I have made is extremely tame when compared to the mess that html can generate. I did not add any attributes with no values, did not extremely over lap tags, did not allow bare text in the body tag, etc. In other words this is the simplest and easiest test to pass that you can imagine.
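To show why an XML based toolchain rejects modifications 2 and 3, here is a small Python check using the standard library XML parser. The sample strings are illustrative fragments written for this example, not excerpts from the uploaded epub.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML, False on any parse error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


SAMPLES = {
    "closed paragraphs": "<body><p>One</p><p>Two</p></body>",
    "missing </p>": "<body><p>One<p>Two</body>",
    "overlapping inline tags": "<body><p><b><i>text</b></i></p></body>",
}

for label, markup in SAMPLES.items():
    verdict = "well-formed" if is_well_formed(markup) else "not well-formed"
    print(f"{label}: {verdict}")
# closed paragraphs: well-formed
# missing </p>: not well-formed
# overlapping inline tags: not well-formed
```

An XML parser stops at the first mismatched tag, so any tool that relies on one fails on these files before it ever gets to rendering.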

I then tried this in the flagship Thorium e-reader (and fwiw - thank god for Readium/Thorium - they are the one true force moving epub3 forward) and it immediately tried to write out the offending files (opened up a file save dialog) but never showed anything.

If they can not handle it, I don't think anything but the open source calibre reader will accept it (I have not tested it) as calibre ignores the epub spec and pretty much just tries to show something always.

Please look it over and give it a try. You will not find very many unmodified readers, especially among existing commercial e-readers, that will deal with it.

Interestingly enough I loaded it into Sigil (with its Mend on Open feature enabled), and Sigil both alerted me to the missing namespaces and unknown media-types, and incorrect mark-up and fixed it on the fly on load (whew!).

Hope this helps.

kevinhendricks avatar Jun 04 '25 18:06 kevinhendricks