commonmark-spec i and b vs em and strong

Em and strong aren't always the right thing. Is there a way to get i or b?

I know we love semantics over rendering but em and strong are sometimes hypercorrect. Looking at the html source of someone who wrote in Markdown and seeing em and strong when they clearly meant i or b, or sometimes cite.

For example when introducing a new name or term in another language.

Jun 06 '20 17:06 snan

Implementers (or plugin authors) could decide to use asterisk * for em / strong and underscore _ for i / b or vice versa, but I think it is too late to make this a mandatory change – also it is a distinction rather specific to HTML output.

PS: I prefer the underscore for presentational elements, because it fits well with introducing underlined u when four of those characters are used.
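
A minimal sketch of the asterisk/underscore split suggested above, assuming a deliberately simplified inline renderer (no nesting, no intraword rules, no escapes). The `TAGS` table and `render_inline` are hypothetical names for illustration, not part of any CommonMark implementation:

```python
import re

# Hypothetical mapping from delimiter run to HTML tag.
# Swap the values to get the "vice versa" arrangement.
TAGS = {
    "*": "em", "**": "strong",
    "_": "i", "__": "b",
}

# A run of one or two identical delimiters, some text, then the same run again.
# Longer runs are listed first so "**" is not read as two single "*".
PATTERN = re.compile(r"(\*\*|__|\*|_)(.+?)\1")

def render_inline(text, tags=TAGS):
    """Replace each delimited span with the tag configured for its delimiter."""
    def repl(match):
        tag = tags[match.group(1)]
        return f"<{tag}>{match.group(2)}</{tag}>"
    return PATTERN.sub(repl, text)

print(render_inline("_Hamlet_ is *really* good, and **long**."))
```

Because the choice lives in one table, an implementation could expose it as an option without touching the parsing rules.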

Jun 06 '20 18:06 Crissov

Inkwells have been drained and spools of paper emptied rehashing the arguments for whether <em>/<strong> or <i>/<b> should be the default HTML output for the markdown elements. The lack of reference to prior discussions doesn't really endear your argument to anybody, and most of the arguments for and against a tag pair cut both ways.

The short version is that this is an HTML distinction that doesn't exist in Markdown, and use cases vary, so each application may make different choices on how to map the elements to another format. Markdown only has one pair of semantic elements (in spite of having two syntax options) while HTML has two. You can really only map to one of them at a time. Many rendering engines give you the option, or a way to filter tags and write them how you please. Doing both is not really an option at this point for legacy reasons.

Jun 06 '20 19:06 alerque

Personal attacks and assumptions about what I've researched are not appropriate. I'm not new to the topic of what elements Markdown should use. I'm just new to CommonMark specifically.

HTML has two

With one being a not-always-valid subset of the other.

Jun 06 '20 19:06 snan

I wasn't making a personal attack. I did assume since you didn't even hint at knowing any of the background that you might not be aware of it. In any event if you are aware of some of the background then surely you know jumping in with a dogmatic assertion that one set of tags is better isn't going to resolve things.

With one being a not-always-valid subset of the other.

No, one is not a subset of the other. It's more like two competing standards, one focusing more on structural semantics and the other on presentation and legacy. If one was a subset of the other, the superset would always be interchangeable with some loss of meaning. Such is not the case: one could intend a kind of emphasis that was not supposed to be styled with italics, and italics may just as well be used for things other than emphasis.

Jun 06 '20 20:06 alerque

I didn't mean to step right into a fight here. I love markdown and I know there is a lot of hard work behind it (and projects adjacent to it) over the years. I wouldn't care about this issue if I wasn't invested in the language and seeing it as the future of markup.

And, I understand that I'm probably a decade too late (or I don't know when the CommonMark project started). It seems pretty locked in. But this language might still be in use fifty years from now. I hope Markdown's future is longer than its past.

The W3C puts it like this in the HTML standard:

The i element represents a span of text in an alternate voice or mood, or otherwise offset from the normal prose in a manner indicating a different quality of text, such as a taxonomic designation, a technical term, an idiomatic phrase from another language, transliteration, a thought, or a ship name in Western texts.

Emphasis is a subset of that.

But yeah, their description of b isn't a superset of strong emphasis on the other hand, although historically b has been used that way.

I tried to be careful when phrasing the original post in this issue thread. I wrote "Is there a way to get i" rather than "em is always wrong".
Seeing

<em>Hamlet</em> is a great play. It has a certain <em>je ne sais quoi</em> about it.

in source code is not correct. Seeing

"Wow, this <i>really</i> is a warm day!"

has legacy precedent and is not wrong per se, even though I'd rather use em there.

Markdown by its very nature as a plaintext-adjacent, "email tradition based" language does have its roots in presentation and legacy rather than structural semantics, which is why I feel that a way to easily make i or b is appropriate. Just as easily as I would write

_Hamlet_ is a great play, it has a certain *je ne sais quoi* about it.

in an email.

But it's for the sake of structural semantics that this matters. A web-scrapin' robot going through web pages and trying to scan for what is emphatic text might go "Wow, people back in the early 21st century really got stoked about their French phrases!"

Yes, i or b is sometimes a little blunt, sometimes a little non-specific, sometimes not optimal. But when I overuse em or strong that's sometimes flat out a falsehood, sometimes flat out what I don't mean.

I've seen some semantical horror stories over the years, like people using h1 for centering images (because that particular forum had h1 headers centered) or h6 when semantically they should've used h1 "because they think the smaller letters look cuter". There's no way to stop all such misuse, but we don't have to build it into the language, either.

Jun 06 '20 20:06 snan

You're talking to somebody who uses custom Pandoc filters to overload *...* and _..._ to render to different markup, *_..._* to render a third kind of highlighting that is not those two nested nor is it /. I realize there is a use case for more precise semantics than Markdown provides for, I just don't think there is any way to codify them as "common". This project isn't about inventing new markup (if anything it's a downgrade from the Pandoc + extras flavor I use) but about standardizing the exact usage of ambiguous bits.

The way you get <i> instead of <em> is by requesting it from your tooling. The spec and reference implementation default to one pair of tags. I believe the reason for that is that it's more flexible semantic markup with less direct locking to a specific presentation style, but you'd have to look up the actual discussions to see if that was the deciding factor. Since Markdown has no way to differentiate between them it is outside the scope of CommonMark to specify when to use which. You can either use tooling that lets you pick or post-process the output.
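
A minimal sketch of the post-processing route, assuming the renderer emits attribute-free `<em>` and `<strong>` tags as in the spec's examples. `swap_tags` is a hypothetical helper; note that it would also rewrite any `<em>` an author typed deliberately as raw HTML:

```python
import re

# Presentational replacement for each semantic tag.
SWAP = {"em": "i", "strong": "b"}

# Opening or closing <em>/<strong> tags without attributes.
TAG = re.compile(r"<(/?)(em|strong)>")

def swap_tags(html):
    """Rewrite <em>/<strong> in rendered HTML to <i>/<b>."""
    return TAG.sub(lambda m: f"<{m.group(1)}{SWAP[m.group(2)]}>", html)

print(swap_tags("<p><strong>Gomphocarpus (<em>Gomphocarpus physocarpus</em>)</strong></p>"))
```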

Jun 06 '20 21:06 alerque

To go back to the first question:

Em and strong aren't always the right thing. Is there a way to get i or b?

I do think they often are the right thing, but indeed, they aren’t always. To get <i> or <b>, you can type those in literally, because both original Markdown and CommonMark have a “breakout” feature where it accepts HTML. As others have said, tools could allow options (I particularly like @Crissov’s idea: https://github.com/commonmark/commonmark-spec/issues/652#issuecomment-640102691)

I do think it’s unfortunate that CommonMark does not say anything about semantics. And that its definition (“6.4 Emphasis and strong emphasis”) is not aligned with HTML (In HTML, nested emphasis is used for “strong” emphasis, whereas the strong element means importance, seriousness, or urgency).

Jun 07 '20 08:06 wooorm

What the CM reference implementation should do, though, is to retain information in its AST about which character has been used in the source.
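
A minimal sketch of what retaining the delimiter could look like, assuming a toy node type rather than the reference implementation's actual AST. `Emphasis`, `marker`, and `presentational_marker` are hypothetical names:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Emphasis:
    text: str     # content (a plain string here; a real AST would hold child nodes)
    strong: bool  # True for doubled delimiters (** or __)
    marker: str   # "*" or "_", exactly as typed in the source

def to_html(node: Emphasis, presentational_marker: Optional[str] = None) -> str:
    """Render a node; spans typed with presentational_marker become i/b."""
    if node.marker == presentational_marker:
        tag = "b" if node.strong else "i"
    else:
        tag = "strong" if node.strong else "em"
    return f"<{tag}>{node.text}</{tag}>"

hamlet = Emphasis("Hamlet", strong=False, marker="_")
print(to_html(hamlet))                             # default behaviour, marker ignored
print(to_html(hamlet, presentational_marker="_"))  # opt-in presentational mapping
```

With the marker kept on the node, the default output is unchanged and the distinction only matters to renderers or plugins that ask for it.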

PS: https://talk.commonmark.org/t/em-strong-vs-i-b-or-cite-dfn-etc/1242 https://talk.commonmark.org/t/revisting-underline-healthcare-documents/3078/3

Jun 07 '20 08:06 Crissov

@snan I agree with most everything if not everything you said. I agree with the following in spirit

It seems pretty locked in. But this language might still be in use fifty years from now. I hope Markdown's future is longer than its past.

but I don't think Markdown will last even another ten unless it evolves*. It's still mostly used by technical types, myself included, who are comfortable writing for machines -- that is to say, quite used to and adept at thinking in terms of, What do I need to do to get the machine to do what I want?. Markdown was definitely a step in the right direction away from HTML for authoring. But we need to make more steps.

*I'm not sure it can. I think/hope something Markdown-like will replace it. A bit of a reboot is necessary.

Jun 09 '20 00:06 vassudanagunta

I do think they often are the right thing, but indeed, they aren’t always. To get <i> or <b>, you can type those in literally, because both original Markdown and CommonMark have a “breakout” feature where it accepts HTML.

I was under the (mistaken?) impression that that was for HTML output only; i.o.w. it's more of a "passthrough" feature than a "breakout" feature. In pandoc 2.5, which I have at hand, when compiling the text to LaTeX, it just drops those <i> tags.

somebody who uses custom Pandoc filters

♥

I do that too, on my own system, (lua ftw) but the reason I just found out about CM is that Stack Exchange announced that they are going to adopt it and I was like "OK, so it's no longer Gruber that I have to go bug about this".

Jun 09 '20 06:06 snan

I think/hope something Markdown-like will replace it. A bit of a reboot is necessary.

I'm seeing a lot of sites switching over to wysiwyg or wysiwym but I'm not wholly on board with that. I love markup languages.

Jun 09 '20 06:06 snan

In HTML, nested emphasis is used for “strong” emphasis, whereas the strong element means importance, seriousness, or urgency

Wow, so it is a subset of b after all!

Jun 09 '20 06:06 snan

I was under the […] impression that that was for HTML output only […] In pandoc 2.5, which I have at hand, when compiling the text to LaTeX, it just drops those <i> tags.

That impression is correct: though when going to LaTeX, it doesn’t really matter whether <i>, <em> or something else is used, no? These semantics matter when going to HTML, in which case, HTML tags are fine?

Wow, so it is a subset of b after all!

The HTML spec also says on <b>:

The b element represents a span of text to which attention is being drawn for utilitarian purposes without conveying any extra importance and with no implication of an alternate voice or mood, such as key words in a document abstract, product names in a review, actionable words in interactive text-driven software, or an article lede. […] The b element should be used as a last resort when no other element is more appropriate.

I don’t think I agree that it’s good to see <strong> as “inheriting” from / subset of <b>. The last quoted sentence especially makes it sound to me as if defaulting to <b> is a worse approach.

Jun 09 '20 08:06 wooorm

That impression is correct: though when going to LaTeX, it doesn’t really matter whether <i>, <em> or something else is used, no? These semantics matter when going to HTML, in which case, HTML tags are fine?

It just drops the tags. So if you want to publish to both TeX and HTML you're sol if you use <i> tags.

The b element should be used as a last resort when no other element is more appropriate.

This language to me also implies fallback, catchall, default. When you can be more specific, you should. With a visual/presentation based markup like the email-derived **asterisks** you can't be that specific, and you can't easily select an appropriate element. Not that the W3 spec's specific wording is the be-all-end-all of my argument here, that'd be taking "appeal to authority" a bit far. History, legacy, intent and spirit of the HTML language is also relevant.

Jun 09 '20 08:06 snan

As alluded to upthread, we know that through the life-changing magic of CSS, em and strong aren't strict subsets of i and b respectively. You can style it to use underlining or small caps to emphasize. So I don't mean a strict subset, I mean… kinda a subset. It's correct to say that one is semantics and the other is presentation.

It's just that

- The semantic elements cover fewer use cases. And very many of those use cases are a subset of the much larger set of use cases that the presentation-based elements cover.
- Markdown's historical precedent, "plaintext email formatting", is a presentation-based language.

I'm not disputing that we want semantics. I just don't want wrong semantics.♥

Jun 09 '20 09:06 snan

I'm also definitely not saying that the solution is that markdown's output for em and strong should instead always be i and b. I've tried avoiding taking that position in this thread. It's what I would do, but I realize that that's a compromise with some serious downsides, and I'm open to other solutions.

Jun 09 '20 09:06 snan

It just drops the tags.

That to me sounds like a Pandoc problem, which I was under the impression could turn HTML into TeX.

I'm also definitely not saying that the solution is that markdown's output for em and strong should instead always be i and b.

I do see that you never proposed that in posts; but to me the title of this issue, “i and b vs em and strong”, pits them against each other.

I do think em and strong are better defaults than i and b, but I recognize they aren’t always. I would say that CommonMark talking about semantics is an acceptable solution, ushering users to care about semantics instead of presentation. And that i and b created according to @Crissov’s suggestion would be a welcome addition in userland.

Jun 09 '20 10:06 wooorm

That to me sounds like a Pandoc problem, which I was under the impression could turn HTML into TeX.

Pandoc can indeed convert HTML to LaTeX. However, here the input format is Markdown, and pandoc drops raw HTML when rendering to non-HTML formats. (This behavior is at least sometimes what you want.)

However, you can always use a lua filter that converts these raw HTML nodes to something that makes sense in your target format.
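
A minimal sketch of that idea, written in Python against pandoc's JSON AST rather than as a Lua filter. It assumes raw HTML tags arrive as `{"t": "RawInline", "c": ["html", "<i>"]}` nodes, which is my understanding of the JSON shape and worth verifying against your pandoc version. `HTML_TO_LATEX` and `convert_inlines` are hypothetical names, and the sketch handles only a flat list of inlines:

```python
# Assumed mapping from raw HTML tags to raw LaTeX fragments.
HTML_TO_LATEX = {
    "<i>": r"\textit{", "</i>": "}",
    "<b>": r"\textbf{", "</b>": "}",
}

def convert_inline(node):
    """Turn a raw HTML <i>/<b> tag node into the matching raw LaTeX node."""
    if node.get("t") == "RawInline":
        fmt, text = node["c"]
        if fmt == "html" and text in HTML_TO_LATEX:
            return {"t": "RawInline", "c": ["latex", HTML_TO_LATEX[text]]}
    return node  # everything else passes through unchanged

def convert_inlines(inlines):
    """Apply convert_inline to a flat list of inline nodes (no recursion)."""
    return [convert_inline(node) for node in inlines]

example = [
    {"t": "RawInline", "c": ["html", "<i>"]},
    {"t": "Str", "c": "Hamlet"},
    {"t": "RawInline", "c": ["html", "</i>"]},
]
print(convert_inlines(example))
```

An unbalanced tag in the source would produce unbalanced braces in the LaTeX output, so a real filter would want to pair tags before converting them.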

Jun 09 '20 16:06 jgm

I love lua♥ We also discussed this for pandoc specifically over on pandoc's issue tracker.

Jun 09 '20 16:06 snan

Here's just one idea (just green hat brainstorming for a solution here):

What if </i> and </b> and </cite>, and their respective opening tags (attributes could be dropped) could be elevated to be part of the language instead of seen as passing HTML through?

Jun 19 '20 07:06 snan

I prefer the underscore for presentational elements, because it fits well with introducing underlined u when four of those characters are used.

Intuitively I feel the same way; I usually do think "emphasis" when I use the asterisks, and do usually think presentationally cursive when I use the underscore (sometimes that part of my brain is sloppy and thinks presentationally cursive when it should be thinking emphasis ← wow, I just did it in this sentence involuntarily, those were underlines just then).

However, in some implementations asterisks work inside of words, like t*hi*s, and underscores don't, like t_hi_s. Are people more likely to use presentational cursive in words or emphasis? I guess emphasis so this paragraph isn't much of a "however" and instead should be an "additionally" since I come down on the same divide as you do, Crissov.

And that it might be too late to change, I wouldn't know if that was true. Crystal ball is on the fritz over here.

Jun 19 '20 08:06 snan

All proper implementations of Commonmark support asterisks inside words (but not underscores), while only some implementations of Markdown do.

Jun 19 '20 09:06 Crissov

Which only strengthens what I was trying to say about that, rather than contradict it.♥

Jun 19 '20 09:06 snan

What if </i> and </b> and </cite>, and their respective opening tags (attributes could be dropped) could be elevated to be part of the language instead of seen as passing HTML through?

Oh, I just saw that this is getting downvotes. And I'd rather find a perfect solution than a compromise that no-one is truly happy with, but it's frustrating that we aren't getting anywhere nearer a solution here. For those who use pandoc for html only, it's not a big problem because they can do manual italics but it's difficult for when we want to use the same source documents for standards compliant HTML and for ConTeXt or LaTeX.

Jul 09 '20 12:07 snan

A month has passed and I find I'm OK with writing <i>, <b> and <cite> manually. It's a strength of Markdown that the non-specific syntax is easy to remember and that there is redundancy. Textile and YAML are stressful to write for me, languages where I need to get everything just so, while Markdown is chill.

However, there are many times where the i, b or cite is getting lost. On Reddit, on Stack Exchange, and sometimes in Pandoc. That's why I wanted i, b and cite to become "part of the language" or at least some sort of recommendation that implementations don't throw away this information.

Aug 22 '20 19:08 snan

Implementers (or plugin authors) could decide to use asterisk * for em / strong and underscore _ for i / b or vice versa

~~It would be good if this was made explicit.~~ I misunderstood what was said here. I thought it said implementers can decide to use */_ for i and **/__ for b. That's what I want.

Here is a thread where that has been an issue.

You're talking to somebody who uses custom Pandoc filters to overload *...* and _..._ to render to different markup

That doesn't help someone who is posting on Reddit or Stack Exchange or the hundreds of other sites where these render into em. CommonMark implementations make the web full of `<em>Gingko biloba</em> is a tree never mentioned in <em>Romeo and Julia</em>`.

Example 393 in CommonMark's own spec is evidence of this. The call is coming from inside the house! 😱

Mar 27 '23 06:03 snan

That's why I wanted i, b and cite to become "part of the language" or at least some sort of recommendation that implementations don't throw away this information.

If by “part of the language” you mean a new syntax, I’d probably be against it. I worry that the grammar will become too crowded. Depending on what design you come up with, it’s either likely easy to type, which will also mean that it would break lots of existing markdown. Or it’s complex to type, but then I’d prefer something like generic directives.

For a recommendation, I dunno. But I recognize that lots of people are looking for recommendations on what to do, but the spec currently doesn’t want to get involved in those decisions. So maybe an appendix for such things might be useful. For example talking about semantics (ref: https://github.com/commonmark/commonmark-spec/issues/652#issuecomment-641194597)

I misunderstood what was said here. I thought it said implementers can decide to use */_ for i and **/__ for b. That's what I want.

I’m not quite sure what you’re saying. To phrase it differently: I am in favor of implementations adding options to use i instead of em, and b instead of strong: https://github.com/commonmark/commonmark-spec/issues/652#issuecomment-641194597.

I don’t think we need to add that everywhere in the spec though. I don’t think we need to describe that implementations are free to use div instead of p, or h2 instead of h1. Etc.

Example 393

Can you clarify what you don’t like about that example?

Mar 27 '23 07:03 wooorm

Example 393

Can you clarify what you don’t like about that example?

Sure, thanks for the question, that's illustrative of the issue so it's good to dig in a li'l deeper:

The example is:

<p><strong>Gomphocarpus (<em>Gomphocarpus physocarpus</em>, syn.
<em>Asclepias physocarpa</em>)</strong></p>

That is not correct HTML, which should be:

<p><strong>Gomphocarpus (<i>Gomphocarpus physocarpus</i>, syn.
<i>Asclepias physocarpa</i>)</strong></p>

or, depending on the context, maybe even:

<p><b>Gomphocarpus (<i>Gomphocarpus physocarpus</i>, syn.
<i>Asclepias physocarpa</i>)</b></p>

Linnaean names, like that Latin name for balloon plant used in that example, are always marked italics, cursive, oblique, or otherwise text-decorated, but not emphasized; <em> is semantically wrong for them. I can not correctly write that plant's Latin name here on GitHub since there is currently no way, that I know of, to emit <i>.

<i> and <b> are good fallbacks. They only indicate style. <em>, <cite>, <strong> are for when you specifically wanna indicate the semantics of emphasis, citation, or strong emphasis.

Referring to a poodle as "a dog" is slightly weird but not that bad and it's technically correct (and that's what we're doing when we're using <i> when we mean <em>).

Referring to a collie as "a poodle" is, on the other hand, quite wack (but that's what we're doing when we're using <em> for a Linnaean name or for a citation or for a foreign-language phrase).

And before someone asks: "But we should express semantics, not style. I heard someone back in the nineties say that a lot of the time people are wrongly using <i> when we should be more precise and use <em>". Yes, that's true. Em is more specific than i and is better to use, but only when we know for sure that we mean emphasis.

Yes, it's true that <em> is the most common one. 90% of the time it's what you mean. But just because you are in a town where 90% of the dogs are poodles it doesn't turn a collie into a poodle. A collie is still a non-poodle dog, just like a citation or a foreign phrase is still a non-emphasis use of italics.

I backed off from this argument a few years ago because of this argument: "We support raw HTML so people can type out <i> or <cite> or <b> when they mean <i> or <cite> or <b>, and they can use the shorthand * or _ for the most common one, which is <em>, and ** and __ for the second-most common one, which is <strong>."

But two things are becoming clear to me.

1A. People are using CommonMark-derived converters in places where raw HTML is (and should be) turned off, like on public forums and comment sections.
1B. Implementors of those public forums are referring to this spec saying "I'm just doing what CommonMark says".

2. Not everyone is, wants to be, or needs to be a linguistics nerd. People shouldn't have to learn the specific minutiae of when to use em, cite, or i. They just want the text to look slanted so they jam stars or underscores around. Making * and _ be <i> matches their expectations.

That's why my recommendation is this:

Sites where raw HTML is turned off (as it should be, for public text inputs) should emit <i> for * and for _, and <b> for ** and for __.

Installations where markdown is used as a tool for writers, where it's a shortcut for HTML as opposed to a replacement for it, and raw HTML is allowed, may optionally continue to emit <em> and <strong> or have a flag for that behavior.

That's what I would use for my own blog where I can type out <i>, <cite>, or <b> manually, as needed, and most of the time I would get the default, <em>. I just checked, and I use <em> 70% of the time, <cite> 20% of the time, and <i> 10% of the time, so it's appropriate for me to have * and _ emit em since I know to get the others when I need them (I even have a shortcut that I bolted on to Emacs markdown-mode to get them as raw HTML), but even then, that's not necessarily the best for all installations depending on how nerdy the users of that tool are expected to have to be.

Not everyone should have to learn this stuff but that doesn't mean it's OK that the web is littered with wrong semantics like <em>Gomphocarpus physocarpus</em>.

That's more wrong than I'm really tired.

If by “part of the language” you mean a new syntax, I’d probably be against it.

Yeah. It became clear upthread that that particular idea (elevating <i> and <cite> and <b> from being seen as raw HTML to being seen as first-class markdown language constructs) was not popular even with those who otherwise agree with me, and I've accepted that that idea is not gonna fly.

I’m not quite sure what you‘re saying

I want to be able to use italics and bold on Reddit, StackExchange, here on GitHub, and dozens of other sites that pass the buck by saying "We're only doing what CommonMark says".

I don’t think we need to add that everywhere in the spec though. I don’t think we need to describe that implementations are free to use div instead of p, or h2 instead of h1. Etc.

It's becoming clearer and clearer to me that we do need to be explicit about that.
C.f. this Comrak pull request.

Summary:

It should be i and b instead of em and strong (at least on most of the websites out there like GitHub, Reddit, StackExchange, wikis etc).

I like that * and _ both mean the same thing, that * can be used intraword and _ can't, etc. That's all good. I just don't want to call collies "poodles".

Mar 27 '23 08:03 snan

Latin name[s] […] are always marked italics

OK, so there is a typographic convention of how things should look. I want to stress that <i>, in HTML, does not mean italics, or in any way how things look. This is perhaps pedantic, but if you want to express italics, use font-style: italic, or for oblique use font-style: oblique. <i> is about “offset[ing] from the normal prose”. Latin names aren’t normal English, that’s why they can be marked as <i>: https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-i-element.

but not emphasized

I don’t see any reason to conclude that Latin names must never be marked as stress emphasis as described by the HTML spec: https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-em-element.

I can understand that it might be better to remove things with certain typographic conventions here. Perhaps:

My **cats (*Cheddar* and *Whiskers*)**.

<p>My <strong>cats (<em>Cheddar</em> and <em>Whiskers</em>)</strong>.</p>

…might be an improvement. I am sure there’s an example we can think of that you accept that embeds <em> inside <strong>, while not using Latin names.

I can not correctly write that plant's Latin name here on GitHub since there is currently no way, that I know of, to emit <i>.

You can type <i>Gomphocarpus physocarpus</i> here: Gomphocarpus physocarpus.

They only indicate style.

You answer this yourself: “But we should express semantics, not style.” HTML is about semantics, not about presentation.

1A 1B

Sure

2

Using your own terms around poodles and collies: just because 90% of people don’t give a hoot about semantics, doesn’t mean we need to remove all semantics and go with HTML 2 again.

That's why my recommendation is this:

I have supported this: https://github.com/commonmark/commonmark-spec/issues/652#issuecomment-641194597

but that doesn't mean it's OK that the web is littered with wrong semantics like <em>Gomphocarpus physocarpus</em>. That's more wrong than I'm really tired.

I am not sure why you deem one more or less wrong than the other. Both can be right. Both can be wrong.

It became clear upthread that that particular idea ([…]) was not popular even with those who otherwise agree with me, and I've accepted that that idea is not gonna fly.

If you’re interested in a markdown-like language that does make separate tags a part of the language, you might enjoy https://mdxjs.com.

I want to be able to use italics and bold on Reddit, StackExchange, here on GitHub, and dozens of other sites that pass the buck […]

It's becoming clearer and clearer to me that we do need to be explicit about that. C.f. this Comrak pull request.

That’s not what your PR there does. You break CommonMark there by changing everything for everyone. Your feature request is to do this optionally, which is acceptable to the maintainer there: https://github.com/kivikakk/comrak/pull/285#issuecomment-1484817507

Mar 27 '23 09:03 wooorm

<i> is about “offset[ing] from the normal prose”.

Yes, that's a really good way to phrase what the semantics of <i> is about! Thank you, that wording makes my case (that we should emit <i> by default) a lot stronger.

This is perhaps pedantic, but if you want to express italics, use font-style: italic, or for oblique use font-style: oblique.

Right, offsetting from the normal prose is what we want, as opposed to a specific visual representation of that offset.

I don’t see any reason to conclude that Latin names must never be marked as stress emphasis as described by the HTML spec: https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-em-element.

I mean, they can be part of stress, like you could say “Is that a Tyrannosaurus Rex?” the same way you could say “Is that a spider?” but marking them as Latin names is done with <i>, not <em>. It is not correct to use stress to offset them.

I've told this story before but I remember an old social media site (now defunct) in the early 00s and I saw someone had managed to center an image on their profile, something that the dinky markup of the time didn't allow. But when I looked under the hood, I saw to my shock & horror that they had marked the image as a h1, which had caused the site CSS to center it. That's not what h1 means. And that's as bad as using stress to mark Latin names. Using italics to mark them, sure. Because the point is to offset them from prose. (I wrote as much in my previous post, saying "italics, cursive, oblique, or otherwise text-decorated". I've seen them underlined in old type-written manuscripts and that's fine too, for example. And <i>+CSS can do that.)

I can understand that it might be better to remove things with certain typographic conventions here. Perhaps:
My **cats (*Cheddar* and *Whiskers*)**.
<p>My <strong>cats (<em>Cheddar</em> and <em>Whiskers</em>)</strong>.</p>
…might be an improvement.

That is not better. That'd be super weird, semantically, to stress their names that way.

I am sure there’s an example we can think of that you accept that embeds <em> inside <strong>, while not using Latin names.

Yeah, I can think of a few. The other examples use text like foo and bar and that'd be fine here. Using names is not good.

You can type <i>Gomphocarpus physocarpus</i> here: Gomphocarpus physocarpus.

If that's true, GitHub is not an applicable example for this problem. But there are many other sites out there where that's not possible because they have turned off raw HTML, and, there are also users who don't understand (and shouldn't have to understand) when to use which.

Using your own terms around poodles and collies: just because 90% of people don’t give a hoot about semantics, doesn’t mean we need to remove all semantics and go with HTML 2 again.

Cite and em were added to HTML at the same time as i and b were, with HTML 2 (as you know, since you linked to the HTML 2 RFC which does mention em and strong).

That's why my recommendation is this:

I have supported this: #652 (comment)

but that doesn't mean it's OK that the web is littered with wrong semantics like <em>Gomphocarpus physocarpus</em>. That's more wrong than I'm really tired.

I am not sure why you deem one more or less wrong than the other. Both can be right. Both can be wrong.

Being overly broad is less wrong than being specific-but-wrong. Calling a poodle a "dog" is less wrong than calling a collie a "poodle".

If you’re interested in a markdown-like language that does make separate tags a part of the language, you might enjoy https://mdxjs.com.

The problem isn't my own websites where I have control over what Markdown implementations to use. I personally already have a setup that lets me write my choice of em, cite, i, strong, or b.

The problem is sites like Reddit, StackExchange and many, many others where A: users have no way to type i or cite as distinct from em, and B: they shouldn't have to, they shouldn't have to learn to do that nor to learn to understand hyper nitty-gritty semantics perfectly. And there's no way to automatically detect when they mean cite or em or i so I propose we use i. Offsetting from prose is what they want, even though they might mean to do that offsetting for emphasis purposes 70% of the time.

In hindsight it was a bad idea for HTML to create em and cite and strong tags because they presupposed every single formatted text online needs to go through an editor with enough linguistics chops to distinguish between which to properly use when. That's fine for institutions but an unreasonable requirement for a discussion site or other public-writable spaces.

I'm a linguist; I can nerd out enough to know when to use em and when to use cite and when neither is applicable and I need to use the superset, i. And even then I make mistakes every now and then. But I'm not a biologist, so, to reuse the poodle/collie example: if every website like Reddit or StackExchange required me to use one syntax when talking about poodles, one syntax when talking about collies, and another when talking about non-poodle non-collie dogs, I'd be in trouble.

Letting * and _ be <i> lets people keep using * and _ in the way they think they are already using them. To offset from normal prose.

That’s not what your PR there does. You break CommonMark there by changing everything for everyone. Your feature request is to do this optionally, which is acceptable to the maintainer there: kivikakk/comrak#285 (comment)

The maintainer hadn't written that response yet when I posted here.

I think the default should be i and b, with em and strong being tucked away as an option (only to be turned on by people who know exactly what they are doing and who can emit cite and i and b by other means, such as raw HTML).

Changing everything for everyone is the point. There's a lot of collies marked "poodle" out there on the web. If they can be turned into "dogs" that'd be a win for semantics.

These sites look to CommonMark as an authority on this. They're like "we're emitting em and strong because that's what CommonMark tells us to do". That wasn't necessarily CommonMark's intent (which was more to clarify the specifics of nesting and overlapping and so on), but that's what has happened and that gives CommonMark a responsibility to clear this up.

Mar 27 '23 10:03 snan
