More semantic way to represent the computer language in `code` element

Open IgorGilyazov opened this issue 3 years ago • 5 comments

https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-code-element

There is no formal way to indicate the language of computer code being marked up. Authors who wish to mark code elements with the language used, e.g. so that syntax highlighting scripts can use the right rules, can use the class attribute, e.g. by adding a class prefixed with "language-" to the element.
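
As an illustration of the class-based convention quoted above, here is a minimal sketch of how a highlighting script might read the language from a class string. The helper name `languageFromClass` is hypothetical and not taken from any real library.

```javascript
// Hypothetical helper: returns the language named by the first
// "language-*" class token, or null when no such token exists.
function languageFromClass(className) {
  const prefix = "language-";
  for (const token of className.split(/\s+/)) {
    // Ignore a bare "language-" token that names nothing.
    if (token.startsWith(prefix) && token.length > prefix.length) {
      return token.slice(prefix.length);
    }
  }
  return null;
}

console.log(languageFromClass("hljs language-pascal")); // pascal
console.log(languageFromClass("hljs"));                 // null
```

The language is recovered by string-parsing a styling hook, which is the fragility the rest of this issue is about.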

A similar problem is resolved for the meter element via the title attribute:

There is no explicit way to specify units in the meter element, but the units may be specified in the title attribute in free-form text.

For consistency, we can do the same with the code element:

<pre><code title="pascal">var i: Integer;
begin
   i := 1;
end.</code></pre>

IgorGilyazov avatar Apr 28 '22 21:04 IgorGilyazov

I don't think that the title should be used to define the code language; the title could describe what the code does. Maybe a language attribute would be better?

<pre><code title="Script to ask user for a display name" language="javascript">
prompt('Choose a username')
</code></pre>

I have also used highlight plugins, and I also think a classname such as language-js is terrible. A better solution would have been data-language="js", which could have been picked up with elm.dataset.language or something like it instead.
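
A rough sketch of the lookup described here, preferring `data-language` and falling back to the `language-` class prefix. The function name `detectLanguage` is hypothetical. It accepts any object shaped like an element, so it can be tried outside a browser.

```javascript
// Hypothetical helper. `el` is anything shaped like an Element:
// { dataset: { language?: string }, className: string }.
function detectLanguage(el) {
  // data-language="js" is exposed to scripts as el.dataset.language.
  const fromData = el.dataset && el.dataset.language;
  if (fromData) return fromData;
  // Otherwise fall back to the "language-" class prefix convention.
  const match = /(?:^|\s)language-(\S+)/.exec(el.className || "");
  return match ? match[1] : null;
}

// Plain objects stand in for DOM elements here.
console.log(detectLanguage({ dataset: { language: "js" }, className: "" })); // js
console.log(detectLanguage({ dataset: {}, className: "hljs language-js" })); // js
```

In a browser the same function could be called as `detectLanguage(document.querySelector("code"))`.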

jimmywarting avatar May 08 '22 19:05 jimmywarting

<...> the title could describe what the code does.

Personally, I completely agree with that. However, the title attribute can represent any sort of additional information; that's why the spec states it can be used to specify units for the meter element. Thus, for consistency, I suggest using it to specify the language for the code element.

<...> classname such as language-js is terrible

While a class name is encouraged to be informative, it is not the right tool for adding semantics. The more appropriate tool for such a task is microdata. For example, let's consider the programmingLanguage property of the SoftwareSourceCode type:

<div itemscope itemtype="https://schema.org/SoftwareSourceCode">
  <meta itemprop="programmingLanguage" content="javascript" />
  <code>console.log("Hello, World!");</code>
</div>

It works, but the code is too bloated for such a simple task. A distinct attribute like language would be ideal, but do we really need to add a unique attribute?

<...> a better solution would have been data-language="js"

While being a viable solution, user-defined attributes are too generic and can be used to represent anything:

<code data-language="lisp" data-author="John Doe" data-platform="gnu clisp 2.49" data-version="4.2">
(format t "Hello, World!")
</code>

IgorGilyazov avatar May 25 '22 21:05 IgorGilyazov

I suggest adding a codelang attribute for this.

Example:

<pre><code codelang="pascal">var i: Integer;
begin
   i := 1;
end.</code></pre>

gjvnq avatar Jun 13 '22 01:06 gjvnq

<code type="text/x-pascal">...</code>

A valid MIME type string is already defined, and the type attribute is already defined in other places.
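
To make the idea concrete, here is a minimal sketch of how a consumer might turn such a type value into a language hint. The function `languageHintFromType` is hypothetical, and so is the rule that strips a leading `x-`. No registry of source-code MIME types is assumed.

```javascript
// Hypothetical helper: derive a language hint from a MIME type
// such as "text/x-pascal". Returns null for anything that is not
// a plain type/subtype pair.
function languageHintFromType(type) {
  // Drop parameters (e.g. "; charset=utf-8") and normalise case.
  const essence = type.split(";")[0].trim().toLowerCase();
  const match = /^([a-z0-9!#$&^_.+-]+)\/([a-z0-9!#$&^_.+-]+)$/.exec(essence);
  if (!match) return null;
  // Assumption: an unregistered "x-" prefix adds nothing to the hint.
  return match[2].replace(/^x-/, "");
}

console.log(languageHintFromType("text/x-pascal"));    // pascal
console.log(languageHintFromType("application/x-sh")); // sh
console.log(languageHintFromType("pascal"));           // null
```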

Sudrien avatar Aug 27 '24 23:08 Sudrien

I agree with @Sudrien's proposal to introduce a type parameter for code. I've described the existing workarounds at https://github.com/whatwg/html/issues/11370#issue-3136480634, and don't see them being viable, nor do I see any reason to cause another src/href inconsistency by introducing another parameter name.

However:

  1. The fallback process should be documented. I presume that it would be text/plain by default, but I can imagine that some implementers would want to be able to heuristically identify the content type.

  2. Would type render lang="mis" unnecessary, or would I merely be replacing lang="mis" class="language-example" with lang="mis" type="example/example", or would the presence of a type parameter (or the tag itself) imply lang="mis"?
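
One possible reading of the fallback question in point 1, as a sketch only: use the explicit type when present, otherwise try a caller-supplied heuristic, otherwise fall back to text/plain. Nothing here is specified anywhere. `resolveCodeType` and `shebangHeuristic` are hypothetical names, and the heuristic is a toy.

```javascript
// Hypothetical resolution order for the type of a <code> element.
// `detect` is an optional heuristic supplied by the implementer.
function resolveCodeType(typeAttr, text, detect) {
  // 1. An explicit, non-empty type attribute always wins.
  if (typeAttr && typeAttr.trim() !== "") return typeAttr.trim().toLowerCase();
  // 2. Otherwise let the heuristic guess from the content, if there is one.
  const guessed = detect ? detect(text) : null;
  // 3. Otherwise assume plain text (an assumed default, not a specified one).
  return guessed || "text/plain";
}

// Toy heuristic, for illustration only: a shebang line means a shell script.
const shebangHeuristic = (text) =>
  text.trimStart().startsWith("#!") ? "application/x-sh" : null;

console.log(resolveCodeType("text/x-pascal", "", shebangHeuristic));    // text/x-pascal
console.log(resolveCodeType(null, "#!/bin/bash\nls", shebangHeuristic)); // application/x-sh
console.log(resolveCodeType(null, "var i: Integer;", shebangHeuristic)); // text/plain
```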

RokeJulianLockhart avatar Jun 11 '25 15:06 RokeJulianLockhart

@RokeJulianLockhart

With your details

  1. I'm missing the reasoning behind having a default value for an optional attribute? In fact, a null value seems useful in scenarios where language detection may happen.
  2. lang= would inherit from the global attribute unless specified, and is generally used in reference to human-to-human communication. type= would be for machine languages. The only reason I can think of to set a lang= attribute on a code block is if it contained comments in a different language than its parent. Like,
<html lang="en">
  <p>the code</p>
  <code type="application/x-sh" lang="fr">
    #!/bin/bash

    sudo dmesg | less
    # c'est une pipe
  </code>
</html>

And it's definitely possible there simply is no human readable language in a code block.
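
The inheritance described in point 2 can be sketched with a toy tree model. Each node is a plain object with an optional `lang`, an optional `type`, and a `parent` link. The function name `effectiveLang` and this node shape are assumptions made for illustration, not a DOM API.

```javascript
// Hypothetical model: node = { lang?: string, type?: string, parent?: node }.
// The human language is taken from the nearest node that sets lang,
// and the empty string stands for "unknown".
function effectiveLang(node) {
  for (let n = node; n; n = n.parent) {
    if (typeof n.lang === "string") return n.lang;
  }
  return "";
}

// Mirrors the example above: <html lang="en"> containing <p> and <code lang="fr">.
const html = { lang: "en" };
const p = { parent: html };
const code = { parent: html, type: "application/x-sh", lang: "fr" };

console.log(effectiveLang(p));    // en
console.log(effectiveLang(code)); // fr
console.log(code.type);           // application/x-sh
```

In this model `type` is read only from the element itself and is never inherited, which keeps the machine language separate from the human one.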

Sudrien avatar Jul 09 '25 04:07 Sudrien

@Sudrien, code, by default, doesn't contain comments. They're an exceptional addition. Some languages don't even support them.

Additionally, although some programming languages might be based upon non-English languages, or non-Latin scripts, what language is utilised generally has no relevance to the overall language of the document. Consider how most of the Mandarin world still writes their code in C++ and their scripts in Python or PowerShell, none of which have an alternative in their native script, despite it being technically feasible.

I was thinking in that context. That's why I thought it should default to lang="mis", or similar, since at least Firefox treats that as no language. Otherwise, wouldn't a screen reader need to special-case code?

I'll quite happily acquiesce if you've reason to believe that this should be based upon the comments, but since so few code examples contain them (in comparison to those which don't), I believe that it should be explicit.

RokeJulianLockhart avatar Jul 09 '25 10:07 RokeJulianLockhart

In the linked #10580 marked as a duplicate there is also...

Ideally this (might?) allow browsers to read programming languages in code blocks and add syntax highlighting to the default user agent stylesheet, but I recognise that's a very ambitious outcome for this proposal.

Just sort of curious if the aim is limited to semantically communicating something or is there an intent to get to syntax highlighting or something as well?

bkardell avatar Jul 09 '25 11:07 bkardell

https://github.com/whatwg/html/issues/7869#issuecomment-3052372817

@bkardell, it'll be utilised in that manner regardless by consumers, considering that the WHATWG and BCP recommendations are so fragile in comparison.

RokeJulianLockhart avatar Jul 09 '25 12:07 RokeJulianLockhart

@RokeJulianLockhart I am asking what you are requesting a browser itself would do with this information? Anything?

bkardell avatar Aug 04 '25 14:08 bkardell

@bkardell, I'm not the author of the comment you linked to. However, I also presume exactly what you surmised shall, in practice, occur. Enough Electron-based text editors render syntax highlighting in a remarkably lightweight manner for markup supersets like text/Markdown (CommonMark) that I would expect some implementers to implement specific syntax highlighting predicated upon the presence of the type parameter, which would differ based upon its argument.

RokeJulianLockhart avatar Aug 05 '25 03:08 RokeJulianLockhart

I'm really just trying to figure out what specifically this issue is asking for... Is it semantics so that people can build things atop: recognize different languages embedded for ML, a reverse proxy that does some kind of rewriting for formatting, or so that you can author extensions that do that, or to launch associated REPLs, for example? Just given a standard way to annotate them, you could do those and a lot more. But, as already stated, I think we can already do that much today too by sharing schema.org. The main thing here is making it shorter, I guess?

OR... is the main idea that browsers would just natively display them with syntax highlighting? Those seem like very different asks to me.

bkardell avatar Aug 07 '25 20:08 bkardell

@bkardell, thanks. That makes it clearer.

I want to be able to consistently annotate with a MIME type, such that a consumer can recognise a language: your first option. I didn't know of a more extensible way to annotate these, but that may be because I'm unfamiliar with some of the use cases you've proposed.

Tangentially, I would like to see this as a feature of the browser, since text editors do it so damn well. However, I recognise that that's outside the scope of this: it's a discussion for Bugzilla. Regardless, the second use case is dependent upon the first.

RokeJulianLockhart avatar Aug 07 '25 20:08 RokeJulianLockhart