
Metadata

Open jgm opened this issue 1 year ago • 60 comments

Should there be a built-in format for metadata, or should that be considered distinct from the markup syntax?

If so, what?

Do we need structured keys such as YAML provides? Would be nice to avoid the complexity of YAML, but otherwise YAML is nice for this. Maybe some simplified subset of YAML.

jgm avatar Jul 31 '22 18:07 jgm

Is the purpose of the metadata block to set variables in the standalone output doc template? (I'm thinking here of my rough understanding of how Pandoc works.)

My understanding is that YAML is a rather complex format. What about TOML?

uvtc avatar Aug 01 '22 01:08 uvtc

I don't much like TOML for this purpose; it requires you to quote strings, and it makes it very inconvenient to represent e.g. an array of references.

jgm avatar Aug 01 '22 02:08 jgm

YAML has bad handling of anything including newlines. So "simplified subset of YAML" would not solve the issue TOML does solve. But I agree that quoting stuff is annoying.

One thing that I am missing is the ability to spread metadata across the document: for longer documents one loses context if all the metadata has to go at the very beginning. So my requirement would be to support multiple metadata blocks instead of just one.

Btw. how about just "reusing" existing formatted blocks with a reserved format keyword (which might be a "symbol" instead of Latin text)? Instead of cpp one would use perhaps > or # or whatever, and done.

dumblob avatar Aug 01 '22 08:08 dumblob

I needed a way to load configuration into my Pandoc filters without needing to "revert" Pandoc metadata trees to "plain" data. I couldn't use an existing YAML parser and soon despaired about writing a parser for full YAML. What I did succeed in writing was a parser for a basic subset of flow-style YAML without unquoted strings and tags, so basically JSON plus YAML-style hex integers and single and double quoted strings, giving the main advantages of YAML over JSON (including getting rid of the odious surrogate pair escapes!) without the significant indentation. The lpeg/re grammar isn't terribly large (copied from my Moonscript file):

yson_re = re.compile [===[ -- @start-re
  input <- value / not_a_value
  value <- (
      %s*
      ( string
      / number
      / array
      / object
      / 'true'  -> true
      / 'false' -> false
      / 'null'  -> null
      )
      %s*
    / %s* not_a_value
    )
  string <- ( single / double )
  single <- {| "'" ( { [^']+ } / "'" { "'" } )* "'" |} -> concat
  double <- {|
      '"' (
        { [^"\]+ }
      / { '\' ["\/bfnrt0aveN_LP %t] } -> esc
      / ( '\x' { %x^2 } -> hex_char
        / '\u' { %x^4 } -> hex_char
        / '\U' { %x^8 } -> hex_char
        )
      / bad_esc
      )* '"'
    |} -> concat
  number <- {
      '-'?
      ( '0x' %x+
      / ( '0' / [1-9] %d* )
        ( '.' %d+ )?
        ( [eE] [-+]? %d+ )?
      )
    } -> tonumber
  object <- {| '{' %s* '}' / '{' kv ( ',' kv )* '}' |} -> object
  kv <- {|
      ( %s* !string not_a_key )?
      %s* {:k: string :} %s* ( !':' bad )? ':'
      %s* {:v: value :} %s* ( ![,}] bad )?
      ( &',' !( ',' %s* string ) bad )?
    |}
  array <- {|
      '[' %s* ']'
    / '[' {| value |} ( ![],] bad )?
      ( ',' {| value |} ( ![],] bad )? )* ']'
    |} -> array
  bad_esc <- {|
      {:pos: {} :}
      {:msg: { '\' . } -> 'Unknown or invalid escape "%1"' :}
    |} => fail
  not_a_key <- {|
      %s* {:pos: {} :}
      {:msg: { %S* } -> 'Expected key (string) near "%1"' :}
    |} => fail
  not_a_value <- {|
      %s* {:pos: {} :}
      {:msg: { %S* } -> 'Expected value near "%1"' :}
    |} => fail
  bad <- {|
      %s* {:pos: {} :}
      {:msg: { %S } -> 'Unexpected "%1"' :}
    |} => fail
-- @stop-re ]===],

bpj avatar Aug 01 '22 09:08 bpj
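To make the string rules in the grammar above more concrete, here is a minimal Python sketch of the unescaping step for single- and double-quoted strings. It is an illustration, not a port of the MoonScript code: the names `unescape_single` and `unescape_double` are hypothetical, and the table of simple escapes is an assumed subset of the characters the grammar accepts. It does show the point about avoiding surrogate pairs, since `\U` takes a full code point directly.

```python
import re

# Assumed subset of the simple escapes listed in the grammar's `double` rule.
_SIMPLE = {'"': '"', "\\": "\\", "/": "/", "b": "\b", "f": "\f",
           "n": "\n", "r": "\r", "t": "\t", "0": "\0"}

# \xHH, \uHHHH, \UHHHHHHHH, or a backslash followed by any single character.
_ESCAPE = re.compile(
    r"\\(?:x([0-9A-Fa-f]{2})|u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(.))",
    re.DOTALL,
)


def unescape_double(body):
    """Resolve escapes in the body of a double-quoted string (quotes removed)."""
    def replace(match):
        hex_digits = match.group(1) or match.group(2) or match.group(3)
        if hex_digits:
            return chr(int(hex_digits, 16))
        char = match.group(4)
        if char not in _SIMPLE:
            raise ValueError(f'Unknown or invalid escape "\\{char}"')
        return _SIMPLE[char]
    return _ESCAPE.sub(replace, body)


def unescape_single(body):
    """In single-quoted strings the only escape is a doubled quote."""
    return body.replace("''", "'")
```

For example, `unescape_double(r"caf\xe9")` yields `café`, and `unescape_single("it''s")` yields `it's`.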

reStructuredText does something interesting here. They re-use definition list syntax; when a definition list occurs right after the document title, it is interpreted as metadata (IIRC).

Nice thing is that we already have a nice readable syntax for that.

jgm avatar Aug 01 '22 17:08 jgm

@jgm wrote:

reStructuredText does something interesting here. They re-use definition list syntax;

Nice!

when a definition list occurs right after the document title, it is interpreted as metadata (IIRC).

Can't say I like that, since it should be legal to place a definition list as the first thing in the document, so some delimiter (three or more of some punctuation character!) would seem in order. Are ~~~ or +++ taken?

Nice thing is that we already have a nice readable syntax for that.

Yes. Some questions:

-   Would multiple "definitions" become a list?

-   and a nested "definition list" a nested mapping?

-   Would values be verbatim or be parsed as markup? If the latter there should IMO

1.  be a way to mark a value as a raw string, maybe

: raw

`This is a simple raw string`{=}

: more raw

```=
This is a multi-line
raw string
```
2.  be a way to mark a nested definition list as an actual definition list in the value, maybe by giving it an attribute block, which may contain just a comment.

-   Would/could a bibliography (cf. #32) be included in metadata? I think it should also use definition list syntax but have its own block delimiter (maybe @@@ if @ marks a reference as such).

-   Might it be possible to store values which look like numbers as numbers? In Lua terms val = tonumber(val) or val.

-   Might it be possible to have metadata contain booleans, and if so how would they be represented?

bpj avatar Aug 02 '22 17:08 bpj
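The `val = tonumber(val) or val` idea from the comment above might look roughly like this in Python. This is a sketch under assumptions: the function name `coerce_scalar` is hypothetical, and the number pattern (no leading zeros, no hex) is one possible choice, not something djot specifies. Anything that does not look like a number, such as a date, is returned unchanged.

```python
import re

# Assumed number shape: optional sign, integer part without leading zeros,
# optional fraction, optional exponent.
_NUMBER = re.compile(r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?")


def coerce_scalar(text):
    """Return an int or float if text looks like a number, else text itself."""
    if not _NUMBER.fullmatch(text):
        return text
    if re.fullmatch(r"-?[0-9]+", text):
        return int(text)
    return float(text)
```

With this rule `"42"` becomes the integer 42, while `"2022-11-03"` and `"007"` stay strings, which is probably what an author of a date or an identifier would expect.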

Are ~~~ or +++ taken?

~~~ currently works as a delimiter for code blocks.

uvtc avatar Aug 02 '22 19:08 uvtc

Are ~~~ or +++ taken?

I think the +++ would work well for metadata. It's a good punctuation character to use for a fence. It's not terribly pretty, but that's ok since metadata blocks are not terribly common, and should probably draw attention when they are present. And the + sign makes me think of something that's being added (here, the metadata).

uvtc avatar Aug 05 '22 05:08 uvtc

my current (Makefile‐based) workflow involves cat-ing a number of YAML files onto the front of a Markdown document prior to it being read in by Pandoc. i’m not too attached to YAML as a format, but it would be nice to support append‐only solutions for providing metadata (i.e., ones which don’t require any processing of the file itself). this means:

  • the metadata can be included directly at the beginning or ending of the file (at least one of the two; ideally both)

  • a file which already has metadata can have more metadata appended, with conflicting terms resolved somehow

nested metadata is useful in my experience for namespacing, although

foo:
  bar: etaoin
  baz: shrdlu

can usually be represented as

foo-bar: etaoin
foo-baz: shrdlu

supporting lists/arrays is more important, as they are more difficult to represent through alternate means

marrus-sh avatar Aug 14 '22 20:08 marrus-sh
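The append-only workflow described above leaves open how conflicting terms get resolved. One possible policy, sketched in Python: flatten nested keys into hyphen-joined names (as in the `foo-bar` example), let later scalar values override earlier ones, and concatenate lists. The function names `flatten` and `merge_blocks` and the merge policy itself are assumptions for illustration, not a proposal from the thread.

```python
def flatten(mapping, prefix=""):
    """Flatten nested dicts: {'foo': {'bar': 1}} becomes {'foo-bar': 1}."""
    flat = {}
    for key, value in mapping.items():
        name = f"{prefix}-{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def merge_blocks(blocks):
    """Merge metadata blocks in document order.

    Assumed policy: lists are concatenated, any other value is replaced
    by the one that appears later.
    """
    merged = {}
    for block in blocks:
        for key, value in flatten(block).items():
            if isinstance(value, list) and isinstance(merged.get(key), list):
                merged[key] = merged[key] + value
            else:
                merged[key] = value
    return merged
```

Under this policy a block appended at the end of a file can override a single field without the earlier blocks being touched.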

In case there is still doubt about the topic, I am highly in favor of document metadata being within the document.

Is there some reason that the comment character could not be co-opted to serve as a docstring for metadata? Comment block at the start of the document can contain whatever syntax is chosen to define key:values. In Rust, a // is a standard comment, but a /// notes a docstring, giving a cheap way to detect it. Then again, I believe many sins have been committed by utilizing comment blocks for data.

Anyway, big fan of the project, and I am waiting on the sidelines for the eventual release.

dbready avatar Nov 03 '22 03:11 dbready

Another option -- we already have syntax to associate arbitrary metadata with elements: attribute {.foo #bar baz="quux"} syntax. We just don't have a nice way to attach that to the document as a whole, but I think we can do something like "if the doc starts with attributes and they are followed by a blank line, the attributes belong to the document's node":

{
  author="matklad"
  date="2022-11-03"
}

# Consider using Djot for your next presentation

matklad avatar Nov 03 '22 11:11 matklad
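The rule proposed above ("if the doc starts with attributes and they are followed by a blank line") can be sketched in Python as follows. This is a rough illustration only: the function name `split_document_attributes` is hypothetical, only `key="value"` pairs are handled (no `.class` or `#id`), and a `}` followed by a blank line inside a quoted value would confuse it. Real djot attribute parsing is richer than this.

```python
import re

# A leading {...} block followed by a blank line.
_BLOCK = re.compile(r"\A\{(.*?)\}[ \t]*\n[ \t]*\n", re.DOTALL)
# key="value" pairs, allowing backslash escapes inside the value.
_PAIR = re.compile(r'([A-Za-z_:][\w:.-]*)="((?:[^"\\]|\\.)*)"')


def split_document_attributes(source):
    """Return (attributes, remaining_source) for a document string."""
    match = _BLOCK.match(source)
    if match is None:
        return {}, source
    attrs = {
        key: re.sub(r"\\(.)", r"\1", value)
        for key, value in _PAIR.findall(match.group(1))
    }
    return attrs, source[match.end():]
```

Applied to the example above, this would give `{"author": "matklad", "date": "2022-11-03"}` as the document attributes and leave the heading as the start of the body. All values come out as plain strings.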

{
  author="matklad"
  date="2022-11-03"
}

One beautiful thing about this is that (with the addition of a single comma) it's a valid Lua table. Not that that matters. But I suggested a metadata format like this on markdown-discuss 15 years ago.

However, I think it's important to consider what types of data will go into the metadata fields. Our attributes are just strings. But string content isn't adequate for metadata. E.g., titles will often contain formatting like emphasis, and abstracts can even contain paragraphs and lists.

jgm avatar Nov 03 '22 15:11 jgm

E.g., titles will often contain formatting like emphasis, and abstracts can even contain paragraphs and lists.

My gut response here would be to leave these kinds of metadata to the processors. Eg,

# Title With _Inlines_

::: abstract

some table or what not

::: 

and let the specific renderer interpret abstract as metadata, and pull the title in there as well.

matklad avatar Nov 03 '22 16:11 matklad

One beautiful thing about this is that (with the addition of a single comma) it's a valid Lua table

I love having a way to serialize data without a new bespoke syntax. One nice thing about Markdown documents that embed YAML/TOML in the preface is that I can easily read/export that format without a new parser. Lua tables (with nil) feels great.

dbready avatar Nov 03 '22 17:11 dbready

I like the idea about using attribute syntax a lot, but less so the idea that it be a Lua table. Would that mean that Lua escapes are legal in the string? I assume \<punct> escapes are already legal in attributes, while Lua only supports \" \' \\, and what about \n and the like? In fact Lua table syntax isn't all that portable: you do need e.g. a JSON library to exchange data with other languages.

bpj avatar Nov 03 '22 17:11 bpj

Nobody wants to put an abstract into something like a JSON string, escaping newlines etc. One nice thing about a Lua table is that you actually could do

{
  abstract = [[This is my
abstract.

It has multiple paragraphs.]]
}

jgm avatar Nov 03 '22 18:11 jgm
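For comparison, here is what the same abstract looks like once it is forced into a JSON string. This short Python snippet uses only the standard `json` module; the variable names are illustrative.

```python
import json

abstract = "This is my\nabstract.\n\nIt has multiple paragraphs."

# Serializing turns every line break into a two-character \n escape.
encoded = json.dumps({"abstract": abstract})
print(encoded)

# Decoding restores the original line breaks.
assert json.loads(encoded)["abstract"] == abstract
```

The encoded form is a single line in which the paragraph break appears as `\n\n`, which is the kind of thing an author would not want to type or read by hand.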

Heya, just my two cents 😅

I think it might be helpful to compare a representative "in the wild" Markdown front-matter.

I feel YAML is certainly the most "readable", but this obviously comes with the unfortunate over-complexities for parsing. Perhaps a subset of YAML would be nice, removing some of the more problematic features, as in https://hitchdev.com/strictyaml/features-removed/ 🤔

YAML

version: 1
title: My Document
author:
- name: Author One
  affiliation: University of Somewhere
- name: Author Two
  affiliation: University of Nowhere
abstract: |
  This is my very,
  very, very, long abstract...
toc: true
format: 
  html: 
    # some comment ...
    code-fold: true
    html-math-method: katex
  pdf: 
    geometry: 
    - top=30mm
    - left=20mm

TOML

version = 1
title = "My Document"
abstract = """This is my very,
very, very, long abstract...
"""
toc = true

[[author]]
name = "Author One"
affiliation = "University of Somewhere"

[[author]]
name = "Author Two"
affiliation = "University of Nowhere"

[format.html]
# some comment ...
code-fold = true
html-math-method = "katex"

[format.pdf]
geometry = [ "top=30mm", "left=20mm" ]

Lua Table

{
  version = 1,
  title = "My Document",
  author = {
    {
      name = "Author One",
      affiliation = "University of Somewhere"
    },
    {
      name = "Author Two",
      affiliation = "University of Nowhere"
    }
  },
  abstract = [[
This is my very,
very, very, long abstract...
]] ,
  toc = true,
  format = {
    html = {
      -- some comment...
      ["code-fold"] = true,
      ["html-math-method"] = "katex"
    },
    pdf = {
      geometry = { "top=30mm", "left=20mm" }
    }
  }
}

JSON

(no comments allowed)

{
  "version": 1,
  "title": "My Document",
  "abstract": "This is my very,\nvery, very, long abstract...\n",
  "toc": true,
  "author": [
    {
      "name": "Author One",
      "affiliation": "University of Somewhere"
    },
    {
      "name": "Author Two",
      "affiliation": "University of Nowhere"
    }
  ],
  "format": {
    "html": {
      "code-fold": true,
      "html-math-method": "katex"
    },
    "pdf": {
      "geometry": [
        "top=30mm",
        "left=20mm"
      ]
    }
  }
}

chrisjsewell avatar Nov 10 '22 16:11 chrisjsewell

If leaning on an existing format, the chief benefit is being able to read/write document metadata without a bespoke parser. Is StrictYAML codified where this would be an option in other languages? Similar problem for JSON – I think supporting comments should be a goal, but most JSON parsers do not support a comment syntax. Perhaps JSON5 is standardized enough to be considered?

Then again, djot is an entirely new format which already requires a custom parser, but it would be nice to get the metadata formatting for free.

dbready avatar Nov 12 '22 21:11 dbready

Nobody wants to put an abstract into something like a JSON string, escaping newlines etc. One nice thing about a Lua table is that you actually could do

{
  abstract = [[This is my
abstract.

It has multiple paragraphs.]]
}

If the metadata is a Lua table, would the parser be able to evaluate functions within it? If so, this might be a great feature for things like datetime or time-based UUIDs. I use markdown + YAML a lot for zettelkasten notes and academic writing (with pandoc); functional metadata can really extend a text file's use cases.

Also, I just stumbled on this project a few days ago and love its potential and vision! Keep up the awesome work!

mcookly avatar Dec 09 '22 21:12 mcookly

  1. I do not like the idea of executable code in the document. Use cases of that nature seem more appropriate to an extension mechanism. If someone wants to embed a block of code in the front-matter and evaluate it, that should be possible, but not the default.
  2. While Lua is the current implementation and being discussed as a serialization format, I do not expect Lua semantics to carry through. That is, would a Python/Javascript/Rust djot parser have to embed Lua so as to properly render a document?

dbready avatar Dec 09 '22 23:12 dbready

I do not like the idea of executable code in the document.

Me neither, at least not by default. It might be somewhat less scary if executed in a custom environment insulated from the file system, but that might be severely limiting when you cannot load modules. An alternative might be a custom variable interpolation or even template system with limited capabilities. I have written such a processor for MoonScript/Lua but it uses Lpeg/re and as such is not appropriate for djot. Before Pandoc included lpeg/re in its Lua API I had written a parser in pure MoonScript/Lua but it was a lot of code: 700+ lines, a whole parser implementation of its own. With lpeg/re I'm down to about 300 lines not counting what is done by the lpeg/re modules, which still is at the upper bound for what I'm comfortable with inlining into a Pandoc filter. That includes a mechanism for pluggable functions and some default functions, which make up around a third of the code. I usually add around 20-60 lines of extra functions and variable data, and that's a MoonScript class, so I'm back at some 700 lines of Lua code, plus dependency on lpeg/re.

bpj avatar Dec 10 '22 10:12 bpj

Leaving executable code as an extension makes sense. And if djot's parsers are moving away from lua as @dbready mentioned, embedding lua just to read metadata seems extraneous. I don't think any of the other common metadata formats allow for code execution natively, and they probably prevent this for good reason.

If metadata code execution is left to the program, then you can just pass in code through the program's custom metadata field, like pandoc's header-includes. And if djot will be adding its own native serialization format, I assume it could allow passing in code blocks / inline code through the metadata. Either way, code is not directly executed when rendering the document.

mcookly avatar Dec 10 '22 13:12 mcookly

There's also Hjson which looks like this:

version: 1
title: My Document
abstract:
  '''
  This is my very,
  very, very, long abstract...

  '''
toc: true
author: [
  {
    name: Author One
    affiliation: University of Somewhere
  },
  {
    name: Author Two
    affiliation: University of Nowhere
  }
],
format: {
  # some comment ...
  html: {
    code-fold: true,
    html-math-method: katex
  },
  pdf: {
    geometry: [
      "top=30mm", "left=20mm"
    ]
  }
}

It's basically json, but it doesn't require quoting keys and it has comments and nice multi-line strings.

tmke8 avatar Jan 07 '23 14:01 tmke8

Another potential choice is NestedText. It's designed to be simple to parse yet still human-readable (based on YAML). Here's an example:

version: 1
title: My Document
abstract:
  > This is my very,
  > very, very, long abstract...
toc: true
author:
  -
    name: Author One
    affiliation: University of Somewhere
  -
    name: Author Two
    affiliation: University of Nowhere
format:
  # Some comment ...
  html:
    code-fold: true
    html-math-method: katex
  pdf:
    geometry: [ "top=30mm", "left=20mm" ]

It only has three types: dictionaries, lists, and strings. There's even a more simplified version.

mcookly avatar Jan 09 '23 04:01 mcookly

Trying to think more holistically, an eventual goal of this markup is that non-programmers could adopt it in various places: blogs, academic papers, forums, etc. In which case, using an existing JSON/YAML/TOML format is a disadvantage: for a layman, it becomes a bespoke “header metadata” format different from the rest of the djot markup.

From the angle of minimizing language size, I am in favor of matklad’s suggestion to use the existing djot attribute syntax. Less for a user to learn and easier to implement a parser.

dbready avatar Jan 09 '23 07:01 dbready

If existing djot syntax is to be used, which I think is a good idea, it is best to use definition/(un)ordered list syntax so that hierarchical structures are possible, for example multiple authors as a bullet list and the name/affiliation/email of each as a definition list.

bpj avatar Jan 09 '23 09:01 bpj

I'm very much in favour of metadata in djot documents. In pandoc I use title, author, date, and lang nearly everywhere. Often I add references local to one document (visited web pages).

My two cents (and sort of mentioned elsewhere): I suspect native definition lists will do, possibly wrapped inside a meta (or perhaps even djot?) div:

::: meta
title
:  Title of document
author
:  Author A
:  Author B
:::

When using a designated div type (like meta above) it will be possible to not only add a metadata block at the top of the document but also add meta data in later parts of the documents (perhaps, again, the citation information of a visited web site).

ffel avatar Jan 10 '23 20:01 ffel

This probably doesn't affect what you want to say, but that isn't djot definition list syntax!

jgm avatar Jan 10 '23 22:01 jgm

Yes that is one of my biggest issues with Pandoc. I like the idea of templates, but from a non programmer's perspective I never got into templates, so including metadata directly into documents is much appreciated here.

tbdalgaard avatar Jan 11 '23 20:01 tbdalgaard

@tbdalgaard templates in Pandoc have to do with metadata only in as much as you can access metadata values from templates, but you can notably also access metadata from filters and use that to either insert metadata into documents or to configure filters. The original way to define metadata in Pandoc was through YAML blocks in the document body. Later we got the --metadata-file=YAMLFILE option and later still the metadata: section in defaults files. I'm not sure how getting metadata into Pandoc with metadata files/default files works with the new djot.js/JSON workflow. Hopefully it works. @jgm?

bpj avatar Jan 11 '23 21:01 bpj