
[WIP] Extract code blocks by ID

Open rizo opened this issue 6 years ago • 41 comments

This is a work in progress attempt to implement testable code examples for odoc. Or to be more precise, the extraction of annotated code blocks from mli and mld files. See https://github.com/ocaml/odoc/issues/130 for a proposed design and discussion. I might work on tools complementary to odoc to facilitate the execution of the examples and the integration with dune.

rizo avatar Feb 12 '19 18:02 rizo

There's some facility like this that got built for the Real World OCaml book, isn't there?

pmetzger avatar Feb 12 '19 22:02 pmetzger

So is the idea that odoc would extract the code samples and then some other mdx-like tool would call this feature of odoc as part of its processing? That could work, although mostly odoc works on .cmt files whilst I would expect that tool to work directly on mli/mld files. Also not all comments have an associated id, so it's not clear how you'd want to deal with those.

Maybe you could outline how you see this working to give a clearer picture of how this part will be used?

lpw25 avatar Feb 13 '19 07:02 lpw25

That could work, although mostly odoc works on .cmt files whilst I would expect that tool to work directly on mli/mld files.

.mli files are subject to pre-processing. That's the reason why we have cmti files, no?

Maybe you could outline how you see this working to give a clearer picture of how this part will be used?

I don't know what @rizo had in mind but following the link he provided leads to this

dbuenzli avatar Feb 13 '19 08:02 dbuenzli

.mli files are subject to pre-processing. That's the reason why we have cmti files, no?

Sure, and I would expect that a tool to run the examples in them would need to perform such pre-processing. Certainly similar tools like toplevel_expect_test do. Since we already need to know the various compilation options in order to run the examples that does not seem unreasonable.

Also I would expect such a tool to work like other "expect test" style tools, i.e. to produce an updated version of the mli file, so that we can use dune's promote workflow -- that clearly requires taking the mli file as input.

I don't know what @rizo had in mind but following the link he provided leads to this

There are a variety of suggestions on the thread @rizo linked so I wasn't sure which suggestion was being referred to. It does look like he has the comment you point to in mind. That comment though does not constitute a full proposal. It is not clear to me how you would use those commands to build a useful workflow.

lpw25 avatar Feb 13 '19 09:02 lpw25

I should say, though, that I think any proposal is going to require the syntax to allow code blocks to be labelled in some way -- so that part seems useful regardless of how we build the surrounding tooling.

lpw25 avatar Feb 13 '19 09:02 lpw25

That comment though does not constitute a full proposal. It is not clear to me how you would use those commands to build a useful workflow.

I'm not sure what you are referring to. Basically with the proposal you can see a .cmti or .mld file as a tar archive of text files that can be extracted. The rest is yours.

To give an example say you have this example.mld file:


{0 Examples}

The hello world program goes like this:

{hello.ml[
let () = print_endline "Hello"
]}

You may not need this data:

{data.json[
  [1; 2]
]}

Then odoc extract --list-id example.mld outputs:

hello.ml
data.json

Your build system can then pick up these file names to invoke:

odoc extract --id hello.ml -o hello.ml example.mld
odoc extract --id data.json -o data.json example.mld

At which point you may instruct your build system to do something with the generated hello.ml file.
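
With dune, for example, that could be a couple of rules along these lines. This is only a sketch: odoc extract is the command proposed here, not something odoc ships today, and the dune parts (rule, targets, executable) are the standard stanzas.

; Sketch only: "odoc extract" is the proposed command, not an existing subcommand.
(rule
 (targets hello.ml)
 (deps example.mld)
 (action (run odoc extract --id hello.ml -o hello.ml example.mld)))

(executable
 (name hello)
 (modules hello))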

dbuenzli avatar Feb 13 '19 09:02 dbuenzli

Sure, but how do I map the results of using hello.ml back to the example.mld file afterwards? We lose the mapping between the comments and the code so you can't use this to implement an "expect test" style workflow -- which is I think the most desirable workflow for this kind of thing.

lpw25 avatar Feb 13 '19 11:02 lpw25

Sure, but how do I map the results of using hello.ml back to the example.mld file afterwards?

One can add an option to extract to emit linenum directives in the output.

We lose the mapping between the comments and the code so you can't use this to implement an "expect test" style workflow -- which is I think the most desirable workflow for this kind of thing.

I don't know exactly what this is but wouldn't the above suffice?

dbuenzli avatar Feb 13 '19 11:02 dbuenzli

Perhaps I should clarify what I mean by an "expect test" workflow. I think that one of the most useful aspects of being able to run examples from documentation is to be able to write things like:

{[
    # let foo x y = y;;
    val foo : 'a -> 'b -> 'b = <fun>
    # foo 1 3;;
    - : int = 3
]}

and have a tool which checks that the results of running that code in the toplevel match what I've written there. In an "expect test" style workflow this is achieved by having the tool print a new version of the .mli/.mld file with the actual generated output. Dune has special support for such tools -- it will show the diff between the actual and expected output as an error and you can run dune promote to copy the actual output over the original file.
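
On the dune side this would look roughly like the following. This is only a sketch: my-doc-example-runner is a placeholder for whatever tool emits the corrected file, and the rule+alias form assumes a recent dune; only the diff/promote machinery itself is standard.

; Sketch: my-doc-example-runner is a hypothetical tool producing the corrected file.
(rule
 (targets foo.mli.corrected)
 (deps foo.mli)
 (action
  (with-stdout-to foo.mli.corrected
   (run my-doc-example-runner foo.mli))))

; On mismatch, dune shows the diff as an error and "dune promote"
; copies foo.mli.corrected over foo.mli.
(rule
 (alias runtest)
 (action (diff foo.mli foo.mli.corrected)))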

lpw25 avatar Feb 13 '19 11:02 lpw25

As examples, this is how the mdx and toplevel_expect_test tools work.

lpw25 avatar Feb 13 '19 11:02 lpw25

Well I don't expect odoc to handle the whole promotion business. Aren't #linenum directives surrounding each extracted snippet sufficient for the promotion tool to do its business?

dbuenzli avatar Feb 13 '19 11:02 dbuenzli

Well I don't expect odoc to handle the whole promotion business.

Me neither. I don't think odoc needs to provide support for any of this. As long as its comment parser is available as a library all of these things can easily be implemented as separate tools -- including the feature being suggested here.

lpw25 avatar Feb 13 '19 11:02 lpw25

As long as its comment parser is available as a library all of these things can easily be implemented as separate tools -- including the feature being suggested here.

Maybe, but OTOH odoc is the tool that processes .mld files; I'm not sure there's any gain in introducing yet another tool, especially if the scheme mentioned above is reasonably usable for most workflows.

dbuenzli avatar Feb 13 '19 11:02 dbuenzli

I can see how this proposal would be better than what I'm suggesting for literate programming, is that what you have in mind? In which case, would only having the support for doing it for .mld files be sufficient? That version makes more sense to me.

My other main concern is the semantics of the "id"s. Most systems that have quoting of code fragments like this (markdown, org-mode, etc.) have support for specifying the language of the code fragment along with other data about how the code should be extracted. Specifying the language seems particularly useful for odoc to allow us to correctly do code highlighting in the resulting documentation. I'm worried that we won't easily be able to do that if we merge this patch as is because we'll already have specified a different semantics for the text that appears between { and [.

lpw25 avatar Feb 13 '19 12:02 lpw25

I can see how this proposal would be better than what I'm suggesting for literate programming, is that what you have in mind?

That's one of the workflows, along with .mld-based tutorials. I'm also interested in extracting my code samples from .cmti files and making sure they compile, though. If I take these examples I'm not really interested in topexpecting them, just in extracting them with linenum directives and making sure they actually compile in my dev builds (and potentially installing them without linenum directives as sample code).

My other main concern is the semantics of the "id"s. Most systems that have quoting of code fragments like this (markdown, org-mode, etc.) have support for specifying the language of the code fragment along with other data about how the code should be extracted.

Agreed, I wouldn't be against more structure here. FWIW, in CommonMark the id is an "info string", and it is only suggested that the first word specify the language. See here.

dbuenzli avatar Feb 13 '19 12:02 dbuenzli

If I take these examples I'm not really interested in topexpecting them

The expect test style approach wouldn't give you anything extra, but it would give you what you want in these cases. So if we assume that such support is on the way then I think you could support the extract command only for .mld files and you would have all the things you are after.

lpw25 avatar Feb 13 '19 13:02 lpw25

Not sure I understand the deep reason for objecting to doing this on cmti files as well :-) Presumably it's the same code path.

dbuenzli avatar Feb 13 '19 14:02 dbuenzli

I just want to discourage people from trying to use it to implement the execution of code in comments in mli files since I don't think it is the best way to do it. I'd prefer it if odoc only had commands for things that have a clear use case and where the command is the best approach for addressing that use case.

Whereas the mld version has a clear use case in literate programming and, as you said, odoc is the main tool that understands mld files so it is the logical place to put the feature.

lpw25 avatar Feb 13 '19 15:02 lpw25

I have pushed an early attempt to document this feature (see draft version here). It mostly focuses on user experience and integration with dune. I understand this might be significantly outside odoc's scope but, as I mentioned previously, it is not my intention to exclusively focus on low-level odoc-specific interfaces. It is not clear (to me) how all the discussed details (extraction, execution, line numbers, test promotion, etc.) connect together into something users can use. And that is my focus.

Let me know if it is preferred to move this discussion to dune.

CC @rgrinberg: you might have an opinion about this.


So is the idea that odoc would extract the code samples and then some other mdx-like tool would call this feature of odoc as part of its processing? That could work, although mostly odoc works on .cmt files whilst I would expect that tool to work directly on mli/mld files.

I think both options could be supported. odoc's parser library is already exposed and it can be used to do the extraction.

Also not all comments have an associated id, so its not clear how you'd want to deal with those.

I covered this in the proposal. Essentially all code blocks without an associated id can be extracted as a common "anonymous" group.

Maybe you could outline how you see this working to give a clearer picture of how this part will be used?

I do not at this stage have a final plan for the fully integrated solution. One of the reasons I submitted this PR as WIP was to start this discussion. And, as you mentioned, annotating code blocks is needed in any case.

I started experimenting with mdx and it seems like it could be used to build a tool (I'm calling it odoc-mdx) that would use odoc's parser as a library to extract the code blocks from mli/mld files, execute them and generate .corrected files. I'm hoping dune can be extended to natively support this workflow.


My other main concern is the semantics of the "id"s. Most systems that have quoting of code fragments like this (markdown, org-mode, etc.) have support for specifying the language (...)

This is something I mention in my proposal. Essentially using filenames gives us both: the ability to group code blocks and the language information. It does require odoc to interpret code block annotations as file names, but I think it is a reasonable limitation.

I'm worried that we won't easily be able to do that if we merge this patch as is because we'll already have specified a different semantics for the text that appears between { and [.

By that do you mean "other data about how the code should be extracted"? For OCaml, at least the execution options could be passed as # directives (similar to how toplevel_expect_test works).
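
Purely as an illustration (none of this is existing odoc behaviour, and the runner is assumed to execute the block in a toplevel-like environment where topfind's #require is meaningful), a block could start with directives that the runner interprets as configuration:

{[
#require "str";;   (* interpreted by the hypothetical runner, ignored by odoc *)
let zero_out s = Str.global_replace (Str.regexp "o") "0" s
let () = assert (zero_out "foo" = "f00")
]}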

Agreed, I wouldn't be against more structure here. FWIW, in CommonMark the id is an "info string", and it is only suggested that the first word specify the language.

Mdx actually relies on this. I think we could adopt this model. How would multiple code blocks be grouped for extraction into the same file though?

Maybe something like the following could work? The extraction command would match on the file value.

{ocaml file=hello.ml[ ... ]}
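
Two blocks sharing the same file value would then be concatenated, in document order, into a single hello.ml. Again, this is only a sketch of the proposed annotation, not implemented syntax:

{ocaml file=hello.ml[
let greeting = "Hello, world!"
]}

Some intervening prose.

{ocaml file=hello.ml[
let () = print_endline greeting
]}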

Not sure I understand the deep reason for objecting doing this on cmti files aswell :-)

I just want to discourage people from trying to use it to implement the execution of code in comments in mli files

I'm not sure I understand what this means. Does this only apply to the CLI? Wouldn't the extraction from cmti (or mli) files be required to implement the expect-based workflow?

rizo avatar Feb 13 '19 19:02 rizo

Thanks @rizo for putting together this design. It jibes with what I've been thinking about, though I think there is one key difference.

Should odoc require code block annotations to be filenames with extension?

I don't think so. For an mli file, the workflow I imagined was:

  1. In a doc comment that includes a code example you'd like to execute, you add a small annotation, either {ocaml| or {x| (for "execute" -- we can assume the block is OCaml; perhaps you could have another token to specify the language).

  2. In your dune file in the mli's directory, you specify that you want to execute marked code blocks.

  3. If your code blocks contain toplevel syntax, like

# 1 + 1;;
- : int = 2

then you go through the diff-and-promote workflow. (That is, if the output is incorrect, you generate a .corrected file and allow the user to easily diff against and accept this correction.) There shouldn't be any need to add line number directives -- the tool should figure out where in the mli file this code lives, and should be able to patch in the correction appropriately. (Toplevel_expect_test and mdx both do this.)

Otherwise, if you're not dealing with toplevel-style code, you simply throw a compiler error if the code block doesn't compile.


Note that this workflow is almost exactly the same as processing a markdown file with mdx, except that instead of executing the code in triple-backtick blocks, you're executing the code in {ocaml| or {x| blocks.

For mld files, the flow would be basically identical to mdx, except that instead of the base markup language being Markdown, it would be the markup language used by odoc.

By doing things this way, without filenames, you do give up the ability to write code in one block that references input from files extracted from other code blocks -- but I don't think that's nearly as important as being able to simply run small self-contained examples that use the library you're documenting.

As the user of the tool, I don't want to have to think about the code in my doc comments being extracted out into a separate file where it's then executed; I just want the code executed. If the tool needs to extract my code to separate files under the hood, that's fine, but I don't think that detail should be exposed to the user. The user (at least this user) just wants to say "execute this block" on some blocks.

jsomers avatar Feb 13 '19 21:02 jsomers

@jsomers I don't think what is described here prevents you from implementing a system that does exactly what you want. What is described here is rather the plumbing and a more general mechanism that allows you to implement the system you would like as a special case.

There shouldn't be any need to add line number directives -- the tool should figure out where in the mli file this code lives, and should be able to patch in the correction appropriately.

Extracting with line directives is precisely what will allow the underlying tool to figure out where in the mli/mld file the code you are processing lived and, for example if you compile the extraction, to actually report any compilation error in the original file and location rather than in the extracted file.

dbuenzli avatar Feb 13 '19 22:02 dbuenzli

@dbuenzli

There shouldn't be any need to add line number directives -- the tool should figure out where in the mli file this code lives, and should be able to patch in the correction appropriately.

Extracting with line directives is precisely what will allow the underlying tool to figure out where in the mli/mld file the code you are processing lived and, for example if you compile the extraction, to actually report any compilation error in the original file and location rather than in the extracted file.

Ah, I see -- my worry is that we would require the end user to add line-number directives by adding a special notation. But it sounds like you're saying we wouldn't.

jsomers avatar Feb 13 '19 22:02 jsomers

Ah, I see -- my worry is that we would require the end user to add line-number directives by adding a special notation. But it sounds like you're saying we wouldn't.

No, that would be absolutely terrible :-) Basically the lineno stuff would work as follows. Suppose you have your file.mld:

Bla bla
{ocamlx[
let f x = x
]}
Ho Ho
{ocamlx[ 
let () = print_endliiine "bla.ml"
]}

Invoking odoc extract-code --with-line-numbers --anonymous=ocamlx -o ocamlx.ml file.mld would produce the ocamlx.ml file:

#3 "file.mld" 
let f x = x
#7 "file.mld"
let () = print_endliiine "bla.ml"

Now you can try for yourself to compile that with ocamlc; this is what gets reported:

> ocamlc ocamlx.ml
File "file.mld", line 7, characters 9-24:
Error: Unbound value print_endliiine
Hint: Did you mean print_endline?
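
The extraction itself is little more than a scan that prefixes each block with a line directive. Here is a deliberately naive, self-contained sketch of that transformation; it is not odoc's real comment parser, just a line-based scan that reproduces the shape of the output above:

(* Toy extractor, not odoc's comment parser: a line equal to "{ocamlx["
   opens a block, "]}" closes it, and each block is prefixed with an
   OCaml line directive pointing back into the source file. *)
let extract_with_line_directives file =
  let ic = open_in file in
  let rec loop lineno in_block =
    match input_line ic with
    | exception End_of_file -> close_in ic
    | line ->
      let in_block =
        if String.trim line = "{ocamlx[" then begin
          Printf.printf "#%d %S\n" (lineno + 1) file;  (* code starts on the next line *)
          true
        end
        else if String.trim line = "]}" then false
        else begin
          if in_block then print_endline line;
          in_block
        end
      in
      loop (lineno + 1) in_block
  in
  loop 1 false

let () = extract_with_line_directives "file.mld"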

dbuenzli avatar Feb 13 '19 22:02 dbuenzli

Thanks for the design doc @rizo that makes your plan much clearer. I think the aim is a good one but in general I think that the proposed extract-code command is not the right approach to implement what you are aiming for. From your comments I think you already know what the disadvantages and alternatives are, but I'll make them explicit here anyway:

The need to talk about external files is an unnecessary burden for users. In the common case they simply want to execute all the comments in a single environment. This behaviour should be the default. If they wish to keep some comments separate there should be an optional argument like the env argument in mdx to specify named environments with no requirement that they be unique filenames, something like:

{ocaml env=foo[
...
]}

If they wish not to run a particular comment there should be an optional argument for that too:

{ocaml executed=false[
...
]}

The dune support should simply be something like (execute-examples) in the documentation stanza. Having to manage the extraction of multiple files is an unnecessary burden on the user.

The extract-code command is not sufficient to implement the "promote" aspect robustly. It provides no mechanism to pass arguments about how the code should be run to the actual testing tool -- for example mdx's non-deterministic and version options on code snippets would be useful. Line directives are not a robust means of communicating when blocks start and end -- if I put a line directive in my comment fragments the tool will simply break. To do this kind of thing robustly you would need some quoting mechanism and to use a format other than just an .ml file.

The extract-code command is also not necessary to implement the "promote" aspect. The comment parser is already available and the tool can simply use that on the parsetree of the mli file directly.

I started experimenting with mdx and it seems like it could be used to build a tool (I'm calling it odoc-mdx) that would use odoc's parser as a library to extract the code blocks from mli/mld files, execute them and generate .corrected files. I'm hoping dune can be extended to natively support this workflow.

This is I think a better approach than extract-code and I think it would be better to focus on this.

Maybe something like the following could work? The extraction command would match on the file value.

{ocaml file=hello.ml[ ... ]}

I think that would be better for implementing extract-code. As I said above, extract-code on mld files is a useful command for literate programming and doing it this way will prevent it from getting in the way of the mdx-based alternative.

lpw25 avatar Feb 14 '19 08:02 lpw25

I think that would be better for implementing extract-code. As I said above, extract-code on mld files is a useful command for literate programming and doing it this way will prevent it from getting in the way of the mdx-based alternative.

I'm not exactly sure how something generic like this gets in the way of mdx-based alternatives. Let's separate concerns here:

  1. Basically the {[]} has to be extended in a useful way. So let's agree on the syntax. What's your proposal? Simply a list of uninterpreted tokens? I'm fine with this as well, though we should be careful that such a token allows expressing a file path.

  2. Regarding extract-code if you think it's not useful for you and you can do better using an external tool then by all means go build that tool. Personally I have quite a few uses for it and I would like to see it implemented in some way as proposed here.

dbuenzli avatar Feb 14 '19 10:02 dbuenzli

I'm not exactly sure how something generic like this gets in the way of mdx-based alternatives

Something like that seems fine. I was commenting on the proposal as written in the design doc.

  1. Basically the {[]} has to be extended in a useful way. So let's agree on the syntax. What's your proposal? Simply a list of uninterpreted tokens? I'm fine with this as well, though we should be careful that such a token allows expressing a file path.

I am suggesting an uninterpreted list of tokens, but I'm open to other suggestions. Anything that gives enough flexibility to put fairly arbitrary flags and options is fine.

  2. Regarding extract-code if you think it's not useful for you and you can do better using an external tool then by all means go build that tool. Personally I have quite a few uses for it and I would like to see it implemented in some way as proposed here.

My comments are in regard to the design document @rizo wrote. I think that extract-code is the wrong way to implement what that proposal attempts. As I've said, I think extract-code on mld files makes sense for literate programming, but I've not heard any use cases for extract-code on cmti files. If you have a use case or motivation for it then I would like to know what it is. Without a specific use case in mind how are we supposed to judge whether it is the right feature or not?

lpw25 avatar Feb 14 '19 10:02 lpw25

As I've said, I think extract-code on mld files makes sense for literate programming, but I've not heard any use cases for extract-code on cmti files.

These samples are full executables. They are cut and pasted from the mli to here. I want to extract them once with lineno directives in my dev builds to make sure they compile, and once without lineno directives to install them in the doc directory of the package as sample code.
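
Concretely, in dune that could be a pair of extractions, one per use. This is only a sketch: odoc extract-code and its flags are the hypothetical interface discussed in this thread, and mylib/mylib.mli are placeholder names (the command could equally take the .cmti as input).

; Sketch: "odoc extract-code" is the hypothetical command from this thread.
; Dev-build check: extract with line directives and compile against the library.
(rule
 (targets samples_check.ml)
 (deps mylib.mli)
 (action (run odoc extract-code --with-line-numbers -o samples_check.ml mylib.mli)))

(executable
 (name samples_check)
 (modules samples_check)
 (libraries mylib))

; Install copy: same extraction without line directives, shipped as sample code.
(rule
 (targets sample.ml)
 (deps mylib.mli)
 (action (run odoc extract-code -o sample.ml mylib.mli)))

(install
 (section doc)
 (files sample.ml))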

dbuenzli avatar Feb 14 '19 11:02 dbuenzli

As the user of the tool, I don't want to have to think about the code in my doc comments being extracted out into a separate file where it's then executed; I just want the code executed.

@jsomers Users should not think about any kind of extraction, unless they really need this feature (see the examples @dbuenzli referenced). I will clarify this in the proposal. I agree overall that the default behavior should be to execute all OCaml code blocks without any extra configuration.

rizo avatar Feb 14 '19 11:02 rizo

These samples are full executables.

Thanks, that makes things clearer. So you have some literate programming within the mli files of a library. That's a reasonable use case. My preference would be to have extract-code take the .mli file directly as input, since this should really be an operation on source files rather than installed files. It shouldn't be much harder to implement than the cmti version -- note that dune already takes care of preprocessing the mli file to a (marshalled) .mli file so you don't need to deal with -pp -- but if that is not the case then the .cmti file based version is a reasonable alternative.

lpw25 avatar Feb 14 '19 11:02 lpw25

I agree overall that the default behavior should be to execute all OCaml code blocks without any extra configuration.

I don't think this concerns this actual proposal, but I would rather not do this and instead rely on a special token for that (e.g. ocamlx).

You still need a bit of ordering care when you start executing code blocks. When I don't want to care about this I prefer not having to care... by which I mean that having to specify an exec=false would mean I would have to care. For example, I want to be able to simply add an OCaml code block anywhere in the document without having to think about it or having to define all its free variables so that it can get executed.

For all its awesomeness and shininess, I still think that executing code blocks in API docs remains quite a limited tool as soon as you escape the realm of documenting basic data structures.

dbuenzli avatar Feb 14 '19 11:02 dbuenzli