odoc
odoc copied to clipboard
Add a marshalled output for index generation
This PR adds a new marshalled output for the generation of index files.
This is desirable for several reasons:
- It allows to combine several pre-built indexes, making the index generation more incremental
- It can be read by other OCaml tools using odoc as library, for instance when building a search engine index (cc @EmileTrotignon). #1022 is also relevant.
The first commit fixes a bug in the computation of the ID of the page containing a standalone comment. The second commit is the actual change.
There are some design to discuss:
- This PR adds a
--marshall
flag to thecompile-index
: by default the output is json, but it is marshalled with this flag. - The files containing marshelled indexes have to be prefixed by
index-
and have to have the.odoc
extension. - Currently, the format is a hashtable. I think it would be better to have a better datastructure, similar to the one to contain occurrences.
- The complexity when merging indexes many times is not ideal. For instance the following commands would be in O(n^2)
$ odoc compile-index --marshall -o index-all1.odoc a1.odocl
$ odoc compile-index --marshall -o index-all2.odoc a2.odocl index-a1.odoc
$ odoc compile-index --marshall -o index-all3.odoc a3.odocl index-a2.odoc
$ odoc compile-index --marshall -o index-all4.odoc a4.odocl index-a3.odoc
$ odoc compile-index --marshall -o index-all5.odoc a5.odocl index-a4.odoc
...
$ odoc compile-index --marshall -o index-all<n>.odoc a<n>.odocl index-a<n-1>.odoc
$
But I think that is ok.
I do not like the index-<name>.odoc
syntax. It is unusual to express file type with a prefix, and it might break scripts/makefiles that use *.odoc
. .odoci
or .odoc_index
would be better.
I must admit I'm quite confused at the difference between *.odoc and *.odocl (are there others) ?
@Drup
.odoc
represent the data of a single page. .odocl
also, but the page has been linked with others so that it is possible to resolve references.
In term of command line, you need a single module (cmti ?) to create a .odoc
, but you need every .odoc to create a .odocl
.
There are no others except the one in this proposal.
I think the idea is that .odoc
files are the result of the compile
phase (the odoc compile
command) while .odocl
files are the result of the link
phase (odoc link
). Those phases and the dependencies inside them is not completely defined in odoc, but we are working on this.
In addition to the compile
command (which create files corresponding to modules and standalone pages), several other commands generate .odoc
files:
-
odoc compile-src
produces a.odoc
file (containing the required implementation info to render source code) -
odoc count-occurrences
produces a.odoc
file (containing a marshalled table mapping ids to number of occurrences) -
odoc source-tree
produces a.odoc
file (containing a source tree used to generate html navigation of directories when rendering source code)
Implementation and source trees are produced during the compile phase, and are meant to be linked. Contrary to this, occurrences and indexes are generated after the link phase, so I definitely agree that their extension should not be odoc
nor odocl
.
About the index-<name>.odoc
syntax: it is not that unusal, at least in odoc:
- pages must be named
page-<name>.odoc
- implementations must be named
src-<name>.odoc
- occurrence tables must be named
occurrences-<name>.odoc
- source trees must be named
srctree-<name>.odoc
So, I think we have to options:
- Define a single extension for "post-link" files such as occurrences and indexes, and use prefix to distinguish them. For instance
occurrences-<name>.odoca
,index-<name>.odoca
andtoc-<name>.odoca
(table of contents that will be introduced in the future might need a post-link file) - Define one extension per "post-link" files:
<name>.odoc-occurrences
,<name>.odoc-index
,<name>.odoc-toc
.
I would prefer the first option.
About the index-
.odoc syntax: it is not that unusal, at least in odoc:
I know it is quite common in odoc, but at least for files names I really do not like it because it is a weird convention that fails to communicate its intent, so this requires knowledge of odoc to understand. It might also restrict names available for users, in a way that will be unexpected : no one will try to name a page .index
(or at least they will understand that it might have meaning), but index-
might happen.
I also do not like that file with the same extension are in fact of a different "type". I think you should be able to trust that the extension tells you what kind of file a file is, and in (1), all the .odoca
have completely different (binary) content and they cannot be used interchangeably at all.
Regarding implementation and sources, it seems to me that they are used in the same context (linking) and as such can share an extension. Their content could even be of the same ocaml type if we wanted to (maybe they are ?).
I deeply favor (2), even though I understand that it requires changes out of the scope of this PR.
The only benefits for (1) I found in your message is the fact it is done like that in other places in odoc, but as it seems insufficient with regard to the drawbacks expressed above. Maybe there are other benefits I did not consider ?
The interface changes from the first commit are fine, sherlodoc only uses Fold.unit and Fold.page
I'm not decided on which is better, but let me still answer to some of you comments
this requires knowledge of odoc
Using odoc
requires knowledge of odoc
. Only odoc drivers are expected to call the command line. So I don't think this is such a big deal!
no one will try to name a page .index (or at least they will understand that it might have meaning), but index- might happen
This is not a problem at all. I don't think there can be any issue:
- a compilation unit named
<name>
cannot have-
in it - Other odoc produces artifact are prefixed: for instance, the compilation of
index-bli.mld
will generatepage-index-bli.mld
and that would not cause problems
I also do not like that file with the same extension are in fact of a different "type". I think you should be able to trust that the extension tells you what kind of file a file is, and in (1), all the .odoca have completely different (binary) content and they cannot be used interchangeably at all.
odoc
files have completely different binary content, and cannot be used interchangeably: for instance, only pages can be passed as parents.
Similarly, json files can have completely different content and not be used interchangeably, but it still makes sense to give them the same extension.
So:
-
index-<name>.odoca
,occurrences-<name>.odoca
:- Makes it explicit which phase those files belong to:
.odoca
files belong to the "post-link" phase, which takes.odocl
files as input. - Consistent with what exists in the compile (
.odoc
) and link (.odocl
) phase
- Makes it explicit which phase those files belong to:
-
<name>.odoc-index
,<name>.odoc-occurrences
- @EmileTrotignon could you tell what what are the main benefits, considering what I wrote above?
The interface changes from the first commit are fine, sherlodoc only uses Fold.unit and Fold.page
(sherlodoc uses entries_of_item
, which is also changed by the bugfix.)
(sherlodoc uses entries_of_item, which is also changed by the bugfix.) I had missed that, the change still looks fine.
Using odoc requires knowledge of odoc. Only odoc drivers are expected to call the command line. So I don't think this is such a big deal!
Yes, but it that case, we make up our own convention when a perfectly fine one already exists, so even though it's not unreasonable to ask people to learn things, my impression is that here, we are making it harder when its not necessary.
Other odoc produces artifact are prefixed: for instance, the compilation of index-bli.mld will generate page-index-bli.mld and that would not cause problems
You are right that it will not cause conflicts, but it could be confusing for people who might be trying to understand what is happening looking at _build
, for instance if you also had bli.mld
. Having a bli.odoc-index
would be immediately understood for what it is.
odoc files have completely different binary content, and cannot be used interchangeably: for instance, only pages can be passed as parents. Similarly, json files can have completely different content and not be used interchangeably, but it still makes sense to give them the same extension.
Yes, I do not mean that they should be used completely interchangeably, this is not possible if two file have different content, but that there exists a context where they are all accepted. This is the case for json (you can always pass a json file to jq), and I hope it is for .odoc
(a .odoc
is always valid as something to be linked against ? It can always be used to generate and odocl ?).
There is no such context for occurrences and index files, the only thing they have in common is being produced at link time, which is not the usual meaning of a file extension.
Regarding consistency, I think that this convention is more appropriate for src-
and page-
because they have shared uses, which is not the case for index-
and occurrences-
. (This could reformulated into : the meaning .odoc
is not "the output of the compile phase" but "the input of the link phase", and there would be no such meaning for ".odoca")
To me, the main benefit of option (2) is that it makes it clear what you can do with the file, in a way that is understandable with minimal friction. This feel like a way more important information about a file than when it was produced. This is also more important because these files will used by tools external to odoc.
I think I have expressed myself fully, maybe a third person could take the final decision if you still disagree after reading this. .odoca
is already a lot better than just .odoc
.
I have read the diff of the second commit and it looks good to me aside from a very small nitpick.
This has been overhauled significantly, and a re-review would be appropriate.
The extension for index files is now .odoc-index, and you give .odocl files with odoc compile-index --include-rec my_dir
.
The CI is not passing due to the new dependency: https://ocaml.ci.dev/github/ocaml/odoc/commit/f0acee4b79947a0a4ff1643bc675e877045b98a3/variant/debian-12-5.0_opam-2.1
Otherwise, this is good to merge !
I opened an issue on ocaml-ci : https://github.com/ocurrent/ocaml-ci/issues/954
As soon as it is solved, we should reintroduce the driver's dependency on sherlodoc.