
Incorporation of TensorStore C++ autosummary extension

Open jbms opened this issue 2 years ago • 39 comments

I also have a C++ autosummary extension for TensorStore that is mostly implemented, although not yet released in the TensorStore repository. I would like to potentially incorporate it into this theme, but wanted to find out whether it might be useful to anyone else. (If not, I'll just maintain it separately within TensorStore.)

It works similarly to the TensorStore Python autosummary extension, in that every entity is documented on a separate page.

At the top level, entities are organized into groups (like doxygen groups) and e.g. a .. cpp-apigen:: group directive is used to insert the summary of the members of a group.

jbms avatar May 03 '22 04:05 jbms

in that every entity is documented on a separate page.

I have been browsing the TensorStore docs, and I found it confusing when I click a cross-reference and it loads a separate page with one member on it.

I'm very interested in this C++ autosummary idea because I haven't found a good alternative to breathe that cuts Doxygen out of the workflow.

2bndy5 avatar May 03 '22 05:05 2bndy5

I agree that for members with just a short description, a separate page doesn't really make sense. I figured that over time the documentation would be expanded and almost everything would have a longer description and examples and such, and therefore justify a separate page.

For example, the documentation for this member is quite long and would not work well without a separate page: https://google.github.io/tensorstore/python/api/tensorstore.TensorStore.write.html

Note: The numpy docs put every function on a separate page.

But I was also thinking about adding a rule that skips generating a separate page when the documentation is short enough.

In principle there could also be an option to do something other than separate pages for each entity --- would just have to figure out what that "something else" would be.

With the C++ autosummary, there is a first pass that uses libclang to generate a JSON description of the API (this could happen either separately from the documentation build, or as the first step of it), and then the actual autosummary extension reads this JSON file.
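
Purely as an illustration (the real JSON schema isn't shown in this thread, so the layout below is made up), the first pass might dump something along these lines, keyed by USR, for the extension to read back:

import json

# Hypothetical schema: one entry per documented entity, keyed by its USR.
api = {
    "c:@S@Test6@F@member#C#I#1": {
        "kind": "function",
        "name": "member",
        "doc": ["A member function."],
        "location": {"file": "func.hpp", "line": 23},
    }
}
with open("cpp_api.json", "w") as f:
    json.dump(api, f, indent=2)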

jbms avatar May 03 '22 05:05 jbms

Your example does make sense for using a separate page. To be clear, I was confused by the structure of the TOC. It looks like each TOC entry is specific to a class or module? I didn't expect that clicking an item in the secondary TOC would lead me to a subpage. That approach kind of exemplifies #58

2bndy5 avatar May 03 '22 05:05 2bndy5

I've been looking into the libclang Python wrapper. There is also a types-clang stub library for it. I think the stubs are maintained by the LLVM team, but not all objects are typed properly (namely the File object). We may also have to provide conf.py options that expose CLI args to libclang (like -p for a compilation database).
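
As a rough sketch of how the -p idea could map onto the Python bindings (paths are hypothetical, and trimming the first/last arguments is a crude heuristic, not a real implementation):

import clang.cindex

# "-p build" would point libclang at build/compile_commands.json
db = clang.cindex.CompilationDatabase.fromDirectory("build")
cmd = db.getCompileCommands("src/mylib.cpp")[0]
args = list(cmd.arguments)[1:-1]  # drop the compiler name and the file itself

index = clang.cindex.Index.create()
tu = index.parse("src/mylib.cpp", args=args)
for diag in tu.diagnostics:
    print(diag)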

Cursor.raw_comment doesn't involve any parsing, so I'm wondering what syntax is expected and whether that syntax could be configurable. Naturally, we'd have to strip the raw_comment of the actual C++ comment syntax (e.g. ///, /** .. */, //!, /*! .. */, //<) and split it into lines. I think it would be easiest to expect rST syntax by default, but it would be extremely handy to support markdown and/or a doxygen flavor (which I think is a combination of markdown and javadoc syntax - still researching this bit).

2bndy5 avatar Jun 23 '22 01:06 2bndy5

The current version that I have implemented only allows /// comments and expects rST syntax, but additionally translates some doxygen commands like \param and \returns. Supporting other syntaxes would be possible, though.
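
As a rough sketch of that kind of translation (not the actual apigen code, just the general idea applied to an already-stripped comment):

import re

def translate_doxygen(lines):
    """Rewrite a couple of common Doxygen commands as rST field lists."""
    out = []
    for line in lines:
        line = re.sub(r"[\\@]param\s+(\w+)\s*", r":param \1: ", line)
        line = re.sub(r"[\\@]returns?\s*", r":returns: ", line)
        out.append(line)
    return out

print(translate_doxygen([r"\param c a character.", r"\returns a character pointer."]))
# [':param c: a character.', ':returns: a character pointer.']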

jbms avatar Jun 23 '22 06:06 jbms

I have written a test script that strips out all of the comment-like syntax:

from textwrap import dedent
from typing import List, Optional


def strip_comment(cmt: Optional[str] = None) -> Optional[List[str]]:
    """
    :param cmt: the `Cursor.raw_comment` property.
    :returns: A list of the lines without the surrounding C++ comment syntax,
        or None if there is no comment.
    """
    if cmt is None:
        return None
    # the first line is never indented (in clang-parsed form)
    # so dedent all subsequent lines only
    first_line_end = cmt.find("\n")
    cmt = cmt[:first_line_end] + dedent(cmt[first_line_end:])
    # split into a list of lines & account for CRLF and LF line endings
    body = [line.rstrip("\r") for line in cmt.splitlines()]

    # strip all the comment syntax out
    if body[0].startswith("//"):
        body = [line.lstrip("/").lstrip("!").lstrip("<").strip() for line in body]
    elif body[0].startswith("/*"):
        body[0] = body[0].lstrip("/").lstrip("*").lstrip("!")
        multi_lined_asterisk = True  # should be ok for single-line comments
        if len(body) > 1:
            line_has_asterisk = [line.startswith("*") for line in body[1:]]
            multi_lined_asterisk = line_has_asterisk.count(True) == len(body) - 1
        body = [
            (line.lstrip("*").lstrip("<").strip() if multi_lined_asterisk else line.strip())
            for line in body
        ]
        body[-1] = body[-1].rstrip("*/").rstrip()
    return body

Still testing it out though...
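
For reference, continuing from the function above, a simple /// comment should come out like this:

raw = "/// A member function.\n/// Takes a character and an int."
print(strip_comment(raw))
# ['A member function.', 'Takes a character and an int.']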

2bndy5 avatar Jun 23 '22 06:06 2bndy5

I found a similar implementation for autodoc-ing C code using libclang... It monkey-patches the libclang Python binding to expose methods that allow using clang to parse the comments and return them as XML (see the monkey patch src here).

So, for a function's docstring that looks like


/*!
 *  \brief A member function.
 *  \param c a character.
 *  \param n an integer.
 *  \exception std::out_of_range parameter is out of range.
 *  \return a character pointer.
 */
const char* Test6::member(char c, int n) const

the patched clang binding's parsed comment can be interpreted with lxml, which results in

<Function isInstanceMethod="1" file="path\\to\\tests\\doxygen\\func.hpp" line="23" column="20">
  <Name>member</Name>
  <USR>c:@S@Test6@F@member#C#I#1</USR>
  <Declaration>const char *Test6::member(char c, int n) const</Declaration>
  <Abstract>
    <Para> A member function.  </Para>
  </Abstract>
  <Parameters>
    <Parameter>
      <Name>c</Name>
      <Index>0</Index>
      <Direction isExplicit="0">in</Direction>
      <Discussion>
        <Para> a character.  </Para>
      </Discussion>
    </Parameter>
    <Parameter>
      <Name>n</Name>
      <Index>1</Index>
      <Direction isExplicit="0">in</Direction>
      <Discussion>
        <Para> an integer.  </Para>
      </Discussion>
    </Parameter>
  </Parameters>
  <Exceptions>
    <Para> std::out_of_range parameter is out of range.  </Para>
  </Exceptions>
  <ResultDiscussion>
    <Para> a character pointer.</Para>
  </ResultDiscussion>
</Function>

This will provide a better way to parse Doxygen-style comments than parsing the raw text manually. 👍🏼
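
As a hedged sketch of what consuming that XML could look like with lxml (element names follow the output above; the embedded snippet is a trimmed stand-in):

from lxml import etree

# Stand-in for the full <Function> element shown above.
xml_text = b"""<Function><Name>member</Name>
  <Abstract><Para> A member function. </Para></Abstract>
  <Parameters><Parameter><Name>c</Name>
    <Discussion><Para> a character. </Para></Discussion>
  </Parameter></Parameters></Function>"""

root = etree.fromstring(xml_text)
brief = root.findtext("Abstract/Para").strip()
params = {
    p.findtext("Name"): p.findtext("Discussion/Para").strip()
    for p in root.iterfind("Parameters/Parameter")
}
print(brief)   # A member function.
print(params)  # {'c': 'a character.'}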

2bndy5 avatar Jul 07 '22 07:07 2bndy5

That's interesting, I didn't know about the XML comment output from clang.

I've been working on putting together an initial/draft version of the cpp-apigen extension based on what I implemented for tensorstore. At this point it is almost ready.

jbms avatar Jul 07 '22 08:07 jbms

I'm working in a separate repo (local only - nothing's been pushed). I'm trying to write a replacement for breathe, but there are things that doxygen does that clang doesn't, namely:

  • parsing markdown files/text - maybe I can use the MyST parser API for this
  • automatic cross-referencing (without any explicit syntax) - this will require a comprehensive search and replace according to parsed members (from translation units/modules). I'm not sure if this is a rabbit hole I want to explore just yet.

I eagerly await your implementation.

2bndy5 avatar Jul 07 '22 08:07 2bndy5

I started exploring the XML approach because the doxygen commands can have nested paragraphs (using @par and @parblock ... @endparblock) within a command that is designed to only accept a single paragraph. Furthermore, Doxygen paragraphs end on either a blank line or a line that starts with a new command that triggers a new paragraph/list/etc. It is my hope that the clang comment parser can alleviate this complexity.

2bndy5 avatar Jul 07 '22 23:07 2bndy5

Doxygen's grouping is probably one of the worst-implemented features it has. I quickly tested this docstring

/**
 *  @ingroup group2
 *  @brief class C3 in group 2
 */
class C3 {};

and got the following parsed xml

<Class file="path\to\tests\doxygen\group.cpp" line="35" column="7">
  <Name>C3</Name>
  <USR>c:@S@C3</USR>
  <Declaration>class C3 {}</Declaration>
  <Abstract>
    <Para> class C3 in group 2</Para>
  </Abstract>
  <Discussion>
    <Verbatim xml:space="preserve" kind="verbatim"> group2</Verbatim>
  </Discussion>
</Class>

So, we'd still have to parse Verbatim nodes to extend/fix Doxygen's grouping commands. Notice that libclang does not include a cursor just for free-standing comment blocks like

/**
 *  @defgroup group2 The Second Group
 *  This is the second group
 */

It also may not consolidate consecutive comment blocks like Doxygen does.

2bndy5 avatar Jul 07 '22 23:07 2bndy5

I don't intend to support @page (and the like) commands because that is specific to Doxygen's ill-conceived static site generator (🤮).

2bndy5 avatar Jul 07 '22 23:07 2bndy5

To be clear --- is that XML output from clang or from doxygen?

One of the nice things about the way I'm currently parsing doc comments is that line number information is preserved, so that any Sphinx warnings and errors are shown with a location in the original source file.
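
One way to do that with docutils (this is an assumed approach; the content, path, and line numbers below are made up) is to attach the original file and line to each line of the comment as it is fed to the rST parser:

from docutils.statemachine import StringList

doc_lines = ["A member function.", "", ":param c: a character."]  # stripped comment text
first_line = 23  # line of the comment in the original header

content = StringList()
for offset, line in enumerate(doc_lines):
    # each line carries its source file and line, so Sphinx warnings point back at the C++
    content.append(line, source="include/mylib/api.h", offset=first_line + offset)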

In general, I think it may be sufficient to just support the most common doxygen commands, e.g. so that most doc comments don't have to be changed.

jbms avatar Jul 07 '22 23:07 jbms

What is it that you don't like about doxygen's grouping? The grouping feature I've implemented for the C++ and Python apigen extensions is certainly inspired by doxygen, though there are some differences.

jbms avatar Jul 08 '22 00:07 jbms

To be clear --- is that XML output from clang or from doxygen?

I'm not running Doxygen at all. It's all clang output (using the monkey patch I referenced) then parsed with lxml.etree.fromstring(). The XML output you see is from print(lxml.etree.tostring(<root _Element>, pretty_print=True).decode())

What is it that you don't like about doxygen's grouping?

It divides up the summary into "modules" (not to be confused with compilation units). Anything within groups or subgroups won't show up in its original namespace's document summary. This makes navigating Doxygen-generated docs (for C++ libs) that use grouping extensively very confusing.

2bndy5 avatar Jul 08 '22 00:07 2bndy5

I see you expose filtering options in the config like dis/allow_symbols/namespaces/macros. Could this be simplified by exposing these as directive options (like autodoc's :members:)? The regex approach seems geared toward the developer's convenience rather than the user's.

The ignore_template_parameters and hide_initializers options seem like they would be more specific to certain object types (like classes/functions/attributes).

2bndy5 avatar Jul 08 '22 05:07 2bndy5

The way it currently works is that in the builder-inited event, which happens before sphinx actually reads any of the rst documents (or even determines the list of rst files), we parse the code, extract the API description as JSON, and generate the separate pages for documenting each entity. Then the directives that are parsed when sphinx processes the rst files just insert group summaries.
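
In extension terms, that ordering looks roughly like this (a minimal sketch, not the actual apigen code; the helper name is made up):

def _generate_api_pages(app):
    # Hypothetical helper: run the libclang pass, write the JSON API
    # description, and emit one generated .rst page per entity under
    # app.srcdir, all before Sphinx starts reading any documents.
    ...

def setup(app):
    app.connect("builder-inited", _generate_api_pages)
    return {"parallel_read_safe": True}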

By the time the rst directives are parsed, it is too late to generate more files, at least with the current sphinx build process.

The built-in autosummary extension does something hacky to work around this limitation --- it first does its own "parsing" of all the source files to locate any autosummary directives, but this isn't a real parsing of the rst, just some regular expressions. That means it doesn't work if another directive generates an autosummary directive.

In any case, for tensorstore the current configuration method seemed to work pretty well for excluding internal/private APIs while basically including everything else.

jbms avatar Jul 08 '22 06:07 jbms

I didn't see any filtering of undocumented members or by access specifier. I'd like to add that as well, but sometimes the Cursor.access_specifier.name rightfully shows as "INVALID" (for decls directly in global/namespace scope) or "NONE" (for certain definitions)...
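
A filter along those lines might look something like this (a hypothetical predicate, written to tolerate the INVALID/NONE cases):

from clang.cindex import AccessSpecifier

def is_documentable(cursor) -> bool:
    # Skip private/protected members; INVALID (namespace/global scope) and
    # NONE fall through and are kept.
    return cursor.access_specifier not in (
        AccessSpecifier.PRIVATE,
        AccessSpecifier.PROTECTED,
    )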

While researching, I decided to keep a list of parsed USRs and avoid serializing entities (to JSON) with duplicate USR ids (this sped up serialization a bit). Clang seems to copy the first docstring it finds for decls or defs, which made having duplicate USRs redundant to me.
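
The de-duplication itself is simple enough; a sketch of the idea (not the exact code):

seen_usrs = set()

def should_serialize(cursor) -> bool:
    # serialize each unique USR only once; later cursors with the same USR
    # (re-declarations) are skipped
    usr = cursor.get_usr()
    if not usr or usr in seen_usrs:
        return False
    seen_usrs.add(usr)
    return True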

2bndy5 avatar Jul 08 '22 06:07 2bndy5

Currently there is no access specifier filtering, but entities without a doc comment are skipped.

There can be more than one declaration but only one definition within a single translation unit. The information from all occurrences is merged (doc comment, default template arguments).

jbms avatar Jul 08 '22 06:07 jbms

The idea is that you arrange to include all the headers you want to document in a single translation unit (i.e. unity build), that way every header is parsed only once.

jbms avatar Jul 08 '22 06:07 jbms

I often see undocumented members whose names are treated as if they're self-explanatory (and including them could be useful when cross-referencing an undocumented member).

2bndy5 avatar Jul 08 '22 06:07 2bndy5

The idea is that you arrange to include all the headers you want to document in a single translation unit (i.e. unity build), that way every header is parsed only once.

Some APIs have more than one entrypoint, though. I agree with the concept; sadly, it isn't a perfect world.

2bndy5 avatar Jul 08 '22 06:07 2bndy5

I wish clang would expose a way to filter the AST by its relation to the indexed translation unit.

2bndy5 avatar Jul 08 '22 06:07 2bndy5

The idea is that you arrange to include all the headers you want to document in a single translation unit (i.e. unity build), that way every header is parsed only once.

Some APIs have more than one entrypoint, though. I agree with the concept; sadly, it isn't a perfect world.

What do you mean by entrypoint? As long as they can be used together, you can just create a new header that #includes all of the headers you need.

jbms avatar Jul 08 '22 06:07 jbms

I wish clang would expose a way to filter the AST by its relation to the indexed translation unit.

What do you have in mind as far as relativity?

jbms avatar Jul 08 '22 06:07 jbms

If it isn't directly in the translation unit's Cursor.location.file, then ignore it. That would make our lives a bit simpler, and I imagine users could better understand what needs to be specified as source files (instead of tracing their own sources' includes).

2bndy5 avatar Jul 08 '22 06:07 2bndy5

That could be done with the allow_paths option but is basically the opposite of what I intended as far as the unity build approach, unless your library defines everything in a single header.

jbms avatar Jul 08 '22 06:07 jbms

The issue with parsing every public header of a library separately is that there is likely to be huge overlap in the transitive dependencies of those headers, which means a lot of headers will have to be parsed multiple times, making it much slower.

jbms avatar Jul 08 '22 06:07 jbms

What do you mean by entrypoint?

Some projects have a lib and an executable as separate build entrypoints.

As long as they can be used together, you can just create a new header that #includes all of the headers you need.

This sounds like a good use for the unsaved_files arg to clang.cindex.Index.parse(). Maybe we could ask the user for all the lib sources and assemble a header file like this on the fly. That way, only one translation unit is needed.
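
Something like this, perhaps (paths and flags are hypothetical):

import clang.cindex

headers = ["include/mylib/a.h", "include/mylib/b.h"]  # gathered from the user
unity_src = "".join(f'#include "{h}"\n' for h in headers)

index = clang.cindex.Index.create()
tu = index.parse(
    "mylib_unity.h",  # virtual file name; never written to disk
    args=["-x", "c++", "-std=c++17", "-Iinclude"],
    unsaved_files=[("mylib_unity.h", unity_src)],
)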

2bndy5 avatar Jul 08 '22 06:07 2bndy5

Um, how do you want to collaborate on this? I ask because I know you like to force-push after amending git history. Should I PR any proposals into the cpp-apigen branch?

I'm going to patch in support for all doc comment prefixes and a corresponding function to strip the comment syntax (resulting in just the comments' text as List[str]).

2bndy5 avatar Jul 08 '22 13:07 2bndy5