arrow icon indicating copy to clipboard operation
arrow copied to clipboard

ARROW-12526: Pre-generating pyarrow.compute and creating a docstring additions system for pyarrow functions

Open krcrouse opened this issue 2 years ago • 7 comments

This PR addresses both the JIRA issue cited (pre-generate pyarrow.compute) and also a dev thread that suggests creating the ability to add in python docs for functions that inherit from the Arrow C++ would greatly improve the readability for python users.

There are still a few things to work out, such as where in the build process to generate the code and whether a version of the generated code should be checked into version control or not, but @pitrou suggested opening the PR to field comments from developers.

Major points:

  • creates python/docs/additions tree where the reStructrued text docs that include the sections to overwrite. Using raw reSt so that code block examples can be tested using doctest - see the README for more verbose details
  • pyarrow.docutils (or maybe should be _docutils) provides functions to processes python/docs/additions and return a data structure of the components per function.
  • python/scripts/generate_sources.py uses pyarrow.docutils and writes out the code for the compute functions in pyarrow/generated/compute.py. All of the logic from the release-branch pyarrow.compute module that dynamically generated the compute functions has been moved to this script.
  • I didn't check the generated file into the repo because I generally do not include generated files that would be generated by the build process should be in source control, but I realize there are other perspectives on this
  • pyarrow.compute now imports from pyarrow.generated.compute for all of the autogenerated compute bindings. Override and custom functions are still defined here.
  • The old pyarrow._compute_docstrings is gone because its purpose is subsumed in the above.
  • I've updated the tests so that they work with the above changes.

krcrouse avatar May 12 '22 02:05 krcrouse

https://issues.apache.org/jira/browse/ARROW-12526

github-actions[bot] avatar May 12 '22 02:05 github-actions[bot]

:warning: Ticket has not been started in JIRA, please click 'Start Progress'.

github-actions[bot] avatar May 12 '22 02:05 github-actions[bot]

To save you all from having to check out the branch and generate the code, attached is what the python/pyarrow/generated/compute.py file looks like at present (added as a .txt file because github rejects .py) compute.py.txt .

krcrouse avatar May 12 '22 02:05 krcrouse

@krcrouse Just a word to tell you that I hadn't a chance to take a look yet, but it's definitely on my TODO list (or may be switched to someone else's :-)). I hope the wait isn't too demotivating.

pitrou avatar Jul 07 '22 10:07 pitrou

@krcrouse thanks a lot for your work on this!

I didn't yet look in detail, but some general comments / questions on the overall appraoch:

  • I think we will probably want to commit the generated code to the repo (and in any case, doing that for now might also make it easier to review (although you already linked the file above as well)).

  • Can you explain a bit more the need for pyarrow.docutils? (we should also make it a private module, or put in eg python/scripts or so?) As far as I understand, by using the full ability of docutils, we can override any section of the docstring (eg also overriding the Parameters section), instead of only appending content to the docstring? (eg appending the example of notes section). Do we think we will really need that full ability? Or it might in almost all cases be sufficient to just append content? (I think for all cases you currently have additions in this PR, appending would also be fine?) If appending is sufficient, that might keep the generation code simpler?

  • pyarrow.compute now imports from pyarrow.generated.compute

    Small nit: this doesn't need to be a public module, so maybe something like _compute_generated.py instead?

  • Using raw reSt so that code block examples can be tested using doctest

    We now started to tests docstring examples in general in the meantime (see https://github.com/apache/arrow/pull/13199, https://github.com/apache/arrow/pull/13216, https://github.com/apache/arrow/pull/13325), so we can probably test those docstrings that way as well.

jorisvandenbossche avatar Jul 07 '22 14:07 jorisvandenbossche

@jorisvandenbossche, I'll wait for some more guidance from you all and just respond to your comments inline for the moment.

  • I think we will probably want to commit the generated code to the repo (and in any case, doing that for now might also make it easier to review (although you already linked the file above as well)).

That makes sense and probably helps in the overall structure.

  • Can you explain a bit more the need for pyarrow.docutils?

docutils is used to process the reStructured text appendices so that they can be merged with the autogenerated docs. In this model, I propose using reSt for the function documentation additions because it's established, it's testable, and contributors will at least understand what it is even if they're not proficient in it.

If your question is more about "the need for it as a required module" - If we include the generated files then the docutils requirement only needs to be a build dependency in order to generate the actual code files. It does not need to be included as a requirement for the actual package for end users.

(we should also make it a private module, or put in eg python/scripts or so?)

Agreed - it should be a private module.

As far as I understand, by using the full ability of docutils, we can override any section of the docstring ... Do we think we will really need that full ability? Or it might in almost all cases be sufficient to just append content?

I think we would want both options. Since the default documentation is pulling from the C++ library code, I think you could browse the current generated documentation and see sections that are not useful and could be entirely overwritten. I also think the hybrid approach of creating default documentation with options to append and/or overwrite is best because it will pull in changes to the core C++ function interface automatically while preserving the manually provided improved pythonic documentation.

Take, for example, the parameter definitions of replace_with_mask - in the context of the still not terribly verbose description, the parameter types are incorrect and the explanation of each parameter is useless ('Argument to compute function."). I don't have the code in front of me at the moment, but having implemented this function in pyarrow 7.0, there are several oddities in how these parameters are handled that should be in the param descriptions and not merely as an appended paragraph at the bottom.

Small nit: this doesn't need to be a public module, so maybe something like _compute_generated.py instead?

  • Using raw reSt so that code block examples can be tested using doctest

Agreed.

krcrouse avatar Jul 11 '22 02:07 krcrouse

Hi, @jorisvandenbossche and @pitrou,

I've pushed new updates to this branch based on the comments, including the fully generated source files in pyarrow/_compute_generated.py

I've resolved updates to the tests and included the new compute functions that have been added since the original creation of the branch.

In line with the movement towards docutils, the following will test all of the examples that get pulled into the pyarrow.compute function documentation: python -m doctest -v docs/additions/compute/*

krcrouse avatar Jul 24 '22 16:07 krcrouse

Hi @jorisvandenbossche and @pitrou,

I made a number of additions so that the autogenerated _compute.py code conformed with flake8, since it doesn't distinguish between hand written and autogenerated code. I also updated the generated function signatures to be compliant with Python 3.7.

Please let me know how you would like to move forward on this / if you do.

If I'm reading the workflow output correctly, I think the only issue right now is that archery is failing due to the lack of licenses in the ReSTructured text files, and I am not sure how that can be added as I don't know of any standard for adding licenses to ReST (output from workflow failure below)

INFO:archery:Running Docker linter
apache-rat license violation: python/docs/additions/compute/all.rst
apache-rat license violation: python/docs/additions/compute/any.rst
apache-rat license violation: python/docs/additions/compute/count.rst
apache-rat license violation: python/docs/additions/compute/count_distinct.rst
apache-rat license violation: python/docs/additions/compute/filter.rst
apache-rat license violation: python/docs/additions/compute/index.rst
apache-rat license violation: python/docs/additions/compute/indices_nonzero.rst
apache-rat license violation: python/docs/additions/compute/mode.rst
apache-rat license violation: python/docs/additions/test_example.rst

krcrouse avatar Aug 26 '22 12:08 krcrouse

@krcrouse thanks for the updates! And sorry for the slow follow-up on your responses.

As far as I understand, by using the full ability of docutils, we can override any section of the docstring ... Do we think we will really need that full ability? Or it might in almost all cases be sufficient to just append content?

I think we would want both options. Since the default documentation is pulling from the C++ library code, I think you could browse the current generated documentation and see sections that are not useful and could be entirely overwritten. I also think the hybrid approach of creating default documentation with options to append and/or overwrite is best because it will pull in changes to the core C++ function interface automatically while preserving the manually provided improved pythonic documentation.

(yes, to be clear my question about the need for docutils was about this (do we think appending is sufficient, or do we want the more fine grained replacement), and not to question to use of restructuredtext)

I am personally still a bit hesitant to go the full docutils way here. I certainly see the value of the flexibility it provides, but it also does introduce quite some additional code that needs to be maintained for this. And for now, all the doc additions are pure "append" ones, which could be implemented with much less code (although I know it's the goal to expand the set of doc additions, of course).

Take, for example, the parameter definitions of replace_with_mask - in the context of the still not terribly verbose description, the parameter types are incorrect and the explanation of each parameter is useless ('Argument to compute function.")

Yes, fully agreed that's basically useless. Those are dummy auto-generated on the python side, and the more general solution might be to actually start including argument descriptions in the C++ docs, so we can pull that as well into the python docstrings. There is a TODO about this (cc @pitrou):

https://github.com/apache/arrow/blob/b832853ba62171d5fe5077681083fc6ea49bfd44/cpp/src/arrow/compute/function.h#L132-L135

Although of course, if only Python would make use of those extra descriptions, we could maybe as well keep them on the python side ..

(to be clear, I am not saying that we certainly don't want the more advanced docutils-based approach, but some input from others would be welcome)

jorisvandenbossche avatar Aug 26 '22 13:08 jorisvandenbossche

Would we need a check in CI that ensures the generated compute file is up-to-date? (I am not directly sure how we do that in other places with generated files)

jorisvandenbossche avatar Aug 26 '22 13:08 jorisvandenbossche

Would we need a check in CI that ensures the generated compute file is up-to-date? (I am not directly sure how we do that in other places with generated files)

Yes. One straightforward (though arguably not terribly elegant) way to check this would be to have the CI task run python scripts/generate_sources.py with a flag to redirect output and ensure that the output was exactly the same as the python/pyarrow/_compute_generated.py file.

krcrouse avatar Aug 30 '22 23:08 krcrouse

@jorisvandenbossche ,

Here's the run down of the most recent push to the branch, which includes updates from upstream as of the end of last week:

Quick Summary Points (many that condense prior one-off comments):

  • Based on your comment, I moved python/pyarrow/_rstutils.py into python/scripts/lib/arrowdoc.py as it is not needed. I don't love a sub-lib of scripts, but it's at least straightforward to then have the generate_sources.py script reference it.
  • As I had replied earlier, there should eventually have a step added to the build or CI process that verifies the committed version of python/pyarrow/_compute_generated.py is the same thing that is output from the generate_sources.py script.
  • The archery lint --rat test is failing because the rst files in python/docs/additions/compute/ don't have license text and I do not know how do that in a sensible way and don't see it in docs or examples.

As for the points you brought up about docutils - I wonder if you are misunderstanding its role and how I use it in the PR. It's probably more straightforward with the reorganization in this round. Now, the docutils modules is only used in the scripts/lib/arrodoc.py file, and there it is only used to parse the reST files in python/docs/additions/compute/ into the document tree that is then merged/replaced/appended to the docs pulled from the C++ libraries. I did add it to the requirements-build and requirements-wheel-build files for the PR, but neither of those are really accurate. It's required solely to run the script, and so I felt that it would need to go into some sort of "requirements," but if you're not regenerating the compute functions for python, it's not required for anything.

krcrouse avatar Sep 04 '22 15:09 krcrouse

Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍

amol- avatar Mar 30 '23 17:03 amol-

@krcrouse sorry for letting this slip my mind. I will try to take a look at the latest version and your last comment shortly.

jorisvandenbossche avatar Mar 31 '23 07:03 jorisvandenbossche

@krcrouse sorry for letting this slip my mind. I will try to take a look at the latest version and your last comment shortly.

@jorisvandenbossche, sure let me know if this is something that you would like to pursue. I haven't updated it in quite a while since there hasn't been much interest but I'd be willing to do so if there's a desire to incorporate it.

krcrouse avatar Apr 04 '23 16:04 krcrouse