datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

[EPIC] Automatically generate all function content from code

Open alamb opened this issue 1 year ago • 7 comments

Is your feature request related to a problem or challenge?

Currently, when we add a new function to DataFusion library we have to remember to document that function in the documentation, but currently we have to remember this during code review.

This has a few downsides:

  1. we have forgotten to document some functions
  2. the documentation has drifted over time,
  3. help text for various functions can only be found on the DataFusion website, and not, for example within the function itself.

@Omega359 has created the basic framework for automatically generating content in https://github.com/apache/datafusion/pull/12668

This ticket tracks the work required to port the rest of the documentation to programatic form

Tasks

  • [x] https://github.com/apache/datafusion/issues/12432
  • [ ] https://github.com/apache/datafusion/pull/12743
  • [x] https://github.com/apache/datafusion/pull/12745
  • [ ] Migrate all existing UDF's to the new format (see below)
  • [ ] https://github.com/apache/datafusion/issues/12741
  • [ ] Add support for DESCRIBE xyz; where xyz is a function. Ideally this would support not just the core functions in DF but also any added externally (for example those in https://github.com/datafusion-contrib/datafusion-functions-extra) and verify it works in the CLI
  • [ ] Look into generating the function sections in expressions.md (it's a bit out of date)
  • [ ] Make generate_function_docs.sh error if there is a function without documentation

Aggregate Function Documentation Migration

  • [x] #12827
  • [x] Port / Add Documentation for BitwiseOperation
  • [x] Port / Add Documentation for VarianceSample https://github.com/apache/datafusion/pull/12742
  • [x] Port / Add Documentation for VariancePopulation https://github.com/apache/datafusion/pull/12742

Scalar Function Documentation Migration

  • [ ] https://github.com/apache/datafusion/issues/11172
  • [x] https://github.com/apache/datafusion/pull/12775
  • [x] https://github.com/apache/datafusion/issues/12927
  • [x] #12828
  • [x] #12859
  • [x] #12858
  • [x] #12867
  • [x] #12868
  • [x] #12934
  • [x] #12935

Window Function Documentation Migration

  • [x] Port / Add Documentation for Lead
  • [x] Port / Add Documentation for RowNumber
  • [ ] #12936

alamb avatar Oct 03 '24 15:10 alamb

Update is I have a few follow on PRs

  • [x] https://github.com/apache/datafusion/issues/12741
  • [x] https://github.com/apache/datafusion/issues/12432

Then I figure I will file one or two other tickets to see how easy it is to port other functions

And if all goes well I'll make a ticket storm to port the docs

alamb avatar Oct 03 '24 19:10 alamb

FYI I've finished up the string expression migration locally - just waiting on the updated code from #12742 to land before pushing a PR.

Omega359 avatar Oct 04 '24 21:10 Omega359

FYI I've finished up the string expression migration locally - just waiting on the updated code from #12742 to land before pushing a PR.

Thanks, I'll merge those as soon as they get a review (I can't merge them until they get an approval from another committer)

alamb avatar Oct 04 '24 21:10 alamb

No rush I just wanted to make sure no one was duplicating work needlessly

Omega359 avatar Oct 05 '24 16:10 Omega359

@alamb I think it should be mentioned here that ./dev/update_function_docs.sh should be ran, as it wasn't immediately obvious, and when I did my first migration I took a bit of time to find it.

jonathanc-n avatar Oct 11 '24 04:10 jonathanc-n

@alamb I think it should be mentioned here that ./dev/update_function_docs.sh should be ran, as it wasn't immediately obvious, and when I did my first migration I took a bit of time to find it.

Thank you @jonathanc-n -- I added it to the description of https://github.com/apache/datafusion/issues/12859 and will add it on subsequent tickets

alamb avatar Oct 11 '24 11:10 alamb

Thanks to @jonathanc-n @Omega359 and others we are cranking right along with this. I expect it to be done sometime later this week

alamb avatar Oct 15 '24 15:10 alamb

I think we are pretty close here

After we merge these PRs from @jonathanc-n and myself

  • https://github.com/apache/datafusion/pull/13069
  • https://github.com/apache/datafusion/pull/13047

We then have:

  • [x] https://github.com/apache/datafusion/issues/13021

Then we have just a few more functions to document (lead, lag) and we can turn on a CI check to ensure all new functions are documented

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion$ ./dev/update_function_docs.sh
/Users/andrewlamb/Software/datafusion
Inserting header
Running CLI and inserting aggregate function docs table
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.34s
     Running `target/debug/print_functions_docs aggregate`
Running prettier
docs/source/user-guide/sql/aggregate_functions_new.md 59ms
'docs/source/user-guide/sql/aggregate_functions_new.md' successfully updated!
Inserting header
Running CLI and inserting scalar function docs table
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.09s
     Running `target/debug/print_functions_docs scalar`
INFO: The following functions do not have documentation:
  - map_values
  - map_keys
  - map
  - map_extract
Running prettier
docs/source/user-guide/sql/scalar_functions_new.md 208ms
'docs/source/user-guide/sql/scalar_functions_new.md' successfully updated!
Inserting header
Running CLI and inserting window function docs table
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.09s
     Running `target/debug/print_functions_docs window`
INFO: The following functions do not have documentation:
  - lag
  - lead
Running prettier
docs/source/user-guide/sql/window_functions_new.md 26ms
'docs/source/user-guide/sql/window_functions_new.md' successfully updated!

alamb avatar Oct 23 '24 11:10 alamb

lead/lag PR @ https://github.com/apache/datafusion/pull/13095

Other functions that still exist on scalar functions page - https://github.com/apache/datafusion/issues/13036

Omega359 avatar Oct 24 '24 20:10 Omega359

Here is a PR to make sure no new functions are added without documentation: https://github.com/apache/datafusion/pull/12938

alamb avatar Oct 24 '24 20:10 alamb

🎉 I think all we need is https://github.com/apache/datafusion/issues/13171 and we can claim this project is complete. Thanks again @Omega359 @buraksenn and @jonathanc-n for pushing it along

alamb avatar Oct 29 '24 18:10 alamb

🎉 epic

alamb avatar Nov 13 '24 20:11 alamb