EPIC: Improve Documentation, Tutorials, and Examples
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
DataFusion is too difficult for new users to learn. See https://towardsdev.com/writing-a-data-pipeline-in-rust-with-datafusion-25b5e45410ca for one user's experience, which is summarized here:
- The API documentation is pretty bad: less frequently used functions have no documentation or incomplete documentation, and lack examples showing how to use them. So I had to guess and try out many things to use some of the functions, such as `to_timestamp`, `date_part`, and `when(...).otherwise()`.
- The data-reading documentation also lacks examples; only 2–3 are given in the API docs. But you might need another way of reading the data: for example, first defining the schema and then using it to read the data, rather than having the framework infer the schema. This is not in the docs.
- There is no tutorial-like example in the docs; this is also true of the Rust API for Polars DataFrame.
- From a user's point of view, I think documentation is the weakest part of the framework; on top of that, Rust itself is not that easy.
Describe the solution you'd like
TBD. This issue is an EPIC to track tasks to improve the situation.
User Guide
- [x] https://github.com/apache/arrow-datafusion/issues/3065
- [x] https://github.com/apache/arrow-datafusion/issues/3066
- [x] https://github.com/apache/arrow-datafusion/issues/3091
- [ ] https://github.com/apache/arrow-datafusion/issues/3092
- [ ] https://github.com/apache/arrow-datafusion/issues/3399
- [ ] Add a tutorial section?
Rust Docs (docs.rs)
- [ ] Update SQL functions rustdocs with example usage
Examples
- [ ] Review current examples and see how they can be improved
- [ ] Include examples reading from object stores (S3, Azure, GCS)
Developer/Contributor Guide
- [ ] TBD
Python Docs
- [ ] https://github.com/apache/arrow-datafusion-python/issues/33
Older Issues To Be Reviewed
- https://github.com/apache/arrow-datafusion/issues/1814
- https://github.com/apache/arrow-datafusion/issues/1578
- https://github.com/apache/arrow-datafusion/issues/1487
- https://github.com/apache/arrow-datafusion/issues/1352
- https://github.com/apache/arrow-datafusion/issues/825
Describe alternatives you've considered
None
Additional context
None
As a relatively new user of DataFusion, I agree the docs on the site are pretty bad, but not so bad that I couldn't understand them without being overwhelmed :) However, there's plenty of room for improvement, and I've been doing a few things related to this:
- reviewing all the documentation-tagged issues in this project
- reviewing other similar projects' documentation to see how it's structured, what's helpful to me as a newbie to their project, and what's confusing about their documentation
- trying (locally) different structure for documentation based on the above
I've come up with a few changes that I think will help:
- realize that the DataFusion library itself is the product here, not DataFusion-CLI. DataFusion-CLI is a tool for demonstrating the power of the library quickly and easily for potential users (among other uses, but to me this seems to be the primary use case -- if it's not, it's worth clarifying what the intended use of DataFusion-CLI is).
- as such, DataFusion-CLI should have a supporting role in the documentation
- as such, there should be more examples of how potential new users can use DataFusion-CLI to quickly demonstrate for themselves how the DataFusion library can help them
- therefore there should be multiple examples of how to run DataFusion-CLI, how to register a variety of data into the DataFusion context, and the power of the SQL -- examples should cover loading data from local, from object store like S3, partitioned and not, different formats, etc. Queries should be run. Explain plans and `explain analyze` should be shown.
- probably the CLI itself will need some changes to make it easier to use against S3 or Azure Blob Storage. It should be able to parse the location given in a `create external table` command to automatically register an object store using sensible default authentication methods. For example, if I already have my environment configured to use the AWS CLI, then I should be able to start up DataFusion-CLI and run `create external table test stored as parquet location 's3://my-bucket/content'` and it should "just work".
- there should be a separate User Guide and a Developer Guide (or maybe call it a "Contributor Guide"?)
- yes, in the User Guide, I think the functions we have in DataFusion should be documented. for new prospective users browsing docs, it's important to see the wide variety of useful functions that exist.
- related to this, some of the documentation makes it appear that DataFusion is less capable than it actually is -- in particular, the bit that describes the very basic SQL syntax that works (implying that more complex SQL won't work)
- ETL is called out as a particular use case for DataFusion, but none of the examples demonstrate ETL pipelines. In my opinion, we definitely need some examples or even just write-ups about how DataFusion can work as an ETL tool.
It's still very much a WIP, but I have a structure that I think mostly makes sense in this branch: https://github.com/kmitchener/arrow-datafusion/tree/doc-improvements (related to PR #3005)
The above incorporates some of the thoughts from #1821 and #1814 as well.
Thanks @kmitchener that is great feedback
I had a conversation with @MrPowers today which inspired me to try to organize ideas to improve the DataFusion documentation.
I moved all the unfinished tickets into a new epic, https://github.com/apache/arrow-datafusion/issues/7013, and am going to close this one so the current state of things is clearer. Let's continue the conversation there.