arrow icon indicating copy to clipboard operation
arrow copied to clipboard

ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Open djnavarro opened this issue 3 years ago • 13 comments

This pull request proposes a number of changes to the pkgdown site for the R package:

  • Reduces the content on the README page to the essential points
  • Rewrites the "get started" page to focus on common tasks and novice users
  • Moves discussion of the Arrow data object hierarchy to a new "data objects" vignette
  • Moves discussion of Arrow data types and conversions to a new "data types" vignette
  • Moves discussion of schemas and storage of R attributes to a new "metadata" vignette
  • Moves discussion of package naming conventions to a new "package conventions" vignette
  • Moves discussion of read/write capabilities to a new "reading and writing data" vignette
  • Moves discussion of the dplyr back end to a new "data wrangling" vignette
  • Edits the "multi-file data sets" vignette to improve readability and to minimize risk of novice users unintentionally downloading the 70GB NYC taxi data by copy/paste errors
  • Minor edits to the "python" vignette to improve readability
  • Minor edits to the "cloud storage" vignette to improve readability
  • Minor edits to the "flight" vignette to improve readability
  • Inserts a new "data object layout" vignette (in the developer vignettes) to bridge between the R documentation and the Arrow specification page

In addition there are some structural changes:

  • Some vignette filenames have been edited and links updated
  • The pkgdown template now uses bootstrap 5
  • The developer vignettes are now in the main vignettes folder
  • The articles menu organizes the vignettes into meaningful categories
  • The former "project docs" menu has been replaced with a sidebar on the main page

Possible issues as yet unaddressed:

  • I have not yet checked whether the bootstrap 5 template breaks the script inserting the documentation versions switcher
  • Changes to developer vignettes are extremely minimal in comparison to other vignettes. I'm uncertain whether to make further changes there or to defer that to a later PR
  • The "articles" menu currently hides all the developer vignettes under the "more articles" link: they should be made more prominent and easy to find
  • Some topics may not be described as well as we'd like?
  • The proposed edits to the vignettes include a lot more code that is executed at build time. This may not be desirable?

It is still a work-in-progress but feedback is appreciated. To make it easy to preview the proposed changes, a preview copy of the built documentation is posted here: https://djnavarro.net/draft-arrow-docs/

djnavarro avatar Oct 26 '22 04:10 djnavarro

https://issues.apache.org/jira/browse/ARROW-17887

github-actions[bot] avatar Oct 26 '22 05:10 github-actions[bot]

:warning: Ticket has not been started in JIRA, please click 'Start Progress'.

github-actions[bot] avatar Oct 26 '22 05:10 github-actions[bot]

I have not yet checked whether the bootstrap 5 template breaks the script inserting the documentation versions switcher

I'd expect it will (see discussion on https://github.com/apache/arrow/pull/12531). Happy to take a look at that as I did the original version switcher stuff (equally, happy to leave it with you, just let me know).

One thing I've noticed that will have a bigger impact on the version switcher is changing the structure and layout of the vignettes - we'd need to do a load of work to manually map the new filenames to the old ones, as the version switcher looks for the filename in the selected version's website to determine whether to navigate to the older version of that page, or just go to the index page.

All that said though, you could make the argument that whilst navigating between docs for different versions of a function may be desirable, it might be less important for vignettes, and so we could just link to the articles index page in those cases.

thisisnic avatar Oct 26 '22 08:10 thisisnic

The proposed edits to the vignettes include a lot more code that is executed at build time. This may not be desirable?

Could ARROW-17655 help with this?

thisisnic avatar Oct 26 '22 08:10 thisisnic

As this is a large PR (maybe with future ones it might be easier to break into a series of smaller ones to make it easier to review, though fair enough that it's tricky here when you're moving content around as well as updating it), I'm adding checkboxes here so I can work out what I have or have not looked at.

  • [x] Reduces the content on the README page to the essential points
  • [x] Rewrites the "get started" page to focus on common tasks and novice users
  • [x] Moves discussion of the Arrow data object hierarchy to a new "data objects" vignette
  • [x] Moves discussion of Arrow data types and conversions to a new "data types" vignette
  • [x] Moves discussion of schemas and storage of R attributes to a new "metadata" vignette
  • [x] Moves discussion of package naming conventions to a new "package conventions" vignette
  • [x] Moves discussion of read/write capabilities to a new "reading and writing data" vignette
  • [x] Moves discussion of the dplyr back end to a new "data wrangling" vignette
  • [x] Edits the "multi-file data sets" vignette to improve readability and to minimize risk of novice users unintentionally downloading the 70GB NYC taxi data by copy/paste errors
  • [x] Minor edits to the "python" vignette to improve readability
  • [x] Minor edits to the "cloud storage" vignette to improve readability
  • [x] Minor edits to the "flight" vignette to improve readability
  • [x] Inserts a new "data object layout" vignette (in the developer vignettes) to bridge between the R documentation and the Arrow specification page

thisisnic avatar Oct 26 '22 08:10 thisisnic

@djnavarro Are all of the "moves discussion of..." items copied verbatim?

thisisnic avatar Oct 26 '22 11:10 thisisnic

@djnavarro Are all of the "moves discussion of..." items copied verbatim?

Unfortunately, no -- they're almost all new vignettes that expand the original content. Genuinely sorry for the huge PR btw -- I wanted to do all this in many small steps but found myself in this situation where everything felt connected to everything else. The reorganisation and rewrites ended up happening in parallel 😬

djnavarro avatar Oct 26 '22 19:10 djnavarro

All that said though, you could make the argument that whilst navigating between docs for different versions of a function may be desirable, it might be less important for vignettes, and so we could just link to the articles index page in those cases.

After thinking about it for a little I'm starting to think this might be the best approach to vignette linking across versions. The structure of the vignettes has changed quite a bit in this rewrite, and there's not really a 1:1 mapping between old and new. I was unhappy about it because I hate breaking backwards compatibility even in the docs, but it's not like we could fix the issue with a simple file rename -- the underlying content is different enough in some cases that there's nothing useful to be gained by pretending that the old "get started" vignette is meaningfully the same document as the new one? So yeah maybe we just map to the "articles" page, since that's guaranteed to be structurally the same thing in every release?

djnavarro avatar Oct 26 '22 21:10 djnavarro

This is great, thank you for taking this on. I'll give the content a close read at some point, but a couple of quick considerations:

  • We should be careful about changing the vignette filenames since they map to URLs and URLs make up an API. For some of the lesser vignettes it's maybe not a big deal, but the number of links to vignette("install") i.e. https://arrow.apache.org/docs/r/articles/install.html I've made out in the internet (and including twice in our own r/configure) make me wary of changing that one in particular.
  • We also should avoid adding .pngs to the R package tarball. We're already at 4.7mb without this change (I only know that because I noticed today when doing the CRAN submission) and the CRAN limit is 5mb. I don't know what the right pkgdown way of handling these extra documents is, but we should do that. We basically are trying to populate the website and don't really care if all of these vignettes ship in the package itself.

nealrichardson avatar Oct 26 '22 22:10 nealrichardson

Thanks Neal!

  • Yeah I think you're right about file renaming. It's not something I'm keen on myself, but found myself being pushed in that direction because of the scale of the rewrite. In some cases the content has changed so much that I'm not sure there's any meaningful continuity to be preserved. That being said, install.Rmd is a really good example of one that we should leave as-is. I think dataset.Rmd (which hasn't changed) is another one like that

  • I wonder if we can convert the vignettes to articles (in the pkgdown sense) so they don't get bundled in the build. It probably means we have to rewrite the cross-linking markdown since vignette("whatever") won't work anymore, but at some point I think we might have to make that change if we want to include images to help make the documentation readable to novice users?

djnavarro avatar Oct 27 '22 00:10 djnavarro

possible solution for the file renaming issue? pkgdown can generate client-side redirects like this: d022810a2b8e104ccf23b36d432bf78342d6256d. not the ideal way to do redirects, but it's an option I suppose

djnavarro avatar Oct 27 '22 05:10 djnavarro

Is it OK to use base pipes (|>) that only works with R4.1 or later in vignettes?

eitsupi avatar Oct 27 '22 11:10 eitsupi

@eitsupi Is it OK to use base pipes (|>) that only works with R4.1 or later in vignettes?

Probably not! I've reverted to magrittr pipes now 😁

djnavarro avatar Oct 27 '22 23:10 djnavarro

Moving my checkboxes down here so I can find them! :joy:

  • [x] Reduces the content on the README page to the essential points
  • [x] Rewrites the "get started" page to focus on common tasks and novice users
  • [x] Moves discussion of the Arrow data object hierarchy to a new "data objects" vignette
  • [x] Moves discussion of Arrow data types and conversions to a new "data types" vignette
  • [x] Moves discussion of schemas and storage of R attributes to a new "metadata" vignette
  • [x] Moves discussion of package naming conventions to a new "package conventions" vignette
  • [x] Moves discussion of read/write capabilities to a new "reading and writing data" vignette
  • [x] Moves discussion of the dplyr back end to a new "data wrangling" vignette
  • [x] Edits the "multi-file data sets" vignette to improve readability and to minimize risk of novice users unintentionally downloading the 70GB NYC taxi data by copy/paste errors
  • [x] Minor edits to the "python" vignette to improve readability
  • [x] Minor edits to the "cloud storage" vignette to improve readability
  • [x] Minor edits to the "flight" vignette to improve readability
  • [x] Inserts a new "data object layout" vignette (in the developer vignettes) to bridge between the R documentation and the Arrow specification page

thisisnic avatar Nov 08 '22 12:11 thisisnic

At long last I've moved this out of draft status 😁

djnavarro avatar Nov 22 '22 02:11 djnavarro

okay so we might be ready to merge this? I'm running out of things that I want to try fixing in this iteration of improvements

djnavarro avatar Nov 23 '22 03:11 djnavarro

I've removed the WIP flag from the issue title to run the CI on it.

I'm happy to give the contents a thumbs up and defer any more changes to follow-up tickets.

Before I approve it, I'm going to pull it locally and build it there just to give it a final look over all together.

@nealrichardson - did you want to give this a look over as well?

thisisnic avatar Nov 23 '22 11:11 thisisnic

One thing I noticed, which might be worth a follow up ticket (or maybe I am being too picky) -- in the Articles drop down menu readers can no longer see that there are Developer guides articles. A user needs a second click through More articles... and then scroll a distance. I wonder if the drop down can be tailored a bit to show the categories at first glance?

stephhazlitt avatar Nov 23 '22 19:11 stephhazlitt

I approved this before the CI had run, but there are a few failures relating to linting that need to be sorted before we merge.

thisisnic avatar Nov 23 '22 20:11 thisisnic

@thisisnic I think all the linting issues are now fixed. And yes, I agree this PR is too big to review easily: I'll work harder at splitting into separate PRs in future 🙂

djnavarro avatar Nov 23 '22 21:11 djnavarro

@stephhazlitt Yeah I noticed that issue with the developer vignettes earlier. It would be nice if we could have a menu item that linked to that subsection of the articles page. That way devs would be able to click through with about the same level of ease as they have under the current docs where dev vignettes are tucked into a submenu. I haven't quite worked out how to do that with the bootstrap5 pkgdown templates yet. Tempted to push that to a separate PR 😁

djnavarro avatar Nov 23 '22 21:11 djnavarro

+1 for sure for pushing that into a subsequent ticket/PR @djnavarro. Great, great work on this PR.

stephhazlitt avatar Nov 23 '22 22:11 stephhazlitt

Benchmark runs are scheduled for baseline = 409a95ddc2f1b8aac54612da3658b1f36a734469 and contender = 4afe71030cdd9d3103c7b028082ba63bafdf5d27. 4afe71030cdd9d3103c7b028082ba63bafdf5d27 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2 [Failed :arrow_down:0.13% :arrow_up:4.06%] test-mac-arm [Finished :arrow_down:0.0% :arrow_up:0.0%] ursa-i9-9960x [Finished :arrow_down:0.28% :arrow_up:0.42%] ursa-thinkcentre-m75q Buildkite builds: [Finished] 4afe7103 ec2-t3-xlarge-us-east-2 [Failed] 4afe7103 test-mac-arm [Finished] 4afe7103 ursa-i9-9960x [Finished] 4afe7103 ursa-thinkcentre-m75q [Finished] 409a95dd ec2-t3-xlarge-us-east-2 [Finished] 409a95dd test-mac-arm [Finished] 409a95dd ursa-i9-9960x [Finished] 409a95dd ursa-thinkcentre-m75q Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot avatar Nov 28 '22 21:11 ursabot