cudf icon indicating copy to clipboard operation
cudf copied to clipboard

Define a clear and enforceable standard file layout for cuDF Python

Open vyasr opened this issue 2 years ago • 7 comments

Is your feature request related to a problem? Please describe. Aside from the core classes (DataFrame, Series, etc), each of which typically live in eponymous files, the organization of functions in cuDF Python currently leaves quite a bit to be desired. The main offender is the core subpackage, which has a number of functions living in largely arbitrary locations. For instance, we have files algorithms.py and common.py, and it's not clear why pipe was placed in the latter while other functions were placed in the former. We also lack a consistent strategy on when a function gets its own file and when it lives with other functions in a shared module (we are generally OK about one class per file, although even that rule is violated occasionally). This problem also propagates to tests and benchmarks. Tests, in particular, are very disorganized to the point where it's nearly impossible to know which files to look at to determine whether some functionality is tested (cf. #9999). A desire for improved organization has come up in multiple PRs adding developer documentation (see the various PRs contributing to #6481).

Describe the solution you'd like We should come up with a simple, easily enforceable set of rules addressing the following:

  1. What files do classes go into? (I vote one class per file, with the file named for that class)
  2. What files do functions go into? We typically aim to group by functionality, so we likely need to either more subpackages inside core, or (nearly equivalent) we need to decide on a set of named submodules where functions go.
  3. To what extent should (and can) test and benchmark organization mirror the contents of the main package?
  4. How does docs organization fit in? Our documentation is organized to match pandas documentation, which provides a pretty reasonable default organization for us.

The next question is how we enforce these rules. It would be nice if we could find a way to enforce these programatically (ideally with a pre-commit hook). If we use the documentation organization as our baseline, the difficulty here would be that we'd potentially need to do significant rst parsing in order to use this ordering. On the flip side, it would be pretty straightforward to enforce with linting rules.

vyasr avatar Aug 04 '22 21:08 vyasr

The next question is how we enforce these rules. It would be nice if we could find a way to enforce these programatically (ideally with a pre-commit hook).

Do we know of other projects that do something similar?

shwina avatar Aug 04 '22 22:08 shwina

For instance, we have files algorithms.py and common.py, and it's not clear why pipe was placed in the latter while other functions were placed in the former.

Because Pandas :)

>>> pd.core.common.pipe
<function pandas.core.common.pipe(obj, func: 'Callable[..., T] | tuple[Callable[..., T], str]', *args, **kwargs) -> 'T'>

I think it's useful to mimic the way Pandas organizes its public modules. This seems to be the route cuml has taken. Note that because Pandas occasionally makes changes to its public namespaces, we'd have to decide which version of Pandas we want to match (latest?).

That doesn't answer the question about how to organize our internal functions/classes though, and I'd be open to discussions around that.

As for tests/benchmarks, I really think it's useful for them to follow the same layout as the API reference. So tests for DataFrame/Constructor would go in tests/dataframe/constructor.py or tests/dataframe/construction.py.


As for enforcement of any of these rules, I'm worried about over-engineering a solution here. Thus my question above if other projects do similar enforcement for code organization.

shwina avatar Aug 04 '22 22:08 shwina

Hmm so then would we want to move towards matching pandas more closely? For instance, DataFrame is in pd.core.frame.DataFrame, whereas ours is in cudf.core.dataframe.DataFrame. We have an actual Frame class (pandas has NDFrame) so we run into some potential conflicts. How would we want to address that? I think this also circles back to some of the questions that you and @bdice ran into when trying to analyze the public APIs of pandas, namely how we determine what is public and what is not.

If we go with matching pandas for APIs, then for tests/benchmarks I would say we have two options:

  • Follow the same layout as the API reference. That's what we've discussed before.
  • Follow the same layout as the source. Tests go in files that exactly match the source layout.

I have also never seen anything enforcing a file layout and I agree that it might be overengineered unless it's very easy. It would be nice if it were possible, though. @mroeschke suggested this and may have some ideas.

vyasr avatar Aug 05 '22 19:08 vyasr

If we go with matching pandas for APIs, then for tests/benchmarks I would say we have two options:

  • Follow the same layout as the API reference. That's what we've discussed before.
  • Follow the same layout as the source. Tests go in files that exactly match the source layout.

Recognizing that Vyas is on vacation for a few weeks, so it'll be a while before we can pick back up on this discussion, just responding now so I don't forget:

I'm much more in favor of 1. Matching tests with source files 1-1 can lead to unnecessary churn as we frequently move functions/methods across files.

shwina avatar Aug 08 '22 12:08 shwina

Agreed. I think the natural separation of API components is more aligned with the docs than the package structure.

bdice avatar Aug 08 '22 13:08 bdice

+1 as well for aligning the API with the docs as well.

I had thrown out the idea of "enforceable" file layout just noticing in pandas, at least, that having soft conventions in organizing a code base can still make it hard for new and experienced contributors where to place things and causing drift in conventions. Maybe this is a largely unsolved problem in general though. If there's no easily available tool out there today, I definitely agree with the worries about over-engineering a solution.

mroeschke avatar Aug 08 '22 20:08 mroeschke

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Sep 07 '22 21:09 github-actions[bot]