Iterative changes instead of all of them in 2.0 version?
(Moved from https://github.com/wesm/pandas2-design/issues/1)
Disclaimer: I'm not involved in pandas development, so my opinion here is not very informed. Sorry about that. :-/
In my (limited) experience in software development, huge refactors and changes in a library are often more difficult to drive forward and expose a larger bug surface than small, incremental changes. Moreover, releasing new versions with incremental changes helps users test new functionality/changes and allocate enough time to identify and report new bugs.
Could the changes proposed in this document be broken into independent "modules" that can be worked on separately and incrementally? For example:
1. Removing unused/deprecated functionality.
2. Logical/physical storage decoupling.
3. Missing data consistency.
4. Enhanced string handling.
Even though (2) is probably required to be able to solve (3) and (4), we could get (2) done in a 2.0 release, then (3) in 2.1, (4) in 2.2, and so on.
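To make the coupling concrete: today pandas' missing-data behaviour falls out of the physical NumPy storage, which is part of why (3) is hard to fix before (2). A minimal demonstration with current pandas:

```python
import pandas as pd

# NumPy int64 arrays have no native NA value, so a single missing
# entry forces the whole column over to float64 -- the missing-data
# semantics are dictated by the physical storage.
s = pd.Series([1, None, 3])
print(s.dtype)  # float64, even though every present value is an integer
```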
Does that sound reasonable, or do you believe it is better to make one big release with all these changes at once? Are these changes more coupled than I think?
+1. While I am not a pandas developer either, I think what @dukebody is suggesting makes a lot of sense. The only caveat is to foresee which features would be required by some of the dependent modules so that the next steps would be easier to implement. I am especially interested in seeing the logical/physical storage decoupling done as early as possible.
The flip side of this is that a large number of incremental API changes can be more painful to adapt to than adapting to all of them at once. I see pandas 2.0 as similar to Python 3 in that respect. Some changes, such as switching to logical dtypes or distinguishing unicode/bytes, are more painful when done incrementally. These two are actually quite similar -- both involve fundamental changes to the type system.
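To make the Python 3 analogy concrete, here is a rough sketch of the kind of downstream breakage each intermediate type-system state would cause (the logical string dtype below is hypothetical):

```python
import pandas as pd

# Downstream code today is full of physical-dtype checks like this one:
s = pd.Series(["a", "b"])
print(s.dtype == object)  # True: strings are currently stored in NumPy object arrays

# Under a logical "string" dtype (hypothetical here), the same check
# would stop matching. If the type system changes over several releases,
# users have to re-adapt checks like this at every step instead of once.
```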
So I think we do need to change the type system all at once, but other major features should be saved for followup releases (e.g., external C++ API in #3).
Cleaning up unused/deprecated functionality is a bit of a special case -- pandas is already overdue for this, and some of it (e.g., removing Panel) will make rewriting the internals easier by reducing scope, so if the timing works out it also makes sense to do it at the same time.
I'm all for doing this work as incrementally as possible, especially during the initial project bootstrapping phase. However, a few things to note about this:
- The project has been going for over 8 years on an incremental basis, so we have a pretty large hole of technical debt to dig ourselves out of. Even seemingly "minor" changes, like adding a logical array container (which at this time would just be a box around a NumPy array) and corresponding metadata, would be pretty invasive to the Series / DataFrame internals (see the sketch after this list). If I can make a metaphor: think of this part of the code base as a rubber band ball that's been growing larger and larger over time.
- Once we begin to modify the data containers and metadata, we'll be faced with a decision: either gut and remove the BlockManager data structure, or maintain it in its present working state, which is not a trivial task at all. Doing multiple weeks of "throwaway" development on it does not seem like a good use of time, since we've been itching to do away with it for a pretty long time (so any time we sink into it will eventually be discarded).
- As @shoyer said, removing functionality serves in part to reduce the surface area of the internal reimplementation, which will mean shipping sooner with less throwaway development.
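For readers unfamiliar with the internals referenced above, a quick illustration may help: first a peek at how the BlockManager consolidates same-dtype columns (via a non-public attribute, shown purely for illustration), then a deliberately toy sketch of a logical array container. All names in the sketch are illustrative assumptions, not a proposed pandas API.

```python
import numpy as np
import pandas as pd

# Today's internals: the BlockManager consolidates same-dtype columns
# into 2-D blocks. (._data is non-public; peeked at only to illustrate.)
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [1.5, 2.5]})
print(df._data.blocks)  # one int64 block holding a and b, one float64 block for c

# Toy sketch of a logical array container: a thin box around a NumPy
# array that owns the logical dtype and missing-data metadata, so the
# physical storage no longer dictates the semantics.
class LogicalArray:
    def __init__(self, values, logical_dtype, valid=None):
        self.values = np.asarray(values)       # physical storage
        self.logical_dtype = logical_dtype     # e.g. "int64", "string"
        # A separate validity mask means integers need not be upcast
        # to float64 just to represent missing values.
        self.valid = (np.ones(len(self.values), dtype=bool)
                      if valid is None else np.asarray(valid, dtype=bool))

    def isna(self):
        return ~self.valid

    def __repr__(self):
        shown = ", ".join(str(v) if ok else "NA"
                          for v, ok in zip(self.values, self.valid))
        return "LogicalArray([%s], dtype=%s)" % (shown, self.logical_dtype)

arr = LogicalArray([1, 2, 3], "int64", valid=[True, False, True])
print(arr)          # LogicalArray([1, NA, 3], dtype=int64)
print(arr.isna())   # [False  True False]
```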
There are obviously a number of "nice to have" features in the design documents. Right now I would prefer to gather as many ideas and hypothetical requirements as possible so that we can come up with a feasible plan to start executing on. At some point we just have to start coding and learning from the process; I'm not confident we can figure everything out in a waterfall-like design plan.
In the classic project management triangle (https://en.wikipedia.org/wiki/Project_management_triangle), in the interest of keeping costs (time + money) low and the schedule as short as possible, we need to reduce the scope of what we are doing.
What I don't want to happen is:
- Step A (incremental change 1)
- Step B (refactoring to account for change)
- Step C (incremental change 2)
- Step D (refactoring to account for change, more than 25-50% of Step B touched)
- Step E (incremental change 3)
- Step F (refactoring to account for change, more than 25-50% of Step D touched)
At the other end of the spectrum you have:
- Step A (all the changes imaginable)
- Step B (refactoring to account for changes)
(as a real example: Perl 6)
The balance is between the cost of refactoring and the cost of the changes -- if most of the refactoring done in a particular step will need to be revisited in subsequent refactoring steps (possibly with user-visible API breakage due to unanticipated impacts of later changes), that jumps out to me as a red flag.
Another problem I foresee is interdependencies between work streams. For example, work on cleaning up the data representation / missing data and the DataFrame internals will inform the metadata revamp (and possibly result in changes to it).