Semantic versioning?!
pyarrow increases its major version very frequently. A common practice for libraries is to use semantic versioning (sketched briefly below), so that:
- an increase of the major version means breaking changes
- an increase of the minor version means new functionality (so downgrading a minor version is a breaking change)
- an increase of the patch version means no breaking change and no new functionality, e.g. behind-the-scenes improvements, bug fixes, ...
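Here is a minimal sketch of what those rules imply for dependency ranges, using the packaging library (the same version/specifier logic pip builds on); the version numbers are purely illustrative.

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# Under semantic versioning, "any 1.x release at least 1.2.0" is considered a
# safe range, because only a major bump is allowed to break the API.
semver_range = SpecifierSet(">=1.2.0,<2.0.0")

print(Version("1.4.1") in semver_range)  # True  -> minor/patch upgrades are safe
print(Version("2.0.0") in semver_range)  # False -> a major bump may break the API
```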
Since pyarrow is a library used by different libraries, each with its own version limitations, using pyarrow is problematic: it is impossible to tell whether the different libraries will work together, e.g. one was tested with pyarrow 1.0.0, another with pyarrow 5.0.0, and pyarrow is now at 8.0.0.
Is it possible to start using semantic versioning for this project and for older versions to specify which include breaking changes?
Thanks.
We technically do use semantic versioning. However, we release several implementations together (C++, Python, Java, Ruby, R, and probably more I'm missing), and it is generally a given that each release contains a breaking change in at least one of those libraries. Also, libraries like pyarrow have a rather extensive set of functionality, so while the core functionality is stable, there is always a certain amount of experimental API surface.
using pyarrow is problematic: it is impossible to tell whether the different libraries will work together, e.g. one was tested with pyarrow 1.0.0, another with pyarrow 5.0.0, and pyarrow is now at 8.0.0.
Keep in mind that there is a format version in addition to the library version. The Arrow Columnar Format is at version 1.0 (and has been for quite a while). All versions of pyarrow should support format version 1.0. The serialization formats themselves have versions (e.g. Parquet, IPC, ...). These versions change rather slowly as well. At the moment all versions of pyarrow should work with all versions of the IPC format.
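As a quick illustration of the format-versus-library distinction, here is a minimal sketch of an Arrow IPC file round trip. Nothing in it is specific to one pyarrow release: a file written this way should be readable by any pyarrow version that supports IPC/columnar format 1.0, which, per the above, should be all of them.

```python
import pyarrow as pa
import pyarrow.ipc as ipc

table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Write the table using the Arrow IPC file format.
with ipc.new_file("example.arrow", table.schema) as writer:
    writer.write_table(table)

# Read it back; any pyarrow release that understands IPC format 1.0 should be
# able to do this, regardless of which library version produced the file.
roundtripped = ipc.open_file("example.arrow").read_all()
assert roundtripped.equals(table)
```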
How important for your use cases is it that the library itself is stable versus the file format? Is your main concern making sure that files produced in one version are usable in another? Or that two apps can send data back and forth? Or are you coming at this as an app developer building on top of pyarrow? In which case, your concern might be that someone who installs your app with pyarrow 7 then has it break when moving to pyarrow 6 or 8?
I do agree there is some friction here. Lots of Python libraries use maximum major version in their dependency lists (e.g. pyarrow<=4) as a standard practice, and will neglect to update them regularly. They expect to get bug fixes and don't care about new features, but they end up getting neither from us. I recently talked to someone who encountered this with the azureml libraries.
It's possible that pyarrow is mature enough that there is some significant section of the API we could declare as "stable" (leaving a few pieces as "experimental") and start versioning it independently. It might be an interesting conversation for the ML.
They expect to get bug fixes and don't care about new features
There's a separate issue too which is whether we support bug fixes on older versions.
For both issues (slower Python versioning and backporting fixes) the answer will involve more Python development work.
I do not use pyarrow directly, but only via libraries.
- I had to write a large compatibility test to make sure they all work together on EMR in order to be able to update one (as we run a series of applications on EMR, and the 3rd party and internal dependencies are installed once on EMR cluster creation); a rough sketch of this kind of check appears after this list.
- Luckily it passed. In my case I'm using pyspark and snowflake, where pyspark specifies only a minimum version (but is not tested with any newer one), and snowflake pins an almost exact version, i.e. a.b.0<=ver<a.b+1.0, which means it doesn't play nice with any other 3rd party.
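For what it's worth, here is a rough sketch of the kind of compatibility check mentioned above: it looks up each installed library's declared requirements and verifies that the installed pyarrow satisfies all of them. The package names are only examples of what might be on such a cluster; adjust them to your environment, and note that extras/environment markers are not evaluated here.

```python
from importlib.metadata import requires, version
from packaging.requirements import Requirement
from packaging.version import Version

# Example packages whose pyarrow constraints we want to cross-check
# (adjust to whatever is actually installed on the cluster).
packages_to_check = ["pyspark", "snowflake-connector-python"]

installed_pyarrow = Version(version("pyarrow"))

for pkg in packages_to_check:
    for req_string in requires(pkg) or []:
        req = Requirement(req_string)
        if req.name.lower() != "pyarrow":
            continue
        # Environment markers / extras on the requirement are ignored here.
        ok = installed_pyarrow in req.specifier
        status = "OK" if ok else "CONFLICT"
        print(f"{pkg} requires pyarrow{req.specifier}: installed {installed_pyarrow} -> {status}")
```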
Why wouldn't each language-library have its own version? Also, it seems strange that there is at least one breaking change so frequently. If this has to be the case (breaking versions), I would split the project into 2 different libraries (each an artifact and namespace of its own) to prevent a library nightmare like protobuf 1 vs protobuf 2, which in Java requires shading. Still, as a library developer myself, I rarely break the API, except at the beginning when I'm still stabilizing it.
Another solution is to break up the artifacts into smaller pieces, so that only small parts that aren't usually used by 3rd party libraries break. This, of course, would itself be a breaking change, so if you do it, please use a different artifact and namespace/package name for the change.
Why wouldn't each language-library have its own version?
Partly because the code bases are coupled; PyArrow, the R package, and the Ruby package are all based on the C++ library, so we keep them at the same version.
But I think the other reason is the release process is quite rigorous, and it's easiest to verify all packages in one go, rather than have a separate release / verification process for each package.
Finally, PyArrow both contains stable features and some very experimental features that change a lot. I think the major version changes generally represent changes in the experimental features. If something like ARROW-8518 were done to allow us to keep the experimental features in a separate optional module, then I think more stable versioning of the core PyArrow functionality would be easier.
I do agree there is some friction here. Lots of Python libraries use maximum major version in their dependency lists (e.g. pyarrow<=4) as a standard practice, and will neglect to update them regularly. They expect to get bug fixes and don't care about new features, but they end up getting neither from us. I recently talked to someone who encountered this with the azureml libraries.
Maybe this is the issue that came to mind (https://github.com/Azure/azure-sdk-for-python/issues/24644)?