[FEATURE]: Pre-process packages available via the DBR without installation
Is there an existing issue for this?
- [X] I have searched the existing issues
Problem statement
The various DBR runtimes include many packages[^1] that are always available and do not need to be installed or declared by notebooks (or jobs): they can simply be used. At present our dependency tracking isn't aware of these.
[^1]: As an example, here is the list of packages for DBR 14.3.
Proposed Solution
The packages included in the various DBR versions should be included in the list of known packages that we maintain.
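A minimal sketch of how per-runtime package lists might be collapsed into one set of names. The function `known_names`, the input layout (runtime version mapped to `{package: version}`) and the sample data are all hypothetical and are not taken from the project or from any real runtime:

```python
def known_names(per_runtime: dict[str, dict[str, str]]) -> list[str]:
    """Return the sorted union of package names across all runtimes.

    Package versions are deliberately ignored; only the names are kept.
    """
    names: set[str] = set()
    for packages in per_runtime.values():
        names.update(name.lower() for name in packages)
    return sorted(names)


# Illustrative data only: the versions below are invented for the example.
sample = {
    "14.3": {"requests": "2.31.0", "six": "1.16.0"},
    "15.2": {"Requests": "2.31.0", "urllib3": "1.26.16"},
}

print(known_names(sample))
```

The resulting names could then be merged into whatever structure the known-packages list uses, which is not shown here.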
Additional Context
The published lists for each DBR version are roughly correct; it turns out that the base OS images used also include some packages. I've scanned most of the currently supported DBR versions (9.1, 10.4, 11.3, 12.2, 13.3, 14.1, 14.2, 14.3, 15.1 & 15.2) and produced this list of installed pip packages and the various versions in use across these runtimes.
Here are all the packages since DBR 9.x - https://github.com/databrickslabs/sandbox/blob/main/runtime-packages/sample-output.txt
We don't really care about specific versions of those packages, at least for now.
> Here are all the packages since DBR 9.x - https://github.com/databrickslabs/sandbox/blob/main/runtime-packages/sample-output.txt
Thanks for that, I wasn't aware of that tool and it looks quite useful.
The lists are roughly the same with a few differences here and there. Some notes:
- I didn't enumerate the ML-runtimes.
- The sandbox list is produced via `pkg_resources.working_set`, with a few things filtered out.
- My version is based on `pip list --format=json`.
- Both seem to miss a few things; the overlap is about 75%.
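For anyone wanting to reproduce this kind of comparison, here is a rough sketch. It assumes the two lists are already available as a `pip list --format=json` string and a collection of names; the helper names and sample data are made up, and "overlap" is taken here to mean intersection over union, which may not be how the 75% figure above was computed:

```python
import json
import re


def normalize(name: str) -> str:
    """Normalize a distribution name following the PEP 503 rule."""
    return re.sub(r"[-_.]+", "-", name).lower()


def names_from_pip_json(text: str) -> set[str]:
    """Extract normalized names from `pip list --format=json` output."""
    return {normalize(entry["name"]) for entry in json.loads(text)}


def overlap(a: set[str], b: set[str]) -> float:
    """Share of names found in both sets, relative to their union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Made-up sample data, not taken from any real runtime.
pip_output = (
    '[{"name": "PyYAML", "version": "6.0"},'
    ' {"name": "requests", "version": "2.31.0"},'
    ' {"name": "typing_extensions", "version": "4.4.0"}]'
)
working_set_names = {normalize(n) for n in ["pyyaml", "requests", "six"]}

pip_names = names_from_pip_json(pip_output)
print(sorted(pip_names ^ working_set_names))  # names seen by only one method
print(overlap(pip_names, working_set_names))
```

Normalizing names first matters because the same package can be spelled differently by the two sources (`typing_extensions` vs `typing-extensions`), which would otherwise understate the overlap.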
I am okay with a "good enough" approach. It would be great to have full coverage of the pre-installed packages, but covering 80% is good enough.
Use `make known` to update `known.json`.