trino
Improve packaging
The current binary packages provided by Trino suffer from a number of issues:
- RPM packaging is barely used and tested, but creates significant work and has a negative impact on the build time.
- The tarball and Docker container are large and contain all plugins, while typical usage only needs a small subset of the plugins.
- The large artifacts result in longer download and installation times, which also affect deployments.
- Occasionally plugins may contain dependencies with reported security issues, so Trino might be flagged during security analysis by users (rightly or wrongly so).
- Custom package creation is not really supported or documented.
These issues will get worse in the near future with more connectors being merged and also more native binaries for multiple operating systems and processor architectures being required and included.
This roadmap item collects a number of related work tasks that we want to take on. Before this issue was filed, numerous discussions took place on Slack, in smaller conversations, and at the Trino Contributor Calls and Congregation.
Following are a number of sections that detail tasks and ideas. Work on these can be done in parallel.
Pull out RPM ✅
The RPM is rarely used by now and we agree on removing it from the core trino repo. Since it is built from the tarball, however, it is possible to pull the RPM packaging aspects out of the core repo into a separate repository that users can use to build an RPM for any Trino version.
The repository https://github.com/simpligility/trino-packages is a first PoC implementation of this approach. The naming is generic since it can also be used for other package creation in separate modules.
Tasks to implement the removal are:
- Update and move trino-packages repo into trinodb org
- Update docs in repo to explain how to create RPM
- Update Trino website to remove RPM download and just link to docs
- Update docs to link to repo and explain that you must build the RPM yourself first before using it
- In the longer run we can maybe even remove the rpm docs from the trino docs completely and just rely on docs in the trino-packages repo
Figure out plugins for different packages ✅
We need to determine which packages we want to offer and what each package should contain. Following is a first idea:
- Minimal
  - Only contains what is necessary to start Trino and use it
  - So no plugins .. or maybe just memory to allow some initial testing
- Default
  - Only contains commonly used plugins
  - Definitely remove the following: atop, blackhole, example-http, exchange-filesystem, exchange-hdfs, geospatial, http-event-listener, http-server-event-listener, local-file, ml, mysql-event-listener, openlineage, raptor-legacy, teradata-functions, thrift
  - Probably add all lakehouse so hive, hudi, delta-lake, iceberg
  - Some others .. but which
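The package idea above can be sketched as plain set arithmetic over plugin names. This is only an illustration: the function `plugins_for` and the constant names are hypothetical, the exclusion list is copied from the first proposal in this issue and is still under discussion, and nothing here reflects how the Trino build actually selects plugins.

```python
# Plugins the first proposal in this issue wants to drop from the default package.
EXCLUDED_FROM_DEFAULT = frozenset({
    "atop", "blackhole", "example-http", "exchange-filesystem", "exchange-hdfs",
    "geospatial", "http-event-listener", "http-server-event-listener",
    "local-file", "ml", "mysql-event-listener", "openlineage",
    "raptor-legacy", "teradata-functions", "thrift",
})

# Lakehouse connectors the proposal wants to keep in the default package.
LAKEHOUSE = frozenset({"hive", "hudi", "delta-lake", "iceberg"})


def plugins_for(package: str, all_plugins: set[str]) -> set[str]:
    """Return the plugin names a package variant would ship (hypothetical helper)."""
    if package == "minimal":
        # "maybe just memory to allow some initial testing"
        return {"memory"} & all_plugins
    if package == "default":
        # Everything except the exclusion list, with lakehouse connectors kept.
        return (all_plugins - EXCLUDED_FROM_DEFAULT) | (LAKEHOUSE & all_plugins)
    raise ValueError(f"unknown package variant: {package!r}")
```

Keeping the lists as data makes it cheap to revisit individual entries while the open question of which other plugins belong in the default package is settled.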
Also create a docs page in installation for plugins that talks about adding and removing plugins.
Different tarballs - ✅
Once we figure out the desired plugins and archive variants, we should adjust the build and publishing process to publish these and update the docs as well.
We also then need to add docs on how to download and add additional plugins.
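As a starting point for such docs, here is a minimal sketch of adding and removing a plugin in an unpacked tarball. It assumes the usual layout where each plugin lives in its own subdirectory of `plugin/` under the installation directory, and that the plugin has already been downloaded and unpacked. The helper names `add_plugin` and `remove_plugin` are hypothetical, and how plugins would be published for download is not decided in this issue.

```python
import shutil
from pathlib import Path


def add_plugin(plugin_src: Path, trino_home: Path) -> Path:
    """Copy an unpacked plugin directory into <trino_home>/plugin/<name>."""
    target = trino_home / "plugin" / plugin_src.name
    if target.exists():
        raise FileExistsError(f"plugin already installed: {target}")
    shutil.copytree(plugin_src, target)
    return target


def remove_plugin(name: str, trino_home: Path) -> None:
    """Delete <trino_home>/plugin/<name> and everything in it."""
    shutil.rmtree(trino_home / "plugin" / name)
```

A server restart is assumed after either operation, since plugins are loaded at startup.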
Different container images - ✅
Once we figure out the desired plugins and archive variants, we should adjust the build and publishing process to publish these and update the docs as well.
We also then need to add docs for:
- How to download and add additional plugins.
- How to add other packages (not package manager since that is removed...)
- How to create a Docker container with a different base OS from scratch, maybe (could be an example in the packaging repo)
Plugin loading
Over time it might be even better to be able to give a running system a URL or similar pointer to a plugin, and then load that plugin onto the servers and run it. Of course, security concerns and other aspects need to be figured out. The API for these operations could (or maybe even should) be a SQL command, similar to the dynamic catalog management features.
cc @nineinchnick
@mosabua,
Just another devex thing we may consider, at least to track in an issue, is building a web app akin to this Spring Boot one, where a nice interface hits a service to download a tar file with a custom list of plugins.
We could even make it a REST endpoint on a GitHub action that takes in a list of plugins and a Trino version and returns the tar file with just those plugins. Then we can just have minimal with in-memory (possibly some local filesystem if we want to replicate the DuckDB experience).
This isn't quite as dynamic as the first one you suggested, but it is a bit more secure, and if we make the GitHub app a repository, then a company could fork it into their own build system to stay behind a VPN.
@bitsondatadev that is totally also something that's nice to have once we've done the preparatory work in terms of creating a minimal package and other things as proposed in this ticket. I think it's probably within scope here
We cleaned up a few connectors now:
Atop #23550, Local file #23551, Raptor #23588
One thing that came up in #23901 is that some connectors and other plugins add global functions to Trino. If we remove such a plugin as part of the packaging cleanup, those functions are no longer available in that packaged version of Trino. We therefore need to either change the implementation of the plugins, ensure they are always present, or update the docs to explain that the functions depend on a specific plugin.
Plugins should not expose global functions going forward. That approach is legacy, since Trino supports function namespaces.
Can we document using function namespaces in the developer guide?
trino-packages repo now exists and I have access. Will work on PR for RPM package support in it shortly
#23923 shipped in 470
RPM removal merged for Trino 471, so the Trino 470 release is the last RPM that is published to Maven Central. Website update will merge with the release.
The repo https://github.com/trinodb/trino-packages contains the setup to build a working RPM.
Geospatial is basically built-in (including engine support for spatial joins), so I think it should be in the default package.