dbx
[REFACTORING] Moving towards v0.7.x
Context
During the first year of dbx development, many poor design decisions were made. These mistakes are tied to specific components of the project, which are tracked in the checklist below and described in detail afterwards.
Checklist
- [x] remove option to have a non-strict adjustment policy
- [ ] refactor configuration-oriented code and delete env variables support in YAML/JSON
- [ ] refactor configuration-oriented code and introduce proper full-scale support for Jinja2-based definitions
- [ ] Get rid of Jobs API 2.0 support and fully switch to Jobs 2.1
- [ ] Redesign deployment/launch logic for Jobs 2.1
- [ ] Redesign execute logic for Jobs 2.1
- [ ] Get rid of old principle of supporting Python dependencies and move solely to .whl-based dependency management
- [ ] Drop the separate permissions functionality
- [ ] Global docs update
- [ ] Add Scala template
- [ ] Add R template
Changes to be made
Path adjustment policy
- In the beginning, path adjustment was simply based on the project directory, which led to inconsistencies and strange errors. In release v0.7.x, the non-strict path adjustment policy will be removed.
- FUSE modification shall also be supported.
Configuration inconsistencies
- The concept of supercharging YAML and JSON with env variables was a BAD design idea, since such files cannot be reused with other YAML or JSON parsers. Therefore, we shall abandon this functionality in v0.7.x.
- Full-scale, proper support for Jinja2 templating in YAML and JSON shall be introduced. The current implementation lacks many standard Jinja2 features and needs to be reworked to properly support includes, macros and other standard Jinja2 elements.
- Potentially, HOCON and Jsonnet support shall be introduced.
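The "render first, then parse" idea behind the templating item can be sketched in a few lines. This is only an illustration, not dbx code: it uses the standard-library `string.Template` as a stand-in for Jinja2 so that it runs without extra packages, and the function name `render_then_parse`, the `JOB_NAME` variable and the sample document are assumptions made up for the example.

```python
import json
from string import Template

# Hypothetical deployment snippet; ${JOB_NAME} is a template placeholder
# that is resolved before the JSON parser ever sees the text.
RAW = '{"name": "${JOB_NAME}", "max_retries": 1}'


def render_then_parse(raw: str, variables: dict) -> dict:
    """Render placeholders first, then hand the result to a standard parser."""
    # substitute() raises KeyError if a placeholder has no value
    rendered = Template(raw).substitute(variables)
    return json.loads(rendered)


config = render_then_parse(RAW, {"JOB_NAME": "nightly-etl"})
print(config["name"])
```

Since templating is a separate, explicit step, the rendered text is ordinary JSON (the same applies to YAML) and does not depend on a customized parser.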
Jobs and Permissions API Support
- We shall get rid of Jobs v2.0 API support. Moving forward, all jobs in dbx shall be in the v2.1 format, with at least one task in place. Supporting both APIs is costly in terms of coding time and investment. Using v2.1 alone also gives end users a consistent and understandable set of tools, and eliminates the need to use the Permissions API separately.
- Separate Permissions API support shall be dropped, since this functionality is already in place in Jobs v2.1.
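To make "v2.1 format, with at least one task" concrete, here is a minimal sketch of wrapping a single-task definition into a one-task definition. The helper `to_multitask`, the default task key `main` and the sample values are hypothetical and not part of dbx. The list of task-level fields is deliberately incomplete, and the `access_control_list` entry only illustrates how permissions can sit in the same definition; check the field names against the Jobs API reference before relying on them.

```python
from copy import deepcopy

# Fields that sit at the top level of a single-task definition but belong
# to an individual task in the multi-task format (incomplete on purpose).
TASK_LEVEL_KEYS = (
    "new_cluster",
    "existing_cluster_id",
    "libraries",
    "spark_python_task",
    "notebook_task",
    "spark_jar_task",
    "spark_submit_task",
)


def to_multitask(settings: dict, task_key: str = "main") -> dict:
    """Wrap a single-task job definition into a definition with one task."""
    if "tasks" in settings:
        return deepcopy(settings)  # already in multi-task format
    job = deepcopy(settings)  # never mutate the caller's dict
    task = {"task_key": task_key}
    for key in TASK_LEVEL_KEYS:
        if key in job:
            task[key] = job.pop(key)
    job["tasks"] = [task]
    return job


legacy = {
    "name": "sample-job",
    "existing_cluster_id": "<cluster-id>",
    "spark_python_task": {"python_file": "dbfs:/path/to/entrypoint.py"},
}
converted = to_multitask(legacy)
# Assumption: permissions are declared inline instead of via a separate call.
converted["access_control_list"] = [
    {"user_name": "someone@example.com", "permission_level": "CAN_MANAGE"}
]
print(converted)
```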
Dependency management in Python
- Re-design the way we currently work with dependencies.
- We only support the requirements.txt format, while there are multiple requests for Pipfile, Poetry, etc. This needs to be abstracted and reworked to support most of the major dependency management frameworks.
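One possible shape for such an abstraction is sketched below. Everything here is hypothetical (the `DependencySource` interface, `RequirementsTxt` and `to_pypi_libraries` are names invented for the example, not existing dbx code), and the requirements.txt parsing is intentionally simplified: it drops comments and option lines such as `-r other.txt`, and does not handle URLs that contain `#`.

```python
from typing import List, Protocol


class DependencySource(Protocol):
    """Hypothetical interface: anything that can list requirement specifiers."""

    def requirements(self) -> List[str]: ...


class RequirementsTxt:
    """Reads requirement specifiers from requirements.txt-style text."""

    def __init__(self, text: str) -> None:
        self._text = text

    def requirements(self) -> List[str]:
        # Strip trailing comments and surrounding whitespace from every line
        lines = (line.split("#", 1)[0].strip() for line in self._text.splitlines())
        # Skip blank lines and pip options such as "-r other.txt"
        return [line for line in lines if line and not line.startswith("-")]


def to_pypi_libraries(source: DependencySource) -> List[dict]:
    """Turn requirement specifiers into PyPI-style library entries."""
    return [{"pypi": {"package": req}} for req in source.requirements()]


sample = "requests==2.28.1  # http client\n\n# a comment\n-r other.txt\npyyaml>=6.0\n"
libraries = to_pypi_libraries(RequirementsTxt(sample))
print(libraries)
```

A Pipfile or Poetry reader would only need to provide its own `requirements()` method; the conversion step stays unchanged.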
General design and vocabulary improvements
- We currently use very vague wording for various options (--files-only, --as-run-submit), as well as for jobs, tasks and "jobless" deployments. This needs to be reworked.
- The same applies to deploy, launch and execute. Full-scale support for the various ways of passing parameters shall also be introduced.
- Managing deployments shall be decoupled from MLflow: we shall provide options to choose the artifact location (MLflow-based, DBFS-based or pip-based).
- Passing parameters to execute and launch via the CLI shall also be redesigned. It is especially hard to keep this consistent across different types of tasks and interfaces (execute vs. launch).
Deploy, launch and execute re-design
Sample designs:
Example deployment.yml:
custom:
  ...
environments:
  default:
    workloads:
      - name: ...
        <other workload properties>
      - name: ...
        <other workload properties>
snapshot examples:
# this is not affecting the Job definitions and creates ephemeral job runs
dbx deploy snapshot <workload-name>
dbx launch snapshot <workload-name>
job examples:
# this is a proper job deployment and run
dbx deploy job <workload-name>
dbx launch job <workload-name>
tasks examples:
# this will only execute the given task name
dbx execute <workload-name> --task=<some-task>
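The command layout in the sample designs above can be prototyped as a small parser. This is a sketch only: it uses the standard-library `argparse` so that it runs without extra packages, it is not meant to mirror how dbx implements its CLI, and making `--task` required is an assumption drawn from the single execute example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the proposed deploy/launch/execute layout."""
    parser = argparse.ArgumentParser(prog="dbx")
    commands = parser.add_subparsers(dest="command", required=True)

    # deploy and launch share the same shape: <kind> <workload-name>
    for name in ("deploy", "launch"):
        cmd = commands.add_parser(name)
        cmd.add_argument("kind", choices=["snapshot", "job"])
        cmd.add_argument("workload")

    # execute runs a single task of a workload
    execute = commands.add_parser("execute")
    execute.add_argument("workload")
    execute.add_argument("--task", required=True)  # assumption: always required
    return parser


args = build_parser().parse_args(["deploy", "snapshot", "my-workload"])
print(args.command, args.kind, args.workload)
```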
relevant discussion topic - https://github.com/databrickslabs/dbx/issues/277
@renardeinside really exciting stuff here.
you mentioned test suite is undergoing big refactor?
hi @justinTM , yes, but it's not relevant to v1.0.0 - I'm already refactoring the tests.
@renardeinside
Regarding Configuration inconsistencies with Jinja templating, I found that the recent release v0.6.6 has already introduced --jinja-variables-file. Is it done, or do you still plan to add to or change the behavior in the future v0.7?
hi @copdips , yes, this is already implemented in v0.6.6+
@renardeinside What are the alternatives to using env variables? In our Azure DevOps environment, we depend on different environment variables that are used throughout our jobs and tasks. An example of such an environment variable is the run id of the DevOps run, which allows us to store files for individual and possible concurrent runs in separate places.
hi @mickvanhulst-TomTom ,
please take a look at Jinja2 support doc - there is support for both env and var-variables.