
control-service: refactor job cancellation method due to 404 errors

Open mrMoZ1 opened this issue 2 years ago • 2 comments

what: Added a new interface method in the job execution repository that retrieves executions by comparing the execution ID, data job name and data job team. Switched the Spring Data interface method used in the KubernetesService class's cancelJobExecution method accordingly. Also fixed a few log calls that had their arguments reversed, e.g. they would print the job name instead of the team name and vice versa.
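For context, here is a minimal sketch of what such a derived-query method could look like. The repository, entity and association names (JobExecutionRepository, DataJobExecution, a dataJob association exposing name and team) are assumptions for illustration rather than the exact production code:

import java.util.Optional;
import org.springframework.data.jpa.repository.JpaRepository;

// Sketch only: DataJobExecution is assumed to be the JPA entity behind the
// data_job_execution table, with a "dataJob" association exposing "name" and "team".
public interface JobExecutionRepository extends JpaRepository<DataJobExecution, String> {

   // Derived query: Spring Data resolves the execution by its id AND by the
   // owning data job's name and team, instead of by the id alone.
   Optional<DataJobExecution> findByIdAndDataJobNameAndDataJobTeam(
         String executionId, String jobName, String teamName);
}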

why: The previous implementation compared only the execution ID, allowing users to cancel job executions belonging to another team/job simply by providing the correct execution ID. Moreover, the default findById method produced a query with an INNER JOIN data_job datajob1_ on the DataJob table, which in some cases returned 404 Not Found errors when attempting to cancel job executions. After debugging we noticed that the listExecutions method is not affected by this pitfall and its query uses a LEFT OUTER JOIN, which the new method also uses. Below is the query produced by the old findById method, followed by the new method's query for comparison:

SELECT datajobexe0_.id                                  AS id1_1_0_,
       datajobexe0_.job_name                            AS job_nam18_1_0_,
       datajobexe0_.end_time                            AS end_time2_1_0_,
       datajobexe0_.job_schedule                        AS job_sche3_1_0_,
       datajobexe0_.job_version                         AS job_vers4_1_0_,
       datajobexe0_.last_deployed_by                    AS last_dep5_1_0_,
       datajobexe0_.last_deployed_date                  AS last_dep6_1_0_,
       datajobexe0_.message                             AS message7_1_0_,
       datajobexe0_.op_id                               AS op_id8_1_0_,
       datajobexe0_.resources_cpu_limit                 AS resource9_1_0_,
       datajobexe0_.resources_cpu_request               AS resourc10_1_0_,
       datajobexe0_.resources_memory_limit              AS resourc11_1_0_,
       datajobexe0_.resources_memory_request            AS resourc12_1_0_,
       datajobexe0_.start_time                          AS start_t13_1_0_,
       datajobexe0_.started_by                          AS started14_1_0_,
       datajobexe0_.status                              AS status15_1_0_,
       datajobexe0_.type                                AS type16_1_0_,
       datajobexe0_.vdk_version                         AS vdk_ver17_1_0_,
       datajob1_.NAME                                   AS name1_0_1_,
       datajob1_.enabled                                AS enabled2_0_1_,
       datajob1_.db_default_type                        AS db_defau3_0_1_,
       datajob1_.description                            AS descript4_0_1_,
       datajob1_.enable_execution_notifications         AS enable_e5_0_1_,
       datajob1_.generate_keytab                        AS generate6_0_1_,
       datajob1_.name_deprecated                        AS name_dep7_0_1_,
       datajob1_.notification_delay_period_minutes      AS notifica8_0_1_,
       datajob1_.notified_on_job_deploy                 AS notified9_0_1_,
       datajob1_.notified_on_job_failure_platform_error AS notifie10_0_1_,
       datajob1_.notified_on_job_failure_user_error     AS notifie11_0_1_,
       datajob1_.notified_on_job_success                AS notifie12_0_1_,
       datajob1_.schedule                               AS schedul13_0_1_,
       datajob1_.team                                   AS team14_0_1_,
       datajob1_.last_execution_duration                AS last_ex15_0_1_,
       datajob1_.last_execution_end_time                AS last_ex16_0_1_,
       datajob1_.last_execution_status                  AS last_ex17_0_1_,
       datajob1_.latest_job_deployment_status           AS latest_18_0_1_,
       datajob1_.latest_job_execution_id                AS latest_19_0_1_,
       datajob1_.latest_job_termination_status          AS latest_20_0_1_
FROM   data_job_execution datajobexe0_
       INNER JOIN data_job datajob1_
               ON datajobexe0_.job_name = datajob1_.NAME
WHERE  datajobexe0_.id = ?

And this is the query produced by the new method in the interface:

SELECT datajobexe0_.id                       AS id1_1_,
       datajobexe0_.job_name                 AS job_nam18_1_,
       datajobexe0_.end_time                 AS end_time2_1_,
       datajobexe0_.job_schedule             AS job_sche3_1_,
       datajobexe0_.job_version              AS job_vers4_1_,
       datajobexe0_.last_deployed_by         AS last_dep5_1_,
       datajobexe0_.last_deployed_date       AS last_dep6_1_,
       datajobexe0_.message                  AS message7_1_,
       datajobexe0_.op_id                    AS op_id8_1_,
       datajobexe0_.resources_cpu_limit      AS resource9_1_,
       datajobexe0_.resources_cpu_request    AS resourc10_1_,
       datajobexe0_.resources_memory_limit   AS resourc11_1_,
       datajobexe0_.resources_memory_request AS resourc12_1_,
       datajobexe0_.start_time               AS start_t13_1_,
       datajobexe0_.started_by               AS started14_1_,
       datajobexe0_.status                   AS status15_1_,
       datajobexe0_.type                     AS type16_1_,
       datajobexe0_.vdk_version              AS vdk_ver17_1_
FROM   data_job_execution datajobexe0_
       LEFT OUTER JOIN data_job datajob1_
                    ON ( datajobexe0_.job_name = datajob1_.NAME )
WHERE  datajobexe0_.id = ?
       AND datajob1_.NAME = ?
       AND datajob1_.team = ?  
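For illustration, here is a hedged sketch of how the cancellation path could switch from the id-only lookup to the team-aware lookup. The class name, field names and error handling below are assumptions, not the exact production signatures:

import java.util.Optional;

public class JobExecutionCancellationService {   // illustrative class name

   private final JobExecutionRepository jobExecutionRepository;   // illustrative field

   public JobExecutionCancellationService(JobExecutionRepository jobExecutionRepository) {
      this.jobExecutionRepository = jobExecutionRepository;
   }

   public void cancelJobExecution(String teamName, String jobName, String executionId) {
      // Old behaviour (for contrast): matched on the execution id only, so a caller
      // could cancel an execution belonging to a different team/job:
      // Optional<DataJobExecution> execution = jobExecutionRepository.findById(executionId);

      // New behaviour: the execution must also belong to the given job and team;
      // the generated SQL uses the LEFT OUTER JOIN shown above.
      Optional<DataJobExecution> execution =
            jobExecutionRepository.findByIdAndDataJobNameAndDataJobTeam(executionId, jobName, teamName);

      if (execution.isEmpty()) {
         // Illustrative error handling: surface "not found" to the caller.
         throw new IllegalArgumentException(
               "Execution " + executionId + " not found for job " + jobName + " of team " + teamName);
      }

      // ... proceed with the Kubernetes-level cancellation of the execution ...
   }
}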

testing: We already have a unit test that covers job cancellation; it required no changes and passes locally.

Signed-off-by: Momchil Zhivkov [email protected]

mrMoZ1 · Aug 29 '22 12:08

Tests are missing. I assume you've planned some tests?

No, since we already have a test that covers the cancellation functionality and all tests pass.

mrMoZ1 · Aug 30 '22 11:08

No, since we already have a test that covers the cancellation functionality and all tests pass.

Well of course not. One can never have a bug that is covered by our tests. By definition, our tests would have caught it before it became a bug. Clearly, we have a gap in the tests.

We need to reproduce the issue in a test. Otherwise we have no insurance against regression, i.e. no guarantee that the issue will not resurface in the future. It also gives us more confidence that the bug fix is working (especially when we are not even sure). In this case, most likely some kind of integration test would be meaningful.
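As a sketch of the kind of regression test being requested, a repository-level test along these lines could reproduce the original hole. It assumes a Spring Boot @DataJpaTest slice and reuses the illustrative repository and entity names from above; the newDataJob and newExecution helpers and the JobsRepository name are hypothetical:

import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.Optional;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.orm.jpa.DataJpaTest;

@DataJpaTest
class JobExecutionCancellationRepositoryTest {

   @Autowired private JobsRepository jobsRepository;               // assumed repository for DataJob
   @Autowired private JobExecutionRepository executionRepository;  // illustrative repository from above

   @Test
   void cancellationLookupRejectsWrongTeam() {
      // newDataJob / newExecution are hypothetical test helpers that build the entities.
      DataJob job = jobsRepository.save(newDataJob("team-a", "job-a"));
      executionRepository.save(newExecution("exec-1", job));

      // Correct team and job: the execution is found and can be cancelled.
      Optional<DataJobExecution> found =
            executionRepository.findByIdAndDataJobNameAndDataJobTeam("exec-1", "job-a", "team-a");
      assertTrue(found.isPresent());

      // Wrong team: the execution must NOT be found, closing the original hole.
      Optional<DataJobExecution> notFound =
            executionRepository.findByIdAndDataJobNameAndDataJobTeam("exec-1", "job-a", "team-b");
      assertTrue(notFound.isEmpty());
   }
}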

antoniivanov · Aug 30 '22 12:08