versatile-data-kit
control-service: refactor job cancellation method due to 404 errors
what: Added a new interface method in the job execution repository that retrieves executions by comparing the execution ID, data job name, and data job team. Switched the Spring Data interface methods used in the KubernetesService class's cancelJobExecution method. Fixed a few log calls that had their variables reversed, e.g. they would print the job name instead of the team name and vice versa.
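For illustration, the new repository method could be sketched roughly as below (a sketch assuming Spring Data JPA derived queries; the actual entity, repository and method names in the control-service may differ):

import java.util.Optional;
import org.springframework.data.jpa.repository.JpaRepository;

// Sketch only: DataJobExecution is assumed to have a many-to-one "dataJob"
// association whose DataJob entity exposes "name" and "team" properties.
public interface JobExecutionRepository extends JpaRepository<DataJobExecution, String> {

  // Derived query: matches the execution ID *and* the owning job's name and team,
  // so an execution belonging to another team/job is simply not found.
  Optional<DataJobExecution> findDataJobExecutionByIdAndDataJobNameAndDataJobTeam(
      String executionId, String jobName, String teamName);
}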
why: The previous implementation compared only the execution ID, allowing users to cancel job executions of another team/job by providing the correct execution ID. Moreover, the default method produced a query with INNER JOIN data_job datajob1_ on the data_job table, which in some cases returned 404 Not Found errors when attempting to cancel job executions. After debugging, we noticed that the listExecutions method is not affected by this pitfall: its query uses a LEFT OUTER JOIN, which the new method also uses. Below, for comparison, is the query produced by the old findById method, followed by the query produced by the new method:
SELECT datajobexe0_.id AS id1_1_0_,
datajobexe0_.job_name AS job_nam18_1_0_,
datajobexe0_.end_time AS end_time2_1_0_,
datajobexe0_.job_schedule AS job_sche3_1_0_,
datajobexe0_.job_version AS job_vers4_1_0_,
datajobexe0_.last_deployed_by AS last_dep5_1_0_,
datajobexe0_.last_deployed_date AS last_dep6_1_0_,
datajobexe0_.message AS message7_1_0_,
datajobexe0_.op_id AS op_id8_1_0_,
datajobexe0_.resources_cpu_limit AS resource9_1_0_,
datajobexe0_.resources_cpu_request AS resourc10_1_0_,
datajobexe0_.resources_memory_limit AS resourc11_1_0_,
datajobexe0_.resources_memory_request AS resourc12_1_0_,
datajobexe0_.start_time AS start_t13_1_0_,
datajobexe0_.started_by AS started14_1_0_,
datajobexe0_.status AS status15_1_0_,
datajobexe0_.type AS type16_1_0_,
datajobexe0_.vdk_version AS vdk_ver17_1_0_,
datajob1_.NAME AS name1_0_1_,
datajob1_.enabled AS enabled2_0_1_,
datajob1_.db_default_type AS db_defau3_0_1_,
datajob1_.description AS descript4_0_1_,
datajob1_.enable_execution_notifications AS enable_e5_0_1_,
datajob1_.generate_keytab AS generate6_0_1_,
datajob1_.name_deprecated AS name_dep7_0_1_,
datajob1_.notification_delay_period_minutes AS notifica8_0_1_,
datajob1_.notified_on_job_deploy AS notified9_0_1_,
datajob1_.notified_on_job_failure_platform_error AS notifie10_0_1_,
datajob1_.notified_on_job_failure_user_error AS notifie11_0_1_,
datajob1_.notified_on_job_success AS notifie12_0_1_,
datajob1_.schedule AS schedul13_0_1_,
datajob1_.team AS team14_0_1_,
datajob1_.last_execution_duration AS last_ex15_0_1_,
datajob1_.last_execution_end_time AS last_ex16_0_1_,
datajob1_.last_execution_status AS last_ex17_0_1_,
datajob1_.latest_job_deployment_status AS latest_18_0_1_,
datajob1_.latest_job_execution_id AS latest_19_0_1_,
datajob1_.latest_job_termination_status AS latest_20_0_1_
FROM data_job_execution datajobexe0_
INNER JOIN data_job datajob1_
ON datajobexe0_.job_name = datajob1_.NAME
WHERE datajobexe0_.id = ?
And this is the query produced by the new method in the interface:
SELECT datajobexe0_.id AS id1_1_,
datajobexe0_.job_name AS job_nam18_1_,
datajobexe0_.end_time AS end_time2_1_,
datajobexe0_.job_schedule AS job_sche3_1_,
datajobexe0_.job_version AS job_vers4_1_,
datajobexe0_.last_deployed_by AS last_dep5_1_,
datajobexe0_.last_deployed_date AS last_dep6_1_,
datajobexe0_.message AS message7_1_,
datajobexe0_.op_id AS op_id8_1_,
datajobexe0_.resources_cpu_limit AS resource9_1_,
datajobexe0_.resources_cpu_request AS resourc10_1_,
datajobexe0_.resources_memory_limit AS resourc11_1_,
datajobexe0_.resources_memory_request AS resourc12_1_,
datajobexe0_.start_time AS start_t13_1_,
datajobexe0_.started_by AS started14_1_,
datajobexe0_.status AS status15_1_,
datajobexe0_.type AS type16_1_,
datajobexe0_.vdk_version AS vdk_ver17_1_
FROM data_job_execution datajobexe0_
LEFT OUTER JOIN data_job datajob1_
ON ( datajobexe0_.job_name = datajob1_.NAME )
WHERE datajobexe0_.id = ?
AND datajob1_.NAME = ?
AND datajob1_.team = ?
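As a rough sketch (the class, variable, and exception names below are illustrative, not the exact control-service code), the cancelJobExecution call site effectively switches from the old findById lookup to the new derived query:

import java.util.Optional;

public class JobExecutionCancellationSketch {

  private final JobExecutionRepository jobExecutionRepository;

  public JobExecutionCancellationSketch(JobExecutionRepository jobExecutionRepository) {
    this.jobExecutionRepository = jobExecutionRepository;
  }

  DataJobExecution findCancellableExecution(String teamName, String jobName, String executionId) {
    // Old lookup: jobExecutionRepository.findById(executionId) matched on ID alone, so the
    // team/job in the request were never validated, and the generated INNER JOIN could
    // produce spurious 404s. New lookup: ID, job name and team must all match.
    Optional<DataJobExecution> execution =
        jobExecutionRepository.findDataJobExecutionByIdAndDataJobNameAndDataJobTeam(
            executionId, jobName, teamName);

    // A missing result is treated as "not found", so other teams' executions cannot be cancelled.
    return execution.orElseThrow(
        () -> new IllegalStateException( // stand-in for the service's actual "not found" exception
            String.format("Execution '%s' of job '%s' (team '%s') not found",
                executionId, jobName, teamName)));
  }
}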
testing: We already have a unit test that covers job cancellations; no test changes were needed locally and all tests pass.
Signed-off-by: Momchil Zhivkov [email protected]
Tests are missing. I assume you've planned some tests?
No, since we already have a test that covers the cancellation functionality and all tests pass.
Well of course not. One can never have a bug that is covered by our tests. By definition, our tests would have caught it before it became a bug. Clearly, we have a gap in the tests.
We need to reproduce the issue in a test. Otherwise we have no safeguard against regression: the issue could resurface in the future. It also gives us more confidence that the bug fix is working (especially when we are not even sure). In this case, most likely some kind of integration test would be meaningful.
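For illustration, such a regression test could look roughly like this (the endpoint path, test class, and fixture are assumptions for the sketch, not the project's actual test code):

import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.delete;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.status;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.web.servlet.AutoConfigureMockMvc;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.web.servlet.MockMvc;

@SpringBootTest
@AutoConfigureMockMvc
class JobExecutionCancellationIT {

  @Autowired private MockMvc mockMvc;

  @Test
  void cancel_executionOfAnotherTeam_returnsNotFound() throws Exception {
    // Assumes a fixture (omitted here) created execution "exec-1" of job "example-job"
    // owned by "team-a"; cancelling it on behalf of "team-b" must now yield a 404.
    mockMvc
        .perform(delete(
            "/data-jobs/for-team/{team}/jobs/{job}/executions/{id}",
            "team-b", "example-job", "exec-1"))
        .andExpect(status().isNotFound());
  }
}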