awx icon indicating copy to clipboard operation
awx copied to clipboard

Altering events relationship to hosts to increase performance

Open john-westcott-iv opened this issue 2 years ago • 11 comments

SUMMARY

Altering ForeignKey to an integer field for JobEvents. Dropping cascade set null to do nothing on AdHocCommandEvent

There were reported issues around cascading on lockups and DB issues when deleting a host that was related to many events. This used to be a ForeignKey constraint that was "lost" at the DB layer when we partitioned the table. However, there was still a Django action to set the event hosts to null on an inventory delete.

This change should help prevent issue like that because there will be no cascading or setting null. Instead, we will just let the DB be referentially inconsistent.

Because of the lack of foreign key, in order to keep the job_event detail page to have the host in its relations we had to add a lookup in the serializer and remove the related field from the access model as well as some other changes.

ISSUE TYPE
  • New or Enhanced Feature
COMPONENT NAME
  • API
AWX VERSION
awx: 21.1.1.dev135+gdda69ca2d7.d20220630
ADDITIONAL INFORMATION

john-westcott-iv avatar Jun 30 '22 20:06 john-westcott-iv

May address #12277 #12176 #10337

john-westcott-iv avatar Jun 30 '22 20:06 john-westcott-iv

The failed api-schema is because host has changed from (id) to (integer).

john-westcott-iv avatar Jul 01 '22 12:07 john-westcott-iv

One thing that might be useful here is that you can change some Django settings to print the SQL it runs as it migrates. That might be helpful here, because the event table is massively huge in many deployments, and we want to identify anything that will affect the whole thing. Field re-definitions don't always result a database schema, so it's kind of hard to tell. We probably need to get some estimates of how long it will take as well.

AlanCoding avatar Jul 01 '22 14:07 AlanCoding

Can you make the PR title a little more descriptive? Remember that these end up in our release notes.

shanemcd avatar Jul 01 '22 14:07 shanemcd

It also seems weird that our historical modeling only has the relationship of hosts to ad hoc commands through the events table.

Why would this seem weird? We don't parse the provided limit field, thus, we only know experimentally whether an ad hoc command will operate on a host. This is kind of true for jobs too, but those are linked by doing extra work at the end of a job, creating the JobHostSummary objects. If an ad hoc command can't loop (which makes sense), then the extra intermediary of host summaries should be unnecessary, so this all makes sense to me.

Also, this PR has been changed substantially since the last round of comments. The db_constraint approach is interesting.

https://docs.djangoproject.com/en/4.0/ref/models/fields/

ForeignKey.db_constraint

Controls whether or not a constraint should be created in the database for this foreign key. The default is True, and that’s almost certainly what you want; setting this to False can be very bad for data integrity. That said, here are some scenarios where you might want to do this:

  • You have legacy data that is not valid.
  • You’re sharding your database.

If this is set to False, accessing a related object that doesn’t exist will raise its DoesNotExist exception.

At a quick reading, those exceptions seem plausibly valid for our situation, and could be promising. We should look for some confirmation this is working from integration testing.

AlanCoding avatar Jul 06 '22 19:07 AlanCoding



Test summary

660 0 779 0Flakiness 2


Run details

Project AWX - Functional
Status Passed
Commit d01e2649fb
Started Jul 21, 2022 2:36 PM
Ended Jul 21, 2022 4:45 PM
Duration 09:01 💡
OS Linux Debian - 11.3
Browser Chrome 99

View run in Cypress Dashboard ➡️


Flakiness

workflows/workflow-visualizer/workflow-viz-wfjt-node-prompt-and-survey.spec.js Flakiness
1 Workflow node within a workflow > can prompt and survey against a workflow node with empty visualizer and save
schedules/schedule-operations.spec.js Flakiness
1 Schedules Resource- basic search and list functionality > can sort by name

This comment has been generated by cypress-bot as a result of this project's GitHub integration settings. You can manage this integration in this project's settings in the Cypress Dashboard

cypress[bot] avatar Jul 07 '22 21:07 cypress[bot]

Tested and approved by QE Clean yolo runs.

jay-steurer avatar Jul 13 '22 16:07 jay-steurer

@john-westcott-iv one last question, just to make sure. Did you test the case where you run a job, delete a host it touched, and then check that the output still works? If that's good, then I'm totally in favor of this.

AlanCoding avatar Jul 15 '22 15:07 AlanCoding

Oh, sorry, I see that you've already documented that. Looks good.

AlanCoding avatar Jul 15 '22 15:07 AlanCoding

@john-westcott-iv Just an update, I am working on testing this in our Perf and Scale environment with following number of job events in the DB:

  • 10 million
  • 50 million
  • 1 billion

Once I have DB populated, I am doing the migration with this change. So far 10 and 50 million looks fine: It takes about ~6-10 seconds to do the migration:

TASK [ansible.automation_platform_installer.automationcontroller : Migrate the database schema (may take awhile when upgrading).] ***
^[[0;33mchanged: [128.168.128.123] => {"changed": true, "cmd": ["awx-manage", "migrate", "--noinput"], "delta": "0:00:04.226824", "end": "2022-07-13 07:16:23.824669", "msg": "", "rc": 0, "start": "2022-07-13 07:16:19.597845", "stderr": "", "stderr_lines": [], "stdout": "Operations to perform:\n  Apply all migrations: auth, conf, contenttypes, main, oauth2_provider, sessions, sites, social_django, sso, taggit\nRunning migrations:\n  Applying main.0160_alter_schedule_rrule... OK\n  Applying main.0161_unifiedjob_host_status_counts... OK\n  Applying main.0162_alter_unifiedjob_dependent_jobs... OK\n  Applying main.0163_convert_job_tags_to_textfield... OK\n  Applying main.0164_remove_inventorysource_update_on_project_update... OK\n  Applying main.0165_alter_jobevent_host... OK", "stdout_lines": ["Operations to perform:", "  Apply all migrations: auth, conf, contenttypes, main, oauth2_provider, sessions, sites, social_django, sso, taggit", "Running migrations:", "  Applying main.0160_alter_schedule_rrule... OK", "  Applying main.0161_unifiedjob_host_status_counts... OK", "  Applying main.0162_alter_unifiedjob_dependent_jobs... OK", "  Applying main.0163_convert_job_tags_to_textfield... OK", "  Applying main.0164_remove_inventorysource_update_on_project_update... OK", "  Applying main.0165_alter_jobevent_host... OK"]}^[[0m 

jainnikhil30 avatar Jul 25 '22 13:07 jainnikhil30

When this PR is merged and released ask https://github.com/ansible/awx/issues/12520 to test.

john-westcott-iv avatar Aug 03 '22 15:08 john-westcott-iv

@john-westcott-iv

Seems like we have an issue. This migration doesn't work now.

TASK [ansible.automation_platform_installer.automationcontroller : Migrate the database schema (may take awhile when upgrading).] *** fatal: [128.168.128.214]: FAILED! => {"changed": true, "cmd": ["awx-manage", "migrate", "--noinput"], "delta": "0:00:01.997372", "end": "2022-08-16 06:58:14.104424", "msg": "non-zero return code", "rc": 1, "start": "2022-08-16 06:58:12.107052", "stderr": "CommandError: Conflicting migrations detected; multiple leaf nodes in the migration graph: (0165_alter_jobevent_host, 0165_task_manager_refactor in main).\nTo fix them run 'python manage.py makemigrations --merge'", "stderr_lines": ["CommandError: Conflicting migrations detected; multiple leaf nodes in the migration graph: (0165_alter_jobevent_host, 0165_task_manager_refactor in main).", "To fix them run 'python manage.py makemigrations --merge'"], "stdout": "", "stdout_lines": []}

Looks like there was another 0165 (https://github.com/ansible/tower/blob/devel/awx/main/migrations/0165_task_manager_refactor.py) migration that went into devel. You probably have to rename this to 0166 ?

jainnikhil30 avatar Aug 16 '22 11:08 jainnikhil30

Talking with @jainnikhil30 and he says he's tested with up to 500 million events and that he sees no problem with this

kdelee avatar Aug 16 '22 12:08 kdelee

maybe you need to rebase?

django.db.migrations.exceptions.NodeNotFoundError: Migration main.0166_alter_jobevent_host dependencies reference nonexistent parent node ('main', '0165_task_manager_refactor.py')

kdelee avatar Aug 16 '22 12:08 kdelee

With 500 million following is what we get:

Running migrations: Applying main.0165_task_manager_refactor... OK (0.359s) Applying main.0166_alter_jobevent_host... OK (0.212s)

This looks good to me. I am good with merging this.

jainnikhil30 avatar Aug 16 '22 12:08 jainnikhil30