
Error loading Scheduled actions history after migrating to podman

Open · digdilem-work opened this issue 1 year ago · 12 comments

Problem description

I've now test-migrated Uyuni 2024-08 from RPM to Podman.

One problem I’m seeing on the new vm is that every page of the “Scheduled” menu times out, eventually leading to “The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.”

Whilst doing this, two CPUs are at 100% with postgres threads.

These pages still work on the original server, but slowly; it's possible I've extended those timeouts somewhere. The reason seems fairly clear:

In both “Completed Actions” and “Archived Actions”, the ranges go from “1 - 500 of 10,000 (0 selected)” (Web urls /rhn/schedule/CompletedActions.do and ArchivedActions.do)

That’s a suspiciously round figure, so it’s probably more – at least 20,000 logs going back years. I don’t need these logs and would like to remove them – I think that should allow the container version to load those pages without timing out.

The Admin -> Scheduled Tasks options are all completing okay; I had expected to see something in there to manage these logs.

How can I remove all but the past week or so of these logs in postgres, please?

Additionally – how do I stop them piling up in future?

(I did send this to the uyuni-users mailing list but got no reply after a couple of days, so retrying here)

Steps to reproduce

As above

Uyuni version

2024-08-Podman

Uyuni proxy version (if used)

No response

Useful logs

No response

Additional information

No response

digdilem-work avatar Sep 19 '24 10:09 digdilem-work

The postgres logs can be found inside the container (so after running mgrctl term) in /var/lib/pgsql/data/log/.
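For example, a minimal sketch of getting at them (the exact file names under data/log/ will depend on the logging settings):

mgrctl term                                        # open a shell inside the uyuni-server container
ls -lt /var/lib/pgsql/data/log/ | head             # newest log files first
tail -n 100 /var/lib/pgsql/data/log/postgresql-*.log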

mbussolotto avatar Sep 23 '24 10:09 mbussolotto

@mbussolotto Thank you. Those logs do have some errors - notably

2024-10-01 07:11:29.980 BST [unknown] [unknown] [29465]LOG:  could not accept SSL connection: unknown protocol
2024-10-01 07:11:29.980 BST [unknown] [unknown] [29465]HINT:  This may indicate that the client does not support any SSL protocol version between TLSv1.2 and TLSv1.3.

However, these timestamps don't match the issue I'm seeing, which occurs when I manually click on any of the Schedules menu items. The Uyuni webui goes unresponsive for many minutes (at least six) until it eventually fails.

At this time, the CPU is at 100% for postgres:

(screenshot)

So it looks like postgres is simply taking too long to answer this query. The postgres database directory /var/lib/pgsql is 40 GB, which feels very large, and clearly these tables have a LOT of historical information.
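(For reference, a rough sketch of checking which query postgres is stuck on from inside the container; pg_stat_activity is plain postgres, nothing Uyuni-specific:)

mgrctl term
su - postgres
# list non-idle queries, longest-running first
psql -c "SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
           FROM pg_stat_activity
          WHERE state <> 'idle'
          ORDER BY runtime DESC;"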

This works okay pre-migration to podman; responses via the webui take a few seconds (less than 20), so something seems wrong.

I would like to prune that information - how do I do that, please?

digdilem-work avatar Oct 01 '24 13:10 digdilem-work

I restarted the vm since postgres was stuck at high load and tried again.

Webui -> Schedules (Expand) -> Failed Actions

After exactly 60 seconds, the webui fails with

(screenshot of the error message)

Postgres continues using 100% CPU for one core for a long time.

digdilem-work avatar Oct 01 '24 13:10 digdilem-work

The postgres database directory /var/lib/pgsql is 40Gb which feels very large

This might be because of data-pg14: the Uyuni container uses PostgreSQL 16, so during migration the database is upgraded as well (data-pg14 is the backup of the original data). Can you please provide /var/lib/pgsql/data/postgresql.conf from the container (so after mgrctl term)?

mbussolotto avatar Oct 02 '24 07:10 mbussolotto

Thanks. I'm now fully post-migration to container and this is our live system.

Here are those contents, and a du of the db files, which are actually 5 GB smaller than on the old rpm-based system.

uyuni-server:/ # cat /var/lib/pgsql/data/postgresql.conf
max_connections = 400
shared_buffers = 3840MB
dynamic_shared_memory_type = posix
max_wal_size = 4096MB
min_wal_size = 1024MB
log_destination = 'stderr'
logging_collector = on
log_line_prefix = '%m %d %u [%p]'
log_timezone = 'Europe/London'
datestyle = 'iso, mdy'
timezone = 'Europe/London'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
default_text_search_config = 'pg_catalog.english'
effective_cache_size = 11264MB
work_mem = 10MB
maintenance_work_mem = 960MB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
constraint_exclusion = off
wal_level = archive
max_wal_senders = 5
wal_keep_size = 1024
archive_mode = on
archive_command = '/bin/true'
max_locks_per_transaction = 100
standard_conforming_strings = 'off'
bytea_output = 'escape'
password_encryption = scram-sha-256
ssl = on
ssl_cert_file = '/etc/pki/tls/certs/spacewalk.crt'
ssl_key_file = '/etc/pki/tls/private/pg-spacewalk.key'
listen_addresses = '*'
random_page_cost = 4
effective_io_concurrency = 2
uyuni-server:/var/lib/pgsql # du -shc *
4.0K    analyze_new_cluster.sh
20G     data
20G     data-pg13
1.5G    data-pg14
4.0K    delete_old_cluster.sh
4.0K    initlog
40G     total

digdilem-work avatar Oct 02 '24 11:10 digdilem-work

Thanks, maybe that explains the issue: we recently found out that postgresql.conf is not migrated during migration (I'm still looking into it; I'm not sure which mgradm version introduced the problem). If you still have your old system, you can manually migrate your previous postgresql.conf file and see if it solves the issue. Otherwise you can try changing these values:

max_connections = 200
shared_buffers = 7680MB
effective_cache_size = 22528MB
work_mem = 40MB
maintenance_work_mem = 1024MB

EDIT:

If you have a data-pg14 folder in /var/lib/pgsql, you can copy the postgresql.conf from there (this is the original one from your source system).
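A minimal sketch of that manual copy (the host-side uyuni-server systemd unit name is an assumption; restart the container however you normally do):

mgrctl term
cp /var/lib/pgsql/data/postgresql.conf /var/lib/pgsql/data/postgresql.conf.container   # keep the generated file as a backup
cp /var/lib/pgsql/data-pg14/postgresql.conf /var/lib/pgsql/data/postgresql.conf        # restore the original settings
exit
systemctl restart uyuni-server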

mbussolotto avatar Oct 02 '24 12:10 mbussolotto

Thank you for your time in trying to help me.

You were right here in that the conf hadn't transferred - the values were entirely different to the legacy server's.

Unfortunately, changing those values to both yours and the legacy ones didn't help. (I did reboot and check they had persisted)

The Pending/Failed/Completed schedules still cause things to timeout.

Strangely, the Archived table /does/ work, and contains many tens of thousands of entries. I'm currently looping through that in the webui and deleting 10k of these old schedules at a time. They go back several years.

Maybe once they're gone, postgres will have more luck displaying the others.

digdilem-work avatar Oct 02 '24 13:10 digdilem-work

OK, so that's worked for me and I now have access to the scheduled tasks again.

I deleted hundreds of thousands of Archived Actions, 10k at a time. The database was surprisingly quick considering the numbers, and once the first 100k had gone, it got noticeably more responsive. These dated back to early 2020, which I guess is when we moved to Uyuni from Spacewalk.

Once they were all gone, I could then open Failed Actions. There were another 7,810 in there, which I could "Move to archive" and then "delete".

Finally, I started on "Completed Actions". Uyuni was now able to show this page without timing out, and I've started the long process of archiving these, which I will then delete from Archived Actions. I've archived 200k Completeds so far.

Once everything is totally done, I'll try a vacuum to recover some db space.
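(For the vacuum step, something along these lines inside the container should do it - a sketch only; the --full variant takes exclusive locks, so keep it for a quiet window:)

mgrctl term
su - postgres
vacuumdb --all --analyze --verbose    # ordinary vacuum plus fresh planner statistics
# vacuumdb --all --full               # only if you need the disk space back; locks tables while it runs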

So - in short, I have this working again. Thanks for your time.

Learning points:

  1. Almost certainly easier to delete these in the database.
  2. I need to find a way to automatically prune these going forwards. The reason this got so out of hand is that I assumed Uyuni was doing this itself and it clearly wasn't.

digdilem-work avatar Oct 02 '24 14:10 digdilem-work

@admd @mcalmer @rjmateus @cbosdo Besides the migration problem itself, I'm not aware of a mechanism to prune really old actions... maybe it's something interesting that we could implement?

mbussolotto avatar Oct 02 '24 14:10 mbussolotto

@mbussolotto @digdilem-work I'm not aware of any mechanism to archive and delete those actions. But that is a good point, and we could consider developing a clean-up procedure for this kind of thing. We already have one in place to clean older user sessions from the database.

rjmateus avatar Oct 03 '24 13:10 rjmateus

Thank you.

I think it is worth doing. I estimate we had around a million scheduled events accrued over four years, so a scheduled task to truncate them from the database feels like a sensible housekeeping feature.

If it's useful, our ideal would be to keep events for 30 days, but I imagine others have different needs. A configurable expiry-days value would be the best solution.

digdilem-work avatar Oct 04 '24 10:10 digdilem-work

Additional: If a developer is poking around in scheduled events anyway, can I raise the question about whether Archived Events are even needed?

Having spent several hours tidying up this mess on our system, having to "Archive" each event (in batches of up to 10,000) and then move across to the Archived interface and select the same events again to "Delete" them doubled my work.

Each event is already sorted into Pending, Completed and Failed, which is an effective archive anyway. I'm not sure what the benefit is of moving them to another list before getting rid of them.

Ideally (from my perspective), the workflow might go:

  1. Scheduled event appears in "Pending".
  2. Once actioned, it is then moved into either "Completed" or "Failed".
  3. After $variable_days, the event is permanently deleted.

The event is then available, both in those lists and in the per-machine Event History, for $variable_days, which reduces the chance of them mounting up and affecting performance.

digdilem-work avatar Oct 04 '24 10:10 digdilem-work

Hello, this still applies to Uyuni 2024.12.

I had the same issue many years ago and wrote a clean-up script: https://github.com/stdevel/arsa. This should still work with current Uyuni versions after adding the API version.

EDIT: I started refactoring the code here: https://github.com/stdevel/arsa/pull/12

stdevel avatar May 12 '25 11:05 stdevel

That looks like a worthy project, @stdevel - thank you for starting the refactor. I look forward to trying it out.

digdilem-work avatar May 14 '25 11:05 digdilem-work

Thanks for the feedback, @digdilem-work! I did the testing yesterday and it worked for me. If it also works for you and the community is interested, I can also create a PR for uyuni-project/contrib

stdevel avatar May 14 '25 12:05 stdevel

I've just done some very brief testing from my usual remote api client and I'm pleased to say it "mostly works" in that it's correctly counted the completed, failed and archived tasks, and moved some stuff across to archived, but times out after 2 minutes.

First run: INFO:arsa.py:Found 10000 completed, 453 failed and 0 archived tasks (then Ctrl+C'd out)

Second run: INFO:arsa.py:Found 4151 completed, 450 failed and 10000 archived tasks CTRL+C

This was taking a long time so I checked uyuni's host and "java" was at 100%, presumably responding to the above api calls.

I suspect I made this worse by cancelling it, and uyuni kept on doing stuff, so it may have amplified.

(>10k logs is about a week's worth for us; we generate a lot of these messages. I only cleared them out a couple of weeks ago.)

The next run timed out with the error below, probably due to the above heavy load, so I let things calm down:

requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='ata-oxy-uyuni01.domain.com', port=443): Read timed out. (read timeout=120)

Fourth run: INFO:arsa.py:Found 0 completed, 449 failed and 10000 archived tasks. This one timed out again. Again, java was busy on uyuni.

Fifth run: INFO:arsa.py:Found 0 completed, 449 failed and 10000 archived tasks

Again a timeout and heavy java load.

I then jumped onto uyuni and deleted the 1451 archived messages and re-ran arsa.

This time it finished quickly and normally:

INFO:arsa.py:Found 0 completed, 449 failed and 0 archived tasks

So it looks like "a lot" of archived messages causes it to take more than 120s and time out?

Running each time with: ./arsa.py --server=ata-oxy-uyuni01.domain.com -k

Thanks for your work - if it's run more often (daily?) then perhaps the above issues won't be a problem?

digdilem-work avatar May 14 '25 14:05 digdilem-work

Hey @digdilem-work, thanks for your feedback and effort! It looked exactly like this for another system I had my hands on this week - even when done from the UI.

Removing that many actions at the same time seems to cause issues. In the old version of the script I addressed this by splitting them into smaller bunches of 100: https://github.com/stdevel/arsa/blob/master/arsa.py#L135

I think adding this behavior and maybe a customizable timeout could help - what do you think?

stdevel avatar May 15 '25 06:05 stdevel

Certainly batching seems like something that's worth trying.

I ran arsa again this morning, without making any changes.

INFO:arsa.py:Found 1020 completed, 483 failed and 0 archived tasks

It's not timing out now, but it's been like this for 30 minutes and still going. The java process on uyuni is close to 100% for this duration.

I do think there is a core issue with how Uyuni handles these old schedule logs - hence the original reason for raising this. I suspect the issue now lies less with your script and the API than with the action they trigger in Uyuni.

It would be great if more people could test this script, just to rule out something about our install being particularly tardy.

Kind of wish I knew more postgres so I could just expire old logs in the database after N days with a few quick queries...
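(For reference, a read-only sketch from inside the container; rhnAction is the Spacewalk schema's action table, and the database name susemanager is an assumption - check db_name in /etc/rhn/rhn.conf. Deleting rows here directly is risky because of the foreign keys hanging off rhnAction, so the API/spacecmd route is safer:)

mgrctl term
su - postgres
psql susemanager -c "SELECT count(*) FROM rhnAction WHERE created < now() - interval '30 days';"   # how many actions are older than 30 days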

digdilem-work avatar May 15 '25 06:05 digdilem-work

I just implemented parameters for splitting actions into bunches (--bunch-size) and waiting between handling bunches (--bunch-wait). By default, a 30 second wait is added after handling 100 actions. Those parameters worked fine for my test system. But, to be fair, I don't have as many actions as you 😄
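So an invocation along the lines of the one above might look like this (hostname is a placeholder):

./arsa.py --server=uyuni.example.com -k --bunch-size 100 --bunch-wait 30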

I agree, the root cause of this issue should be fixed. I hope someone from the team with database and backend knowledge will find this issue.

stdevel avatar May 15 '25 09:05 stdevel

Hello again.

Thanks for your work on this - it's certainly good to know I'm not the only one with this issue, and it's caused me to look around at the problem again.

And... I think I've found an answer, and it's due to the Spacewalk greybeards of long ago.

It turns out that the old CLI command, spacecmd, has a bunch of functions related to scheduling:

schedule_archivecompleted
schedule_cancel
schedule_deletearchived
schedule_details
schedule_getoutput
schedule_list
schedule_listarchived
schedule_listcompleted
schedule_listfailed
schedule_listpending
schedule_reschedule

These appear to work perfectly with Uyuni, and with some nicely scriptable commands I'm able to move completed schedules to Archive and expire the archived ones based on age. I can list schedules and use a little scripting to automate removal of the other types, and of those pesky "Pending" ones that get stuck. It's even smart enough to batch huge numbers (i.e. do the first 1,000, then the next).
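For example, a rough nightly housekeeping sketch (credentials are placeholders and any per-command arguments are assumptions - check the spacecmd help for each command before relying on it):

# archive everything that has completed, then purge the archive
spacecmd -u admin -p '<password>' -y -- schedule_archivecompleted
spacecmd -u admin -p '<password>' -y -- schedule_deletearchived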

Despite having used Spacewalk and Uyuni for years, and using spacecmd for regular patching, I somehow overlooked these until now, and I think that this will now be my way forwards.

Thanks again, @stdevel

digdilem-work avatar May 21 '25 10:05 digdilem-work