Facade is not picking up jobs until Augur is restarted following core collection
This has been the case for years, I believe... and is probably something that should have an issue open if we don't already have one.
- @sgoggins in slack
To me it sounds like the repro steps for this are:
- start Augur fresh
- add a new repo via the UI
- wait for core collection to finish
- observe secondary collection finish, but facade does not start
- restart Augur
- observe facade collect the newly added repo
I believe the expected behavior is that facade runs in parallel with secondary (assuming sufficient available workers).
@sgoggins could you retest this to confirm whether it is still true?
@MoralCode I use v0.90.3, and it is still the same. This behavior is not kind to newcomers. When I first hit this problem, I thought it was a bug and wanted to raise an issue on GitHub.
Thanks for the info! Can you confirm which repositories you were using, what steps you followed and/or what environment you are running in?
I believe this is actually a symptom of #3319. I posted more details there, but here's what I think is happening:
When facade tasks get orphaned (due to RabbitMQ connection closures), they stay stuck with facade_status = 'Collecting' in the database. The collection monitor counts these when checking capacity and thinks all 30 worker slots are full, so it refuses to schedule any new facade tasks. Core and secondary run fine because they use separate worker pools and aren't affected by the orphaned facade tasks.
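To make that concrete, here's a rough sketch of the capacity check as I understand it (the function name, query, and connection string are my assumptions, not Augur's actual code):

```python
from sqlalchemy import create_engine, text

MAX_FACADE_WORKERS = 30  # the 30 worker slots mentioned above

def facade_slots_available(engine) -> int:
    """Hypothetical version of the monitor's capacity check: every row stuck
    in 'Collecting', orphaned or not, counts against the limit."""
    with engine.connect() as conn:
        in_flight = conn.execute(text(
            "SELECT COUNT(*) FROM augur_operations.collection_status "
            "WHERE facade_status = 'Collecting'"
        )).scalar()
    return MAX_FACADE_WORKERS - in_flight

# Once orphaned rows pin in_flight at 30, this returns 0 indefinitely and no
# new facade work gets scheduled, until a restart resets those rows.
engine = create_engine("postgresql+psycopg2://augur@localhost/augur")  # assumed DSN
print(facade_slots_available(engine))
```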
When we restart Augur, clean_collection_status() runs at startup and resets all the stuck tasks. That's why facade suddenly works after restart.
The collection_status record is created immediately when we add the repo via the UI, so it's not a timing issue with the daily create_collection_status_records task.
Fix: I think we need the same cleanup logic running periodically (not just at startup) to detect and reset tasks stuck in the Collecting state. That should prevent the capacity from being falsely exhausted.
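Something like this Celery beat wiring is what I have in mind (a sketch, not a patch; the task name, reset value, connection string, and schedule interval are all my assumptions):

```python
from celery import Celery
from celery.schedules import crontab
from sqlalchemy import create_engine, text

app = Celery("augur")
engine = create_engine("postgresql+psycopg2://augur@localhost/augur")  # assumed DSN

@app.task(name="clean_stuck_collection_tasks")
def clean_stuck_collection_tasks():
    # Same idea as the startup-time clean_collection_status(): flip rows stuck
    # mid-collection back to a schedulable state. 'Pending' is my guess at the
    # reset value; a real version should also skip rows whose Celery task is
    # still genuinely alive, to avoid clobbering in-flight work.
    with engine.begin() as conn:
        conn.execute(text(
            "UPDATE augur_operations.collection_status "
            "SET facade_status = 'Pending' "
            "WHERE facade_status = 'Collecting'"
        ))

app.conf.beat_schedule = {
    "reset-stuck-facade-tasks": {
        "task": "clean_stuck_collection_tasks",
        "schedule": crontab(minute=0),  # hourly; the interval is a judgment call
    },
}
```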
I disagree that this is related, but I guess that can easily be tested with a clean install.
It sounds like your theory depends on a repo being processed that is large enough to cause the Celery timeout, but to me it sounded like this issue occurs even on a fresh instance (where the default set of repos aren't large enough to trigger #3319).
Issue 2: clone_repos() self-scheduling limitation (fresh install scenario)
If all other facade tasks are dependent on facade_status being set to "Update", then I think you found the core problem in this issue (AND the relevant lines of code).
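For anyone following along, here's a toy sketch of that self-scheduling pattern as I understand it (illustrative only, not Augur's actual code; the helpers and status values are stand-ins):

```python
from celery import Celery

app = Celery("augur")

# In-memory stand-ins for the real database helpers, purely for illustration.
PENDING_CLONES = ["repo-a", "repo-b"]
FACADE_STATUS = {}

@app.task(name="clone_repos")
def clone_repos():
    """Clone one pending repo, mark it 'Update', then re-queue itself.

    If this chain never starts (or dies before setting 'Update'), every
    downstream facade task that gates on facade_status == 'Update' stays
    blocked; that is the stall described above.
    """
    if not PENDING_CLONES:
        return  # no work left; the self-scheduling chain ends here
    repo = PENDING_CLONES.pop(0)
    # ... the actual git clone would happen here ...
    FACADE_STATUS[repo] = "Update"  # downstream facade tasks gate on this value
    clone_repos.apply_async()       # self-schedule the next iteration
```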
@sgoggins do you think this is an accurate diagnosis of the issue?
I discussed this a bit with cali, and we noticed that some repos may have null data in the commit_sum column of augur_operations.collection_status, which may somehow be a reason why facade is not running?
This isn't a perfectly clean test scenario, but I figured I'd mention it.
There seems to be some weirdness with this field: I think it was associated with the old weight-based scheduling system, which has since been commented out as a scheduled task but may still be getting called somewhere else, causing commit_sum to be updated.
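If anyone wants to check their own instance for those rows, something like this should work (a sketch; the connection string is an assumption):

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://augur@localhost/augur")  # assumed DSN

# List repos whose commit_sum is NULL, per the observation above.
with engine.connect() as conn:
    rows = conn.execute(text(
        "SELECT repo_id, facade_status, commit_sum "
        "FROM augur_operations.collection_status "
        "WHERE commit_sum IS NULL"
    )).fetchall()

for repo_id, facade_status, commit_sum in rows:
    print(repo_id, facade_status, commit_sum)
```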