readthedocs.org
readthedocs.org copied to clipboard
Notifications: duplicate messages
Noticed here: https://beta.readthedocs.org/projects/xclim/builds/23494705/
The error messages seem to stack here:
Also not sure why the string interpolation fails for these, I've noticed that in a few spots though.
On the current dashboard there are no notifications though, not sure why.
https://readthedocs.org/projects/xclim/builds/23494705/
cc @humitos I don't know if we've fixed this yet or not
Without jumping too much on this yet:
- for some reason the build is not being reset properly when it's retried. We have code that cleanup all the notifications: https://github.com/readthedocs/readthedocs.org/blob/0a381a7acfc7683755ad0af3ad652c3b2f73fe25/readthedocs/builds/models.py#L1092-L1109
- since the notification is exactly the same, it should be de-duplicated when it's added. However, when checking the APIv2 view (the one used by the builders), it doesn't seem we are using our custom
.add()method that de-duplicates notifications: https://github.com/readthedocs/readthedocs.org/blob/0a381a7acfc7683755ad0af3ad652c3b2f73fe25/readthedocs/api/v2/views/model_views.py#L382-L399
So, working on 2) to call .add() should be the first point to attack here. I suppose we should overwrite the api.v2.NotificationSerializer.save() method to call our custom .add().
On the current dashboard there are no notifications though, not sure why. readthedocs.org/projects/xclim/builds/23494705
I think this is because we implemented permissions on notifications in the last deploy (see https://github.com/readthedocs/readthedocs.org/pull/11117) and there may be a bug on it because those notifications should be shown to all the users, since the build is public.
Also not sure why the string interpolation fails for these, I've noticed that in a few spots though.
In [12]: Notification.objects.filter(message_id=BuildMaxConcurrencyError.LIMIT_REACHED, format_values={}).count()
Out[12]: 4595
In [13]: Notification.objects.filter(message_id=BuildMaxConcurrencyError.LIMIT_REACHED).exclude(format_values={}).count()
Out[13]: 43746
It seems somewhere we are creating this notification without the proper format_values. 10% of them are broken. I suppose there should be a Celery workflow that we are not handling correctly here. I doubt it's this one in particular https://github.com/readthedocs/readthedocs.org/blob/0a381a7acfc7683755ad0af3ad652c3b2f73fe25/readthedocs/projects/tasks/builds.py#L362-L374
This is probably worth a separate issue, but I've noticed these build concurrency notifications on a number of builds too. Shouldn't these notifications clear after the build is successfully started?
I've noticed these in a few spots, but it's hard for me to tell if this should be resolved now or not. The above error is on my local instance on a finished build -- same with the original errors I posted here.
I suppose also these should be warning level notifications? That would at least help make it clear that the build is not failed.
I suppose also these should be warning level notifications? That would at least help make it clear that the build is not failed.
I've done this in https://github.com/readthedocs/readthedocs.org/pull/11196
Shouldn't these notifications clear after the build is successfully started?
Yes, I've mentioned this in 1) from my previous comment https://github.com/readthedocs/readthedocs.org/issues/11131#issuecomment-1956118651
So, working on 2) to call
.add()should be the first point to attack here. I suppose we should overwrite theapi.v2.NotificationSerializer.save()method to call our custom.add().
This is done in https://github.com/readthedocs/readthedocs.org/pull/11197
Now, with the two PRs I've opened I think this issue shouldn't be an issue anymore. However, 1) could be still present sometimes due to network issues/web instance congestion or similar scenarios where the reset API call fails to perform. That said, I'd would move forward with what we have keeping the issue open for now in case we continue seeing it but I wouldn't invest too much time on digging further on debugging for now.
Yes, I've mentioned this in 1) from my previous comment https://github.com/readthedocs/readthedocs.org/issues/11131#issuecomment-1956118651
That comment implied to me that the messages would only be deduplicated with the fix -- ie:
since the notification is exactly the same, it should be de-duplicated when it's added.
Are you saying that this fix is also what will dismiss the notification once the build has started?
To clarify, what I would expect here is that while the build is still queued that warning message should show, but once the build is no longer queued due to concurrency limits the message will disappear.
what I would expect here is that while the build is still queued that warning message should show, but once the build is no longer queued due to concurrency limits the message will disappear.
Yes, this is how it should work.
However, even with all these fixes... the scenario I mentioned before, may be still present on some edge cases 👇🏼
However, 1) could be still present sometimes due to network issues/web instance congestion or similar scenarios where the
resetAPI call fails to perform.
I found this issue in https://readthedocs.org/projects/docs/builds/23722719/ as well. This was a recent build in our own documentation. I still don't understand why this is happening, but it's worth investigating it a little more.
We should probably run a small Python code from Django shell to remove these notifications from builds that are in finished state.
This would be Python code we could run in production to remove these invalid notifications from builds:
from readthedocs.builds.constants import (
BUILD_STATE_FINISHED,
BUILD_STATE_CANCELLED,
)
from readthedocs.doc_builder.exceptions import BuildMaxConcurrencyError
for n in Notification.objects.filter(message_id=BuildMaxConcurrencyError.LIMIT_REACHED).prefetch_related("attached_to"):
if n.attached_to.state in (BUILD_STATE_FINISHED, BUILD_STATE_CANCELLED):
n.delete()
I think I found why we are not resetting the build in some cases:
https://github.com/readthedocs/readthedocs.org/blob/a0526ff7959f2fceb4e69512312dff4f5d7f8096/readthedocs/projects/tasks/builds.py#L449-L453
We are only resetting it if has at least one command already. We should remove that restriction 👍🏼
The duplicated "concurrecy limit" notifications deleted and they should not appear again in new builds. As example, the old build linked in this issue doesn't show it anymore: https://beta.readthedocs.org/projects/xclim/builds/23494705/