aleph
FEATURE: Incident communication
Is your feature request related to a problem? Please describe.
Aleph sometimes has issues that degrade user experience, e.g. ingest tasks being processed slower than usual or temporarily inconsistent data. Also, we sometimes need to do maintenance that can result in planned (partial) downtime. Currently, we communicate this to Aleph users via channels like email or Slack. Not everyone receives and reads these messages.
Without an easy way to communicate with our active users, we increase the chance that we'll prevent them from being able to perform the work that they are doing, increase the number of requests that we get for support, or waste users' time as they search for solutions to problems that are not of their own creation.
This is a problem that members of the Aleph community managing their own instances face as well.
Describe the solution you'd like
We would like to implement a message banner that is visible on all pages and to all users that can be used to inform users about incidents and upcoming maintenance. We also want to implement a UI that allows admins to enable and disable the message banner.
- [ ] The message banner should be color coded based on the severity of the incident:
  - Green: Operational / back to normal
  - Blue: Info / upcoming maintenance
  - Orange: Degraded performance
- [ ] There must be an area on every page in Aleph where we can display system status along with a message to users explaining what the system status is
- [ ] It must be possible to set a color for this area, based on a selection, to indicate the importance of the message
- [ ] It must be possible to set a message for this area, something which explains what's going on
- [ ] Setting messages must only be available to Aleph administrators
- [ ] Messages should be able to be plain text or html so that they can include links etc.
- [ ] It would be nice to track previous messages, when they happened, and how important they were
- [ ] It would be nice to cancel messages with a single click
- [ ] It would be nice to communicate situation normal for a fixed period of time after a message is cancelled
Describe alternatives you've considered
We have considered integrating with a separate managed or self-hosted service like Cachet or Statuspage.io. These services meet our requirements and we could easily display an in-app banner when there are incidents using the services’ APIs. I signed up for a trial account with Statuspage and it actually does a lot more than what we need, which also means that the UI is a little less straightforward than it could be for our use case.
Also, this is something that will be helpful for community members managing their own Aleph instances. Using a separate service means that they’d set up and potentially manage another service next to Aleph itself, and configure Aleph to integrate with that service.
In order to keep it simple, we have decided to build this feature directly into Aleph.
Additional context
- Global settings are currently configured with environment variables, i.e. there is no admin settings UI. We’d have to create a new page for this.
- We need to consider users that keep Aleph open in their browsers for a very long time. We might need to poll for new messages at a fixed interval.
- We need to decide whether messages should be dismissable. We do not want them to be too obtrusive. On the other hand, there’s a risk that users dismiss a message without reading it or forget about it.
Example messages
Degraded ingest performance
Processing ingested files currently takes longer than usual. If you have recently uploaded new documents that haven’t been processed completely, please check back later.

Planned downtime on Sun, 2022-07-10, 9-10am CET
We will upgrade our server infrastructure to ensure Aleph can continue to handle increasing workloads. Aleph will be unavailable during this time frame.

Public access temporarily disabled
We have disabled public access to prevent Aleph from being overwhelmed by automated anonymous requests. We will resume providing public access soon.
Implementation
I’m not exactly sure how to name this feature. Notifications and alerts are already taken, but messages might be a little ambiguous.
We probably want to store the following information for each message:
- id
- level
- title
- description
- created at
- resolved at
- (created by)
We need a new API endpoint that allows for basic CRUD operations to create, retrieve, and update messages.
- GET / – List all messages
- (GET /current – Return the most recent, unresolved message)
- POST / – Create a new message (and optionally resolve the current message?)
- PUT /:id – Update an existing message (e.g. to resolve it)
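As a rough illustration of the fields and endpoints listed above, here is a minimal in-memory sketch. It is not Aleph code: `Message`, `MessageStore`, and the method names are hypothetical, and a real implementation would sit behind HTTP routes and a database table.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Message:
    # Fields follow the list proposed above; "created by" is left out here.
    id: int
    level: str
    title: str
    description: str
    created_at: datetime
    resolved_at: Optional[datetime] = None


class MessageStore:
    """Hypothetical in-memory stand-in for the storage behind the endpoints."""

    def __init__(self):
        self._messages = []

    def list_all(self):
        # GET /
        return list(self._messages)

    def current(self):
        # GET /current: most recent message that has not been resolved
        unresolved = [m for m in self._messages if m.resolved_at is None]
        return max(unresolved, key=lambda m: m.created_at, default=None)

    def create(self, level, title, description, now=None):
        # POST /
        message = Message(
            id=len(self._messages) + 1,
            level=level,
            title=title,
            description=description,
            created_at=now or datetime.now(timezone.utc),
        )
        self._messages.append(message)
        return message

    def resolve(self, message_id, now=None):
        # PUT /:id, used here only to mark a message as resolved
        for message in self._messages:
            if message.id == message_id:
                message.resolved_at = now or datetime.now(timezone.utc)
                return message
        raise KeyError(message_id)
```

Whether creating a new message should also resolve the current one is left open here, as in the proposal.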
I like the general idea of having a message banner to communicate with users. I have a couple of points that I would like us to consider:
- One possible downside of this approach of defining app status in the app itself is that if the app is impaired, it may not be possible to define a status message through the UI easily.
- Most of the desired features described above can be achieved through defining the banner message in a setting and exposing that through the metadata endpoint. This will give us a way to communicate with the users without adding a new model. This won't help us track previous messages in the app itself but imo that's not a deal breaker feature.
> Most of the desired features described above can be achieved through defining the banner message in a setting and exposing that through the metadata endpoint.
Wait, do we have global settings in Aleph (that aren’t defined via environment variables)?
> Wait, do we have global settings in Aleph (that aren’t defined via environment variables)?
No we don't. I was talking about environment variable settings only. And a setting like that is already there as ALEPH_APP_BANNER in https://github.com/alephdata/aleph/blob/0772510eb588d29e5bc4ef044686cde4cfe77f40/aleph/settings.py#L39 and https://github.com/alephdata/aleph/blob/0772510eb588d29e5bc4ef044686cde4cfe77f40/ui/src/components/Screen/Screen.jsx#L63.
It looks like this in the UI:

We can tweak the ui presentation and add an additional setting to control the colour code / severity. What do you think?
> No we don't. I was talking about environment variable settings only. And a setting like that is already there as ALEPH_APP_BANNER in

Ah, thanks for clarifying. I might be wrong, but my understanding was that this is too cumbersome and we’d like to have a solution that allows for easy configuration of the message via the UI. @Rosencrantz, can you clarify as you’re probably the larger part of the target audience for this feature?
> One possible downside of this approach of defining app status in the app itself is that if the app is impaired, it may not be possible to define a status message through the UI easily.
Yes, good you’re pointing that out! I had assumed this is tolerable, as unplanned full outages didn't occur that often in the past, while degraded performance is a much more common issue. But we haven’t discussed this explicitly yet.
Yes, that is my thinking. Redeploying is a bit of a pain for just updating a banner, and when you deploy, there is always the possibility that you screw up something and cause a bigger problem than you already had.
One further option beyond those that we've already discussed here might be to make use of a microservice for communicating this information. It could exist separately from Aleph and not be impacted by the kinds of issues that we normally suffer from. Obvious downsides are the complexity of that extra moving piece and how that might impact things in the community.

> One further option beyond those that we've already discussed here might be to make use of a microservice for communicating this information. It could exist separately from Aleph and not be impacted by the kinds of issues that we normally suffer from.
@Rosencrantz Hm, while I like that we wouldn’t add the feature directly to Aleph, we’d have to build it ourselves and it will be more complicated for the community to make use of it. We could also use an off-the-shelf solution then, right?
Here is something I have been thinking about that has the following advantages:
- We do not have to add this feature to Aleph itself.
- We could build a simple UI to create/resolve status messages.
- We wouldn’t have to pay for or manage separate resources for the service.
- Users with their own Aleph instances could simply clone a repository to get their own simple status service.
- We could display status updates not only in-app, but also on a separate HTML page that is accessible even if Aleph is down.
- Set up a separate GitHub repository that basically contains only a YAML file with a list of events/messages:

  ```yaml
  - created_at: 2022-07-25 11:13
    level: degraded
    title: Degraded performance
    message: Processing ingested files currently takes longer than usual. If you have recently uploaded new documents that haven’t been processed completely, please check back later.
  - created_at: 2022-07-01 12:02
    resolved_at: 2022-07-05 14:00
    level: info
    title: Upcoming maintenance
    message: ...
  ...
  ```

- Create a GitHub Actions Workflow that automatically deploys a static JSON file with any unresolved messages to GitHub Pages, Netlify, or a similar service that could then be consumed as an API by the frontend.

Optionally, we could

- … create a simple UI that uses the GitHub API to create/resolve new status messages, or
- … set up an Actions Workflow that creates and resolves status messages whenever you open/close a GitHub issue.
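The step that turns the event list into a static JSON file could look roughly like the sketch below. It assumes the YAML file has already been parsed into a list of dictionaries (for example with a YAML library, which is not shown), and the function names are made up for illustration.

```python
import json


def unresolved_messages(events):
    """Keep only events that have no resolved_at timestamp."""
    return [event for event in events if not event.get("resolved_at")]


def render_status_json(events):
    """Serialize the unresolved events as the JSON document to deploy."""
    return json.dumps(unresolved_messages(events), indent=2)


# Example input mirroring the YAML structure above (timestamps kept as strings).
events = [
    {
        "created_at": "2022-07-25 11:13",
        "level": "degraded",
        "title": "Degraded performance",
        "message": "Processing ingested files currently takes longer than usual.",
    },
    {
        "created_at": "2022-07-01 12:02",
        "resolved_at": "2022-07-05 14:00",
        "level": "info",
        "title": "Upcoming maintenance",
        "message": "...",
    },
]

if __name__ == "__main__":
    print(render_status_json(events))
```

A workflow step would then write this output to a file and publish it; how that is done depends on the hosting service.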
Something like this actually already exists, although it has a lot of additional features we likely wouldn’t need: https://github.com/upptime/upptime
Doesn't seem like a bad idea. Can we spike this for a couple of days and see if we can shake out any significant disadvantages @tillprochaska?
We’ve had a few more discussions about the approach. We like that the only requirement is that there is a JSON file somewhere, but how that JSON file is created is up to the Aleph admin. It could be a file edited by hand, it could be a file generated using GitHub Actions as outlined above, or it could be dynamically generated from a third-party service such as statuspage.io.
@monneyboi suggested that this might not only be used to display information about the system status, but also to inform users about new datasets made available, new features deployed to the instance etc., so we might want to keep the schema of the JSON file generic (also see #2420).
I’ve been working with a JSON file structured like this:
```jsonc
[
  {
    "id": "1",
    "createdAt": "2022-07-27T08:30:02.000Z",
    "updatedAt": "2022-07-27T08:30:02.000Z",
    "displayUntil": null, // optional
    "level": "warning", // info, warning, error, success
    "title": "Degraded ingest performance",
    "body": "Processing ingested files currently takes longer than usual. If you have recently uploaded new documents that haven't been processed completely, please check back later."
  },
  {
    "id": "2",
    "createdAt": "2022-07-27T08:00:29.000Z",
    "updatedAt": "2022-07-27T08:16:25.000Z",
    "displayUntil": "2022-07-28T08:16:25.000Z",
    "level": "success",
    "title": "Resolved: Public access temporarily disabled",
    "body": "We have disabled public access to prevent Aleph from being overwhelmed by automated anonymous requests. We will resume providing public access soon.",
    "updates": [
      {
        "id": "3",
        "createdAt": "2022-07-27T08:16:25.000Z",
        "updatedAt": "2022-07-27T08:16:25.000Z",
        "body": "We have restored public access."
      }
    ]
  }
]
```
Some feedback on that schema from @monneyboi in Slack:
> Have you thought about using "children" instead of updates, and just using the same objects there? Would allow you to re-use components and just do things recursively.

That’s what I started out with. For our use case, it turned out to be a little more complicated to fit data into that schema, and I’m also not sure we’d support rendering children recursively with a separate title, level etc. For example, for now, all we’d probably do is display the body and time of the most recent update. But it might be worth reconsidering.
[Screenshots: Unresolved | Resolved]
> Also, wouldn't it be nicer to just remove one of these things from the JSON instead of using a displayUntil field?
Yes, I agree in general, and you could definitely do that as well. We’d like to display a message for a fixed amount of time after an incident has been resolved.
As we’d like to use GitHub Actions to generate that file, we’d have to run a scheduled workflow at a regular interval to remove resolved issues after that time. It’s not super complicated to do, but it uses up build minutes, scheduled workflows on GitHub Actions are a little unreliable, and checking to ensure that displayUntil is in the future is rather simple to do in the frontend.
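To show how small that check is, here is a sketch of the filtering logic. The real check would live in the JavaScript frontend. Python is used here only to illustrate the logic, and `parse_timestamp` and `visible_messages` are hypothetical names.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11,
    # so it is replaced with an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def visible_messages(messages, now=None):
    """Return messages with no displayUntil, or one that is still in the future."""
    now = now or datetime.now(timezone.utc)
    visible = []
    for message in messages:
        display_until = message.get("displayUntil")
        if display_until is None or parse_timestamp(display_until) > now:
            visible.append(message)
    return visible
```

With this approach the JSON file never needs a scheduled cleanup: expired entries simply stop being rendered.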
> What do you think about using an integer for level?
Honestly, I had not thought about that before. What would be the advantage? The implied priority of numbers?
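If implied priority is the goal, one way to get it while keeping string levels in the JSON is to map them to integers on the consuming side. The sketch below assumes a severity order of success < info < warning < error, which has not been agreed on in this thread. `Level` and `most_severe` are hypothetical names.

```python
from enum import IntEnum


class Level(IntEnum):
    # Assumed ordering, lowest to highest severity.
    SUCCESS = 0
    INFO = 1
    WARNING = 2
    ERROR = 3


def most_severe(messages):
    """Pick the message with the highest level, or None if there are none."""
    return max(messages, key=lambda m: Level[m["level"].upper()], default=None)
```

This keeps the file readable for admins editing it by hand while still allowing comparisons such as `Level.WARNING > Level.INFO`.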

