Add Maintenance UI View / Window
Is your idea related to a problem? Please describe. We need a way to describe which new features have recently been introduced to data.all in the latest releases, and whether there are any warnings, deprecation notices, etc.
Describe the solution you'd like To communicate new features or maintenance actions, add a maintenance UI view or window where we can post messages and info about the current status of the platform, etc.
My proposal to implement this:
- add a settings page which is only visible to data.all administrators
- on this settings page add a button to turn maintenance mode for data.all on/off; store the state in RDS
- when maintenance mode is on, all users except data.all admins see the maintenance window
- during maintenance mode all GraphQL calls are blocked except those needed to make maintenance mode work
- no new cron jobs / periodic jobs should start while maintenance mode is on; no stacks should be updated
- data.all admins only see the settings page while maintenance mode is on and cannot navigate to anything else
Proposal
Phase 1
Maintenance Window
The purpose of the maintenance window is to create a safe environment for the data.all admin to deploy new changes to data.all. Currently, because users can still access the data.all UI while a deployment is happening, there is a chance that a user's actions lead to an improper state (e.g. a user makes some updates on environments while the data.all deployment simultaneously updates the RDS schema, which might leave the environment stack in an improper state). Apart from deployments, there are situations in which data.all admins might want to create a maintenance window for manual updates, etc.
The maintenance window would essentially do the following -
- Provide a UI to initiate maintenance window and show the status of maintenance window.
- Block GraphQL calls depending on the mode:
  - In READ-ONLY mode, block all mutation calls and allow other GraphQL calls.
  - In NO-ACCESS mode, block all calls and show a blank UI.
  - The data.all admin group is exempt: admins are still allowed to make mutation API calls and thus keep full access to data.all.
- Disable scheduled tasks on ECS.
- Monitor any running ECS tasks, wait for them to complete, and show on the UI whether it is okay to start the deployment (see ECS Running Tasks options).
Changes for maintenance window feature
Maintenance Window UI
The maintenance window UI which will be only visible to the data.all admin group could be added in the Admin settings page as another tab.
The UI will show a button to create a maintenance window. When clicked the button will show a pop-up modal to confirm the action and then will start the maintenance window.
Maintenance window frontend mock -
- When maintenance window has not started
- Confirmation Pop-up modal
- Maintenance mode in progress
This UI will display
- Button to start the maintenance window
- Dropdown / toggle to select the mode of maintenance window ( READ-ONLY , NO-ACCESS )
- Refresh button to refresh the status of the maintenance window
- Status showing the maintenance window status ( Check the various states of maintenance window in "Schema Changes" section )
List of tasks performed by maintenance window process
Api Gateway
When in READ-ONLY mode,
GraphQL Calls - Mutation requests will be blocked and query calls will be allowed. For data.all admins, all GraphQL calls will work.
OpenSearch Calls - These calls won't be blocked, as they won't cause any issues if a deployment is carried out in parallel.
When in NO-ACCESS mode,
GraphQL Calls - All GraphQL calls will be blocked. For data.all admins, all GraphQL calls will work.
OpenSearch Calls - All calls to OpenSearch will be blocked except for data.all admins.
This blocking will take place inside the API handler GraphQL Lambda function. Whenever a user tries to perform a restricted action, the call will fail with an error message like: Access Restricted: data.all is currently undergoing maintenance, and your actions are temporarily blocked.
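The blocking rules above can be sketched as a small decision function. This is only an illustration, not the actual data.all handler code: the function names, the `ADMIN_GROUP` value, and the idea of passing the operation type in as a string are assumptions made for the example.

```python
from typing import Iterable

MAINTENANCE_ERROR = (
    "Access Restricted: data.all is currently undergoing maintenance, "
    "and your actions are temporarily blocked."
)

# Hypothetical admin group name; the real value depends on the deployment.
ADMIN_GROUP = "DAAdministrators"


def is_blocked(active: bool, mode: str, operation: str, groups: Iterable[str]) -> bool:
    """Return True if a GraphQL call should be rejected during maintenance."""
    if not active:
        return False  # no maintenance window: nothing is blocked
    if ADMIN_GROUP in groups:
        return False  # data.all admins keep full access in both modes
    if mode == "NO-ACCESS":
        return True  # queries and mutations are both blocked
    # READ-ONLY: queries pass, mutations are blocked
    return operation == "mutation"


def check_request(active: bool, mode: str, operation: str, groups: Iterable[str]) -> None:
    """Raise with the agreed error message when the call is not allowed."""
    if is_blocked(active, mode, operation, groups):
        raise PermissionError(MAINTENANCE_ERROR)
```

In a real handler the `active` and `mode` values would come from the maintenance record, and the operation type would be derived from the incoming GraphQL document rather than passed in by the caller.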
Long Running Tasks
ECS Tasks - All scheduled ECS tasks (ECS stacks updater, ECS table syncer, ECS subscriptions, ECS catalog index syncer, ECS policy updater, ECS share verifier) will be disabled (with EventBridge boto3 calls) and re-enabled after the maintenance window is over.
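A minimal sketch of toggling the schedules, assuming each scheduled ECS task is triggered by an EventBridge rule whose name is known. The helper name and the rule names in the usage note are hypothetical; `disable_rule` and `enable_rule` are the existing boto3 EventBridge client calls. The client is passed in so the helper can be exercised without AWS access.

```python
from typing import Iterable, List


def set_scheduled_tasks_enabled(events_client, rule_names: Iterable[str], enabled: bool) -> List[str]:
    """Enable or disable the EventBridge rules that trigger scheduled ECS tasks.

    `events_client` is expected to behave like boto3.client("events").
    Returns the names of the rules that were changed, in order.
    """
    changed = []
    for name in rule_names:
        if enabled:
            events_client.enable_rule(Name=name)
        else:
            events_client.disable_rule(Name=name)
        changed.append(name)
    return changed
```

Starting a maintenance window might then call `set_scheduled_tasks_enabled(boto3.client("events"), ["stacks-updater-rule", "table-syncer-rule"], enabled=False)` and call it again with `enabled=True` when the window ends. Disabling a rule only stops new scheduled runs; tasks that are already running are not affected.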
ECS Running Tasks ( Design Decision Needed )
UPDATE - Going forward with Option 2, as it is easy to implement as a GraphQL endpoint and serves the purpose.
Wait for all the long-running tasks to complete. Programmatically keep polling the status of all running ECS tasks (as is done for stack updates). Once all running ECS tasks have completed, change the maintenance status to READY-FOR-DEPLOYMENT.
Option 1:
Once the maintenance window is triggered, another ECS task will be started which keeps polling to check whether all the ECS tasks have completed. The reason for creating another ECS task is that this polling would itself be a long-running process, and the Lambda runtime limit will not be sufficient.
Once the polling is completed, the status of the maintenance window will be changed to READY-FOR-DEPLOYMENT.
Also, an email will be sent to data.all admins to inform them that it is safe to start the deployment.
Pros - Dedicated background process for monitoring ECS tasks.
Cons - Won't be exposed as a GraphQL endpoint accessible through API calls.
Option 2 :
Once the maintenance window is triggered, the frontend keeps polling at an interval, calling a GraphQL endpoint which checks whether the ECS tasks have finished running.
Pros - No need to create new ECS tasks, and the code can easily be integrated as a new GraphQL endpoint.
Cons - The user has to come to the data.all UI to trigger this polling. If the user is not on the UI, they will miss the polling even when the ECS tasks have completed and the maintenance window is in READY-FOR-DEPLOYMENT mode. Also, with this type of polling, the email notification will be sent only when the user returns to the maintenance UI, which starts the polling.
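The check behind such an endpoint could look roughly like the sketch below. It assumes all relevant tasks run in a single ECS cluster and that the status stays PENDING until no tasks are left; the function names and the single-cluster assumption are illustrative, while `list_tasks` with `cluster`, `desiredStatus` and `nextToken` is the existing boto3 ECS client call.

```python
def count_running_tasks(ecs_client, cluster: str) -> int:
    """Count ECS tasks whose desired status is RUNNING, following pagination.

    `ecs_client` is expected to behave like boto3.client("ecs").
    """
    total = 0
    token = None
    while True:
        kwargs = {"cluster": cluster, "desiredStatus": "RUNNING"}
        if token:
            kwargs["nextToken"] = token
        page = ecs_client.list_tasks(**kwargs)
        total += len(page.get("taskArns", []))
        token = page.get("nextToken")
        if not token:
            return total


def maintenance_status(ecs_client, cluster: str) -> str:
    """Status the polled endpoint would report while a window is open."""
    if count_running_tasks(ecs_client, cluster) == 0:
        return "READY-FOR-DEPLOYMENT"
    return "PENDING"
```

Because each call is a single short check, it fits inside a Lambda invocation; the waiting happens in the frontend, which simply calls the endpoint again after an interval.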
Additional Guardrails :
A data.all admin can still deploy and start the CodePipeline phase even when ECS tasks are still running. This needs to be handled with some guardrails in the CDK app at deployment time (TBD).
Schema Changes
A maintenance table will have to be created with attributes like:
Status : The status of the maintenance window (ACTIVE, INACTIVE, PENDING)
Mode : READ-ONLY / NO-ACCESS mode
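As a rough illustration of these two attributes, the values could be modelled as enums so that unknown strings are rejected early. The class names, the `from_row` helper, and the choice to allow an empty mode while no window is active are assumptions for this sketch, not part of the proposal.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MaintenanceStatus(Enum):
    ACTIVE = "ACTIVE"
    INACTIVE = "INACTIVE"
    PENDING = "PENDING"


class MaintenanceMode(Enum):
    READ_ONLY = "READ-ONLY"
    NO_ACCESS = "NO-ACCESS"


@dataclass(frozen=True)
class MaintenanceRecord:
    status: MaintenanceStatus
    mode: Optional[MaintenanceMode]  # None while no window is active (assumption)

    @classmethod
    def from_row(cls, status: str, mode: Optional[str]) -> "MaintenanceRecord":
        """Build a record from raw column values.

        Looking an Enum up by value raises ValueError for unknown strings.
        """
        parsed_mode = MaintenanceMode(mode) if mode is not None else None
        return cls(MaintenanceStatus(status), parsed_mode)
```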
UI shown to the user when the maintenance window is scheduled
Depending on the mode of maintenance window , user will be able to view different UIs
READ-ONLY Mode
In this mode, when the maintenance window is scheduled, the user will still be able to access the UI and navigate. All mutation-related actions (creating or editing a dataset, share, environment, etc.) will be blocked. A UI as shown below will be displayed to the user, serving as an indication that the maintenance window is in effect. The text about the maintenance window will be picked from config.json. If the text is not present, a default as shown in the image below will be displayed.
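The fallback from configured text to a default can be sketched as follows. The key path `maintenance.custom_text` and the default wording are placeholders invented for the example; the proposal only states that the text comes from config.json and that a default is used when it is absent.

```python
import json

# Placeholder default; the actual default text is defined by the UI design.
DEFAULT_BANNER = "data.all is currently undergoing maintenance."


def maintenance_banner_text(config_json: str) -> str:
    """Return the configured maintenance text, or the default if none is set."""
    config = json.loads(config_json)
    section = config.get("maintenance") if isinstance(config, dict) else None
    text = section.get("custom_text") if isinstance(section, dict) else None
    if isinstance(text, str) and text.strip():
        return text.strip()
    return DEFAULT_BANNER
```

For example, `maintenance_banner_text('{"maintenance": {"custom_text": "Upgrade in progress, see #data-all on Slack"}}')` returns the custom message, while an empty config falls back to the default.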
NO-ACCESS Mode
In this mode, the user will be able to log in but will see a blank screen (like the splash screen).
Frontend view when in NO-ACCESS mode (shown after the user logs in)
Phase 2
- Implement a scheduled maintenance window - https://github.com/data-dot-all/dataall/issues/1131
- Custom email notifications when scheduling a maintenance window ( these email notifications could also be sent irrespective of if the maintenance window is enabled or not ) - https://github.com/data-dot-all/dataall/issues/1132
- Email / UI notifications about the maintenance window (TODO in the next phase). In phase 2, bare-minimum notifications will be sent to the data.all admins when it is ready to deploy. These emails will only be sent once there are no more running ECS tasks and the maintenance status changes to READY-FOR-DEPLOYMENT.
One other thing to put on the display would be a timer or information on when we expect the system to be back up. Not having that information could cause confusion and anxiousness to users.
I have some questions:
- How will the system handle long-running tasks that are in progress when the maintenance window begins? Will they be allowed to complete, or will they be forcibly terminated?
- What monitoring and alerting mechanisms will be in place to notify administrators of any issues or anomalies during maintenance windows? How will the system handle unexpected errors or failures?
- What procedures will be in place for rolling back changes in the event of unexpected issues or failures during maintenance windows?
- We should also have a way to notify users in advance of an upcoming maintenance window - either via email or on the notifications UI in data all or both.
I'm not sure we need to store user/times in RDS in a new table. If we need to store that for any reason it could conceivably be stored in the activity table as a single additional record. Alongside that, I wouldn't recommend setting specific start/end times of the maintenance window within data.all. data.all should show that a maintenance window is happening, perhaps having some custom configurable text shown (that may redirect users to some other internal communications like Slack/etc which is giving more details about any maintenance).
We may need to consider that admins will still need access to the UI/GraphQL during the maintenance window, for purposes of validation and/or debugging. Upgrades/deploys also may not be the only reason we need maintenance windows where we need to block users from using the data.all instance. Certainly regular users should be blocked, but ideally data.all admins can still do whatever they want with data.all.
One other thing to put on the display would be a timer or information on when we expect the system to be back up. Not having that information could cause confusion and anxiousness to users.
As deployment times vary, a timer set on the maintenance UI would be tricky, and in some cases the data.all admin would have to extend it. Instead, a custom text (containing communication about the maintenance window, who to contact, etc.), which can be set in the config.json file, can be shown on the UI while the maintenance is taking place. Thanks for suggesting this.
I have some questions:
- How will the system handle long-running tasks that are in progress when the maintenance window begins? Will they be allowed to complete, or will they be forcibly terminated?
- What monitoring and alerting mechanisms will be in place to notify administrators of any issues or anomalies during maintenance windows? How will the system handle unexpected errors or failures?
- What procedures will be in place for rolling back changes in the event of unexpected issues or failures during maintenance windows?
- We should also have a way to notify users in advance of an upcoming maintenance window - either via email or on the notifications UI in data all or both.
For Q1. The maintenance window status will be READY-FOR-DEPLOYMENT once the long-running tasks and any pending stack updates have completed. When the maintenance window is triggered on the UI, subsequent GraphQL calls will be blocked, but all the running stack updates will be monitored until they are completed.
Q2. The purpose of the maintenance window is just to create a safe window to deploy. In case of deployment failure, admins and devs will have to monitor and fix the deployment error. Although it would be great to have some automation in case of failure, it would be a larger development effort and is out of scope for this task.
Q3. For any failures and rollbacks, the deployment team will have to intervene and fix those issues.
Q4. I like this idea of notification to the users about the maintenance window. I will take this into consideration while making the design. Thanks!
From discussion with @TejasRGitHub:
- During maintenance window - only block mutations instead of blocking all calls to graphQL (for non DA Admins)
- No need for a new custom page to route users to during maintenance window (adds complexity to handle both current users logged in to dataall UI and new users logging in routed to a separate page)
- Can keep UI the same and block ALL mutations (unless DA Admin)
- Can have notification banner that just tells user that maintenance window is active
- Check if we need to poll on ECS and if it's worth the developer effort.
- I would encourage the design of this PR to not include polling of ECS tasks once a Maintenance Window is enabled
- Even if the Maintenance Window is in a status of Pending, what is preventing the DA Admins from upgrading and starting a new CodePipeline execution? I think Maintenance Window enabled should just block all NEW mutation API calls and prevent any NEW ECS scheduled tasks (responsibility on admin team to upgrade appropriately)
- Admin can add custom email notifications on a button present on maintenance window UI
- Similar to how the maintenance window can be enabled/disabled from admin settings, let's have a way for Admins to broadcast a notification (via UI button or similar) letting all DA Users know about the maintenance window
- Can leverage existing Notifications module code to send notification in both data.all UI and via email using SES (if configured)
- Link to new features instead of showing up a modal pop-up
- For new features display can have a button in the header of data.all UI (as shown in image above) that links to the OS Release notes for the latest version (or links to some custom release notes via configuration file?)
- No need to dynamically read release notes from somewhere or pull in a markdown file - likely easiest to have a link here and will serve same purpose
- [Nice to have] - have a way to hide the link / modal popup for new features UI
- Look into frontend design to broadcast pop up notification to notify user that an upgrade just happened and to refer to new features for more information
- Once user acknowledges the pop-up it never shows again
@TejasRGitHub just to add on the topics above. I agree with the summary from @noah-paige: block mutations not queries, block new tasks/APIs only, DAAdmins can do all tasks. I would add:
- Since we are blocking a lot of actions, maybe we should make it very evident to users that the maintenance window is open. In my opinion, the notification banner should not be a temporal banner like the one we usually get, but a permanent window or a significant UI change (e.g. change some of the theme colors)
- I would like to be able to see the current data.all version at any moment in the UI, along with the new features and THE FUTURE DEPRECATION of features. I think we are not showing this with the current design, right?
Thanks for the comments.
For 1. I like this idea. Maybe I can have a permanent snackbar which doesn't disappear and stays on the screen for as long as the maintenance window is ON.
For 2. The information about new features and future deprecations will be mentioned in the release notes, and there will be a button, as shown here, which will show the version number; clicking this button will take the user to the release notes page. For informing the user that a new version of data.all has been deployed, please see "Modal Pop-up for informing user about new release" in my comment.
Hi @TejasRGitHub I am reviewing your design, here are my comments:
- in Phase 1, When in NO-ACCESS mode, all OpenSearch calls are blocked ----> Why are they blocked for the admins?
- Regarding the decision on the ECS Running Tasks status options ----> I think I am more inclined towards option 2. As long as we provide good error messages when we try to change the maintenance mode, I think we are good. In the end, the users of this part of the feature are data.all admins, and we assume they have a level of data.all understanding :)
- Question regarding the schema changes ----> Is the table going to contain a single row with the status and mode of the platform or are we going to create an item for each time we update the status (like a history of maintenance changes)?
- I like the phases approach
For 1, corrected the statement. All calls to OpenSearch will be blocked except for data.all admins. Thanks for pointing that out @dlpzx
Regarding the schema question: the table will contain only one row, which will be inserted during the alembic migration. During each maintenance window, the status of that one row will be updated to reflect the maintenance status at that time. For the history of who did what during maintenance mode, the activity table could be used.
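The single-row approach can be illustrated with an in-memory SQLite database standing in for RDS. The table name, column names, and the `CHECK (id = 1)` guard are assumptions for this sketch; in data.all the row would be created by the alembic migration rather than by application code.

```python
import sqlite3
from typing import Optional, Tuple


def create_maintenance_table(conn: sqlite3.Connection) -> None:
    """Create the table and seed its only row (the migration's job in practice)."""
    conn.execute(
        "CREATE TABLE maintenance ("
        " id INTEGER PRIMARY KEY CHECK (id = 1),"  # guard: at most one row
        " status TEXT NOT NULL,"
        " mode TEXT)"
    )
    conn.execute("INSERT INTO maintenance (id, status, mode) VALUES (1, 'INACTIVE', NULL)")
    conn.commit()


def set_state(conn: sqlite3.Connection, status: str, mode: Optional[str]) -> None:
    """Update the single row in place instead of inserting a history entry."""
    cur = conn.execute(
        "UPDATE maintenance SET status = ?, mode = ? WHERE id = 1", (status, mode)
    )
    if cur.rowcount != 1:
        raise RuntimeError("maintenance row is missing")
    conn.commit()


def get_state(conn: sqlite3.Connection) -> Tuple[str, Optional[str]]:
    """Read the current (status, mode) pair."""
    return conn.execute("SELECT status, mode FROM maintenance WHERE id = 1").fetchone()
```

Since the row is only ever updated, the table never grows; any audit trail of who changed the state would live elsewhere, for example in the activity table mentioned above.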
@TejasRGitHub
- Here:
"GraphQL Calls - These calls will be blocked on mutation requests and others will be allowed."
Can you give an example of what "others" calls are? Which graphql calls will be allowed for normal users?
- Instead of error message:
'Unauthorized : data.all is in maintenance mode and thus your actions are blocked for now'
maybe we use: "Access Restricted: data.all is currently undergoing maintenance, and your actions are temporarily blocked."
- "Cons - User will have to come to the data.all UI and trigger this polling. If user is not on the UI they will miss polling even when the ECS tasks have completed and maintenance window is in READY-FOR-DEPLOYMENT mode."
For this: I would call this out as a risk and look for a solution for this. It can often happen that as an admin I started maintenance window and got busy with some other work. I should still have a way to know when the system is READY-FOR-DEPLOYMENT.
@TejasRGitHub
- Here:
GraphQL Calls - These calls will be blocked on mutation requests and others will be allowed.
Can you give an example of what "others" calls are? Which graphql calls will be allowed for normal users?
By Other calls I mean the query calls . I will correct the wording in the proposal. For normal users in READ-ONLY mode, query calls will be allowed and mutation calls will be blocked. In NO-ACCESS Mode, both query as well as mutation calls will be blocked
- Instead of error message:
'Unauthorized : data.all is in maintenance mode and thus your actions are blocked for now'
maybe we use:"Access Restricted: data.all is currently undergoing maintenance, and your actions are temporarily blocked."
Thanks for the suggestion. Will use this one instead.
Cons - User will have to come to the data.all UI and trigger this polling. If user is not on the UI they will miss polling even when the ECS tasks have completed and maintenance window is in READY-FOR-DEPLOYMENT mode.
For this: I would call this out as a risk and look for a solution for this. It can often happen that as an admin I started maintenance window and got busy with some other work. I should still have a way to know when the system is READY-FOR-DEPLOYMENT.
I completely understand this problem. In order to address it, I will be providing a note on the maintenance UI which will explicitly tell data.all admin user to check back on the UI for getting updates on maintenance window status. In the future phases, while working on scheduled maintenance window and email notifications this issue can be handled in a better way.