
[FEATURE]: Get Training Backend to Update User on Training Progress with AWS Appsync

Open dwu359 opened this issue 2 years ago • 11 comments

Feature Name

Get Training Backend to Update User on Training Progress with AWS Appsync

Your Name

Daniel Wu

Description

HTTP, which we use for communication between the frontend and backend, follows a request-response model: the backend can only send data in response to a request from the frontend. To report training progress, the backend needs to send multiple messages to the frontend after the initial training request. Luckily, AWS AppSync handles that for us with its WebSocket pub/sub APIs, which allow bidirectional communication between the frontend and backend. More specifically, both the frontend and backend can listen on AppSync's WebSocket endpoint for messages with a particular channel id, and both can make GraphQL API requests to AppSync to send messages with that same channel id.

Use AWS AppSync to update the user on training progress for a particular training request. For now, let's say that training progress means the number of epochs completed.
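
As a rough illustration of the "send a message with a channel id" step, here is a minimal sketch of how the backend might build the JSON body of a GraphQL request that reports the number of epochs completed. The mutation name `publishProgress`, its fields, and the helper `build_progress_request` are hypothetical; the real names depend on whatever schema is defined in AppSync, and the HTTP call and authentication headers are left out.

```python
import json

# Hypothetical mutation: the real name and fields depend on the AppSync schema.
PUBLISH_PROGRESS = """
mutation PublishProgress($channelId: ID!, $epochsCompleted: Int!, $totalEpochs: Int!) {
  publishProgress(
    channelId: $channelId
    epochsCompleted: $epochsCompleted
    totalEpochs: $totalEpochs
  ) {
    channelId
    epochsCompleted
    totalEpochs
  }
}
"""


def build_progress_request(channel_id, epochs_completed, total_epochs):
    """Return the JSON body for a GraphQL request reporting training progress."""
    if not 0 <= epochs_completed <= total_epochs:
        raise ValueError("epochs_completed must be between 0 and total_epochs")
    return json.dumps(
        {
            "query": PUBLISH_PROGRESS,
            "variables": {
                "channelId": channel_id,
                "epochsCompleted": epochs_completed,
                "totalEpochs": total_epochs,
            },
        }
    )
```

The training loop would call something like `build_progress_request(channel_id, epoch, total_epochs)` at the end of each epoch and POST the result to the AppSync GraphQL endpoint; the frontend would subscribe using the same channel id.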

dwu359 avatar Aug 21 '23 03:08 dwu359

Hello @dwu359! Thank you for submitting the Feature Request Form. We appreciate your contribution. :wave:

We will look into it and provide a response as soon as possible.

To work on this feature request, you can follow these branch setup instructions:

  1. Check out the `nextjs` branch:
```
git checkout nextjs
```
  2. Pull the latest changes from the remote `nextjs` branch:
```
git pull origin nextjs
```
  3. Create a new branch specific to this feature request using the issue number:
```
git checkout -b feature-920
```

Feel free to make the necessary changes in this branch and submit a pull request when you're ready.

Best regards, Deep Learning Playground (DLP) Team

github-actions[bot] avatar Aug 21 '23 03:08 github-actions[bot]

@dwu359 can you provide more detail?

karkir0003 avatar Aug 21 '23 03:08 karkir0003

can you provide more detail here?

karkir0003 avatar Aug 22 '23 00:08 karkir0003

Why AWS AppSync instead of plain WebSockets?

andrewpeng02 avatar Feb 12 '24 21:02 andrewpeng02

@dwu359

karkir0003 avatar Feb 13 '24 02:02 karkir0003

AppSync seems to handle the WebSocket plumbing for us, but if you are able to find a way to implement it with WebSockets directly, then go for it. I will say, though, that I looked into implementing it with WebSockets before, and library support for WebSockets isn't as good as for REST APIs.

dwu359 avatar Feb 13 '24 02:02 dwu359

I just don't see the need to use another service, and it'll also complicate development (we'd have to deploy to some staging env every time we want to test something?). I'll look into libraries

andrewpeng02 avatar Feb 13 '24 15:02 andrewpeng02

What other uses of websockets do you think we'd want to add in the future?

andrewpeng02 avatar Feb 13 '24 15:02 andrewpeng02

Django Channels seems to be the accepted library for WebSockets, and the implementation won't be too bad. The one catch is that we'd probably have to port our training methods to new WebSocket consumers and also handle authentication a bit differently, in a middleware. So the options seem to be:

  1. Port the train endpoints entirely to WebSockets. Ninja schemas and the like may not be supported?
  2. Define a WebSocket just to report the current training epoch. This requires 2 separate requests, and figuring out how to connect the two will be annoying, but it involves less refactoring (we can likely just create a job UUID on the client side and pass it to both the endpoint and the WebSocket).
  3. Retain the original HTTP training endpoints so we don't have to create new authentication and we keep schema support for the input. Instead of doing the training in the endpoint, create a task via Celery and return the job id to the user (this is better for long-running tasks too). The user then opens a WebSocket with Django Channels, and the Celery task periodically updates the WebSocket group with the progress and eventually returns the result. Long-term, Celery tasks would be best if we're planning on long-running training jobs, especially with image data.

In terms of effort, 2 < 1 = 3
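
To make the job-id flow in option 3 concrete, here is a minimal in-memory sketch. It uses plain `asyncio` queues as a stand-in for a Django Channels group and a coroutine as a stand-in for the Celery task, so it is not Channels or Celery code; `ProgressHub`, `train`, and `run_job` are hypothetical names for illustration only.

```python
import asyncio
import uuid
from collections import defaultdict


class ProgressHub:
    """In-memory stand-in for a channel layer: one group of queues per job id."""

    def __init__(self):
        self._groups = defaultdict(list)

    def subscribe(self, job_id):
        # Roughly what a websocket connection would do when it joins the job's group.
        queue = asyncio.Queue()
        self._groups[job_id].append(queue)
        return queue

    async def publish(self, job_id, message):
        # Deliver the message to every subscriber of this job.
        for queue in self._groups[job_id]:
            await queue.put(message)


async def train(hub, job_id, epochs):
    """Stand-in for the background task: report after each epoch, then finish."""
    for epoch in range(1, epochs + 1):
        await asyncio.sleep(0)  # placeholder for one epoch of real training
        await hub.publish(job_id, {"type": "progress", "epoch": epoch, "total": epochs})
    await hub.publish(job_id, {"type": "done", "epochs": epochs})


async def run_job(epochs):
    hub = ProgressHub()
    job_id = str(uuid.uuid4())  # what the HTTP endpoint would return to the client
    queue = hub.subscribe(job_id)  # the client subscribes using that job id
    task = asyncio.create_task(train(hub, job_id, epochs))
    received = []
    while True:
        message = await queue.get()
        received.append(message)
        if message["type"] == "done":
            break
    await task
    return received
```

Running `asyncio.run(run_job(3))` collects three progress messages followed by a final `done` message. In this sketch the client subscribes before training starts; with a real websocket that connects after the job is queued, early progress messages could be missed unless the latest state is also stored somewhere.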

andrewpeng02 avatar Feb 13 '24 21:02 andrewpeng02

@dwu359 ?

karkir0003 avatar Feb 14 '24 00:02 karkir0003

Django Channels seems like a good start. Keep in mind you will need to find some way to host the WebSocket server (likely through EC2) and access it (likely through API Gateway or something else). I'm sorry I can't help much further; I'm no longer a direct contributor to this project, and it seems like at this point you know more about WebSockets than I do.

dwu359 avatar Feb 14 '24 00:02 dwu359