[FEATURE]: Create Train and Test Datasets from User-Uploaded Dataset in S3 for /training
Feature Name
Create Train and Test Datasets from S3 for /training
Your Name
Daniel Wu
Description
Right now, the training backend can only handle default datasets for /tabular. Allow user-uploaded datasets to be used for tabular training by implementing a dataset creator in training/dataset.py, so that the /tabular endpoint route can read a file from S3 given the filename and split it into train and test datasets.
Currently, datasets are stored in S3 in the dlp-upload-bucket under the key {uid}/{trainspace_type}/{filename}.
You can upload files to the bucket through the SST prod endpoint at https://em9iri9g4j.execute-api.us-west-2.amazonaws.com/ using the /datasets/user/{type}/{filename}/presigned_upload_url route.
EDIT: The above statement is not true, see below
You will also need a bearer token, which can be obtained using the backend CLI. For more info, run cd training && poetry run python cli.py --help.
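For context, a minimal sketch of how the dataset creator might read one of these objects is shown below, assuming boto3 and pandas are available in the training environment. The function name, and taking the bucket name and key layout above literally, are illustrative assumptions, not the actual DLP implementation.
```
# Illustrative sketch only: read a user-uploaded CSV from dlp-upload-bucket
# using the {uid}/{trainspace_type}/{filename} key layout described above.
import boto3
import pandas as pd


def read_uploaded_csv(uid: str, trainspace_type: str, filename: str) -> pd.DataFrame:
    """Download a user-uploaded dataset from S3 and load it into a DataFrame."""
    key = f"{uid}/{trainspace_type}/{filename}"
    obj = boto3.client("s3").get_object(Bucket="dlp-upload-bucket", Key=key)
    return pd.read_csv(obj["Body"])
```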
Hello @dwu359! Thank you for submitting the Feature Request Form. We appreciate your contribution. :wave:
We will look into it and provide a response as soon as possible.
To work on this feature request, you can follow these branch setup instructions:
- Check out the main branch (nextjs):
```
git checkout nextjs
```
- Pull the latest changes from the remote main branch:
```
git pull origin nextjs
```
- Create a new branch specific to this feature request using the issue number:
```
git checkout -b feature-913
```
Feel free to make the necessary changes in this branch and submit a pull request when you're ready.
Best regards, Deep Learning Playground (DLP) Team
@NMBridges you're doing this task
@NMBridges My bad, this task should deal with reading the dataset files from S3 into training, not writing files to S3.
https://github.com/DSGT-DLP/Deep-Learning-Playground/blob/nextjs/training/training/core/dataset.py
should be the file to implement this in, @NMBridges
@NMBridges also, assume the scope of this use case is tabular (so reading a CSV from S3 and then building the train/test datasets). See the example dataset creator class in the linked file.
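As a rough sketch of the splitting step (not the actual class in training/core/dataset.py: the names, the assumption that the target is a named column, the 80/20 default, and the use of scikit-learn's train_test_split are all illustrative):
```
# Illustrative sketch: split a loaded tabular dataset into train and test sets.
# Target-column handling and the default test_size are assumptions, not DLP conventions.
from dataclasses import dataclass

import pandas as pd
from sklearn.model_selection import train_test_split


@dataclass
class TabularSplit:
    train_X: pd.DataFrame
    test_X: pd.DataFrame
    train_y: pd.Series
    test_y: pd.Series


def create_train_test_datasets(
    df: pd.DataFrame, target_col: str, test_size: float = 0.2
) -> TabularSplit:
    """Separate features from the target column and split into train/test sets."""
    X = df.drop(columns=[target_col])
    y = df[target_col]
    train_X, test_X, train_y, test_y = train_test_split(
        X, y, test_size=test_size, shuffle=True
    )
    return TabularSplit(train_X, test_X, train_y, test_y)
```
The /tabular route could then call something like read_uploaded_csv (hypothetical helper sketched earlier in this thread) followed by create_train_test_datasets once the filename is known.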