[FEATURE]: Create Train and Test Datasets from User-Uploaded Dataset in S3 for /training
Feature Name
Create Train and Test Datasets from S3 for /training
Your Name
Daniel Wu
Description
Right now, the training backend can only handle default datasets for /tabular. Allow user-uploaded datasets to be used for tabular training by implementing a dataset creator in training/dataset.py, so that the /tabular endpoint route can read a file from S3 given the filename and split it into train and test datasets.
Currently, datasets are stored in S3 in the dlp-upload-bucket under the key {uid}/{trainspace_type}/{filename}.
You can upload files to the bucket through the SST prod endpoint at https://em9iri9g4j.execute-api.us-west-2.amazonaws.com/ using the /datasets/user/{type}/{filename}/presigned_upload_url route.
EDIT: The above statement is not true, see below
You will also need a bearer token, which can be obtained using the backend CLI. For more info, run cd training && poetry run python cli.py --help.
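For context, a minimal sketch of how the dataset creator might read one of these objects is shown below, assuming boto3 and pandas are available in the training environment. The function name, and taking the bucket name and key layout above literally, are illustrative assumptions, not the actual DLP implementation.
```
# Illustrative sketch only: read a user-uploaded CSV from dlp-upload-bucket
# using the {uid}/{trainspace_type}/{filename} key layout described above.
import boto3
import pandas as pd


def read_uploaded_csv(uid: str, trainspace_type: str, filename: str) -> pd.DataFrame:
    """Download a user-uploaded dataset from S3 and load it into a DataFrame."""
    key = f"{uid}/{trainspace_type}/{filename}"
    obj = boto3.client("s3").get_object(Bucket="dlp-upload-bucket", Key=key)
    return pd.read_csv(obj["Body"])
```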
Hello @dwu359! Thank you for submitting the Feature Request Form. We appreciate your contribution. :wave:
We will look into it and provide a response as soon as possible.
To work on this feature request, you can follow these branch setup instructions:
- Check out the main branch (nextjs):
```
git checkout nextjs
```
- Pull the latest changes from the remote main branch:
```
git pull origin nextjs
```
- Create a new branch specific to this feature request using the issue number:
```
git checkout -b feature-913
```
Feel free to make the necessary changes in this branch and submit a pull request when you're ready.
Best regards, Deep Learning Playground (DLP) Team
@NMBridges you're doing this task
@NMBridges My bad, this task should deal with reading the dataset files from S3 into training, not writing files to S3.
https://github.com/DSGT-DLP/Deep-Learning-Playground/blob/nextjs/training/training/core/dataset.py
should be the file to implement this in, @NMBridges
@NMBridges also, assume the scope of this use case is tabular (so reading a CSV from S3 and then building the train/test datasets). See the example dataset creator class in the linked file.
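As a rough sketch of the splitting step (not the actual class in training/core/dataset.py: the names, the assumption that the target is a named column, the 80/20 default, and the use of scikit-learn's train_test_split are all illustrative):
```
# Illustrative sketch: split a loaded tabular dataset into train and test sets.
# Target-column handling and the default test_size are assumptions, not DLP conventions.
from dataclasses import dataclass

import pandas as pd
from sklearn.model_selection import train_test_split


@dataclass
class TabularSplit:
    train_X: pd.DataFrame
    test_X: pd.DataFrame
    train_y: pd.Series
    test_y: pd.Series


def create_train_test_datasets(
    df: pd.DataFrame, target_col: str, test_size: float = 0.2
) -> TabularSplit:
    """Separate features from the target column and split into train/test sets."""
    X = df.drop(columns=[target_col])
    y = df[target_col]
    train_X, test_X, train_y, test_y = train_test_split(
        X, y, test_size=test_size, shuffle=True
    )
    return TabularSplit(train_X, test_X, train_y, test_y)
```
The /tabular route could then call something like read_uploaded_csv (hypothetical helper sketched earlier in this thread) followed by create_train_test_datasets once the filename is known.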