
Requesting permission to add Indian Sign Language dataset

Open professorcode1 opened this issue 1 year ago • 3 comments

The Indian Sign Language Research and Training Center has 11993 very high-quality videos, each labeled with the word it corresponds to. I would like to add this dataset to the repository.

The videos can be accessed either through their website (https://divyangjan.depwd.gov.in/islrtc/) or through this Google Drive link (https://drive.google.com/drive/folders/1U-Pr4r1-cupgNOOq9NH_uTsQnPSVEKco).

Challenges:

  1. The website shows the videos via YouTube embeds. Since those videos are hosted on YouTube, the Google Drive copy will have to be used.
  2. There are 11993 videos in total, amounting to 120 GB of data.
  3. Regular Google Drive downloaders that don't use the Google Drive API can only download the first 100 files in a folder. To access the Google Drive API you need to register a project in Google's developer tools.

Proposed solution: the user will have 2 options:

  1. Provide their own API keys if they wish to access the entire dataset.
  2. Just use 100 videos per letter of the alphabet (that's still 2600 videos and ~26 GB of data).
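The size estimate for the second option can be checked from the totals above. This is only a rough sketch: it assumes all videos are about the same size and that every letter has at least 100 videos, neither of which is confirmed here.

```python
# Rough size check for the "100 videos per letter" option.
# Assumption: videos are roughly equal in size, so the average is representative.
TOTAL_VIDEOS = 11993
TOTAL_GB = 120
LETTERS = 26
VIDEOS_PER_LETTER = 100  # limit of downloaders that do not use the Drive API

subset_videos = LETTERS * VIDEOS_PER_LETTER
average_gb_per_video = TOTAL_GB / TOTAL_VIDEOS  # about 0.01 GB, i.e. ~10 MB
subset_gb = subset_videos * average_gb_per_video

print(f"{subset_videos} videos, about {subset_gb:.0f} GB")
```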

In either case the dataset will not download everything synchronously, since Drive download speeds tend to be limited and making too many requests too quickly can get your API keys banned. Rather, it will maintain a buffer of videos (say 100), and once the user has consumed enough of them (say 66%), it will asynchronously dispatch a request to fetch more.
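A minimal sketch of the buffering scheme described above, to make the idea concrete. The class name `BufferedVideoStream` and the `fetch` callable are hypothetical; `fetch` stands in for whatever actually downloads one video from Drive. It uses a background thread and keeps at most one refill request in flight, which is one possible reading of "asynchronously", not a settled design.

```python
import itertools
import queue
import threading
from typing import Any, Callable, Iterator, Sequence


class BufferedVideoStream:
    """Yields videos in order while a background thread keeps a buffer topped up.

    Hypothetical sketch: `fetch` is any callable that downloads one video by id.
    """

    def __init__(self, video_ids: Sequence[str], fetch: Callable[[str], Any],
                 buffer_size: int = 100, refill_fraction: float = 0.66):
        self._ids = iter(video_ids)  # consumed only by the worker thread
        self._total = len(video_ids)
        self._fetch = fetch
        self._queue: "queue.Queue[Any]" = queue.Queue()
        self._buffer_size = buffer_size
        # Number of consumed samples that triggers the next refill request.
        self._batch = max(1, int(buffer_size * refill_fraction))
        self._worker = None

    def _fill(self, count: int) -> None:
        # Download up to `count` more videos, one at a time, in order.
        for video_id in itertools.islice(self._ids, count):
            self._queue.put(self._fetch(video_id))

    def _dispatch(self, count: int) -> None:
        # Wait for any earlier refill so only one request batch is in flight.
        if self._worker is not None:
            self._worker.join()
        self._worker = threading.Thread(target=self._fill, args=(count,), daemon=True)
        self._worker.start()

    def __iter__(self) -> Iterator[Any]:
        self._dispatch(self._buffer_size)  # initial fill
        consumed_since_refill = 0
        for _ in range(self._total):
            yield self._queue.get()  # blocks until the next video is ready
            consumed_since_refill += 1
            if consumed_since_refill >= self._batch:
                consumed_since_refill = 0
                self._dispatch(self._batch)
```

For example, `for video in BufferedVideoStream(ids, fetch): ...` would start with 100 videos and request 66 more each time 66 have been consumed. The sketch leaves out error handling: if `fetch` raises inside the worker, the consumer would block, so a real implementation would need retries or a way to propagate the failure.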

Please let me know your thoughts. Thanks!

professorcode1, Aug 20 '24 20:08

Hi @professorcode1. My thoughts are as follows: perhaps, similar to MS-ASL, YouTube-ASL and YouTube-SL-25, there should be a base dataset called YouTube. Then every implementation should specify its data: text, id, gloss (if any), SignWriting (if any), and a link to the video on YouTube. The base dataset will be in charge of downloading from YouTube directly.
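One way this could look, as a rough sketch only: nothing like this exists in the repository yet, and every name here (`YouTubeRecord`, `YouTubeDataset`, `IndianSignLanguage`, the `download` callable, the placeholder URL) is hypothetical. The downloader is passed in as a plain function so the sketch does not commit to any particular YouTube download tool.

```python
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass
from typing import Callable, Dict, Iterator, Optional


@dataclass
class YouTubeRecord:
    """Fields each concrete dataset would specify (hypothetical schema)."""
    id: str
    text: str
    video_url: str
    gloss: Optional[str] = None
    signwriting: Optional[str] = None


class YouTubeDataset(ABC):
    """Base class that owns downloading; subclasses only list their records."""

    def __init__(self, download: Callable[[str], str]):
        # `download` takes a video URL and returns a local file path.
        self._download = download

    @abstractmethod
    def records(self) -> Iterator[YouTubeRecord]:
        """Yield the metadata for every video in the dataset."""

    def examples(self) -> Iterator[Dict[str, Optional[str]]]:
        # Download lazily, one video per yielded example.
        for record in self.records():
            example = asdict(record)
            example["video_path"] = self._download(record.video_url)
            yield example


class IndianSignLanguage(YouTubeDataset):
    def records(self) -> Iterator[YouTubeRecord]:
        # Placeholder entry; a real implementation would read an index file.
        yield YouTubeRecord(id="example-0", text="hello",
                            video_url="https://www.youtube.com/watch?v=PLACEHOLDER")
```

With this split, a derived class contains no download logic at all, and things like retries or caching would live in one place.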

What do you think?

AmitMY, Aug 21 '24 11:08

Hey @AmitMY.

Please tell me what functionality the base YouTube dataset class should have. It might add unnecessary complexity if all it does is call a download_youtube function on behalf of its derived classes.

professorcode1, Aug 21 '24 16:08

I can't tell you all the functionality, since I did not build it. I can only imagine that there needs to be a unified way to download videos from YouTube (using youtube-dl or something similar).

You could start by making a PR, and I'll be happy to give feedback.

AmitMY, Aug 29 '24 10:08