Open-Assistant
Set up a process for collecting raw datasets
Many community members spend a lot of time scraping, building, or otherwise assembling datasets that could be useful for training the assistant. We want to collect all of this data in a central place, so that this valuable work does not get lost. For now, the goal is just to collect the raw data as people create it; there is no need yet to do any cleaning or processing. That can happen in a later step.
- [ ] Set up an s3 bucket to collect raw datasets (via LAION)
- [ ] Determine who gets what permissions to the s3 bucket
- [ ] Define a process to get new datasets into the s3 bucket (e.g. send them to persons X or Y; see the upload sketch after this list)
- [ ] Document this in a public place where dataset creators can easily be pointed to
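For illustration, here is a minimal sketch of what the third point could look like as a pipeline script. The bucket name `oa-raw-datasets` and the `raw/<dataset-name>/` key layout are placeholders, not decisions:

```python
# Sketch only: bucket name and key layout are assumptions, not decided yet.
import os
import boto3

def upload_raw_dataset(local_path: str, dataset_name: str) -> None:
    """Upload one raw dataset file into the shared collection bucket."""
    s3 = boto3.client("s3")
    key = f"raw/{dataset_name}/{os.path.basename(local_path)}"
    s3.upload_file(local_path, "oa-raw-datasets", key)

upload_raw_dataset("./my_scraped_data.jsonl", "my-scraped-dataset")
```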
I have a background in AWS and should be able to solve this. The only point that is unclear to me is where the data should be coming from. Only Hugging Face?
I'm not familiar with the user-management system you have right now, but I assume we will be able to create/adjust the bucket policy. It would be better to have a single role for all "standard" users so that we don't bloat the bucket policy.
The third point is not clear: we can upload datasets via the AWS API in a pipeline script, but I'm not sure what you mean by sending them to persons X and Y.
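To illustrate the single-role idea: one shared role in the bucket policy instead of per-user statements. The account ID, role name, and bucket name below are placeholders, not the project's actual resources:

```python
# Sketch of a single-role bucket policy; account ID, role name and bucket
# name are all placeholders.
import json
import boto3

BUCKET = "oa-raw-datasets"
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # One shared role for all "standard" users keeps the policy
            # small instead of enumerating individual IAM users.
            "Sid": "ContributorWrite",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/oa-contributor"},
            "Action": ["s3:ListBucket", "s3:PutObject"],
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
        }
    ],
}

boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```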
@onegunsamurai thank you very much for offering. I've put this particular issue on "blocked" for now because I think we'll initially go with the HF hub as storage, and we're trying to figure out what other backends could be used. So there is nothing to do as of now.
Hi @yk – I wanted to offer Storj DCS (https://github.com/storj/storj) as an option for the storage backend. Storj DCS is a globally distributed object store that is compatible with the S3 API, and also compatible with Hugging Face Hub.
Storj DCS is an open-source, participant-driven cloud storage model that closely aligns with the open-source values and distributed access/training patterns of OpenAssistant.
The distributed model can be much more performant than AWS because it uses BitTorrent-style parallelism in combination with erasure coding to accelerate large data transfers (and to enable globally available storage). We can easily run performance tests to prove out the advantage. Bandwidth is also $7/TB vs. $80/TB on AWS (and we can offer a nice grant for open-source projects).
It is a great fit for federated/distributed training like what we are building with OpenAssistant.
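Since the gateway is S3-compatible, an existing boto3 pipeline should work with just an endpoint change. A sketch, assuming Storj's hosted gateway endpoint and placeholder credentials from an access grant:

```python
# Same upload flow as with AWS, pointed at Storj's S3-compatible gateway.
# The endpoint URL assumes Storj's hosted gateway; the credentials and
# bucket name are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://gateway.storjshare.io",
    aws_access_key_id="<storj-access-key>",
    aws_secret_access_key="<storj-secret-key>",
)
s3.upload_file("./my_scraped_data.jsonl", "oa-raw-datasets",
               "raw/my_scraped_data.jsonl")
```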
I would like to help lead this initiative for bucket storage if possible.
As an alternative to paying for Storj or S3, or letting Hugging Face have final say over what data is or isn't acceptable, archive.org is pretty reliable and supplies torrents too.
archive.org URLs with BitTorrent infohashes as a backup would be a pretty safe place to store things. Anyone can upload, and each item can be recorded as a single line of text (its URL plus infohash).
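For what it's worth, uploads can be scripted with the `internetarchive` Python package, and archive.org derives a torrent for each item automatically. The item identifier and metadata below are made-up examples:

```python
# Sketch using the `internetarchive` package; the identifier and metadata
# are placeholders. archive.org generates a torrent per item on its own.
from internetarchive import upload

upload(
    "open-assistant-raw-dataset-example",  # hypothetical item identifier
    files=["./my_scraped_data.jsonl"],
    metadata={
        "title": "Open-Assistant raw dataset (example)",
        "mediatype": "data",
    },
)
```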