Open-Assistant
Periodically (daily/weekly) publish data under open licence
To encourage participation, the data should be made available periodically. Maybe OpenStreetMap could be an inspiration for that.
Is it published at all? I could not find any links to download it.
The data will be released, but currently there is no process for periodically exporting it. If you want to work on a system for doing that, feel free to look into the codebase. Some issues to consider include cleaning the data (incl. PII removal, etc.) before any release.
As a bystander: it's the first question about the project that came to mind after watching Yannic Kilcher's video on OpenAssistant.
Data should be released from day 0, not a pinky promise. Otherwise it's not Open Source. If someone wants to host a forked version that releases all data immediately, only then would I consider contributing.
As mentioned above, you can't just click a button and export the data as-is, because you risk releasing PII and/or illegal content and/or spam which hasn't been filtered out by the moderation process yet. We have not yet developed the system to process and export the data. If you want it to be a feature, make a PR adding that system instead of talking down other people's work please :)
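To illustrate the kind of cleaning step involved, here is a minimal sketch; the regexes and field names are assumptions for illustration, not an actual pipeline:

```python
import re

# Simplified, assumed patterns -- a real cleaning pass would need far more
# (names, addresses, ID numbers, per-language rules, human review, ...).
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),   # email-address-like strings
    re.compile(r"\+?\d[\d\s().-]{7,}\d"),      # phone-number-like strings
]

def looks_like_pii(text: str) -> bool:
    """Return True if the text matches any of the naive PII patterns."""
    return any(p.search(text) for p in PII_PATTERNS)

def filter_messages(messages):
    """Keep only messages whose `text` field passes the naive PII check."""
    return [m for m in messages if not looks_like_pii(m["text"])]
```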
I'll watch this thread then for updates :)
I need help because I cannot work properly.
(Also a bystander) Mozilla Common Voice, which is in a similar situation, releases its data twice a year, which I think is a bit long of an interval. They also found that it takes a lot of time to get everything formatted, checked, and published for 100 languages. They also had an option for users to later change their minds about having their data in the dataset, and their data was much larger since it also contained sound recordings. I hope we can automate the publications a bit more and use a shorter interval.
I really want to know what they are working on.
docker run -p 6070:80 dorowu/ubuntu-desktop-lxde-vnc

Maybe one solution would be to release only data with sufficient user votes. Maybe some kind of export API could be provided for that. Creating exports like this on the fly is probably not really feasible, so we could generate such exports nightly or weekly.
I haven't managed to look into the codebase itself, but from my perspective the following things need to be answered:
1. At which point is certain data considered safe to export?
2. How should such an export dump look? (SQL database dump containing only safe data, ZIP, etc.?)
3. How should that data be made available? (e.g. API, automatic upload to some kind of hosting platform, or something else?)
The most important and most difficult question to answer is the first one.
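As a rough sketch of the vote-threshold idea (the field names and threshold here are assumptions, not the real schema):

```python
import json
from datetime import datetime, timezone

MIN_VOTES = 3  # assumed threshold; the real cut-off would need discussion

def nightly_export(rows, out_dir):
    """Write sufficiently-voted messages to a dated JSON Lines file.

    `rows` is assumed to be an iterable of dicts with `message_id`,
    `text` and `votes` keys -- placeholder names, not the real schema.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    with open(f"{out_dir}/export-{stamp}.jsonl", "w", encoding="utf-8") as f:
        for row in rows:
            if row["votes"] >= MIN_VOTES:
                f.write(json.dumps({"id": row["message_id"], "text": row["text"]}) + "\n")
```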
So currently we have a message tree system where trees progress through states from initial submission to completion. Once a tree is in the completed state it will no longer receive new rankings, ratings, labels, etc., as we have collected sufficient information. The starting point would therefore be to only consider trees which are completed. Then we would want to filter them by their quality according to the evaluation process, and ensure those tagged with PII/obscenity etc. do not make it in.
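As a rough sketch of that filter (the state and flag names below are placeholders, not the real models):

```python
def exportable_messages(trees):
    """Yield messages from completed trees, skipping moderation-flagged ones.

    Assumes each tree has a `state` string and a `messages` list, and each
    message has a `flags` collection -- placeholder attributes, not the
    project's real ORM models.
    """
    for tree in trees:
        if tree.state != "completed":
            continue  # still collecting rankings/labels; not ready to export
        for message in tree.messages:
            if {"pii", "obscene", "spam"} & set(message.flags):
                continue  # flagged content stays out of any export
            yield message
```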
@olliestanley Has there been any progress in creating a method for exporting and publishing the data? To date, after reviewing thousands of messages, I have not come across any instance of "PII" or "illegal" data. I hugely doubt there is much risk in that respect. A single publication of the current database in any format would boost everyone's confidence that our free labor is not going into a walled garden, even if it's not perfect (has spam, etc.)
Honestly the entire unfiltered set of message trees would be valuable for learning what types of messages the community doesn't think are appropriate.
If you need me to create a PR that exports the database for this to be possible... I think I could write code that does a rudimentary SQL dump.
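For example, something as simple as invoking pg_dump for the relevant tables; the host, user, database, and table names below are placeholders and would need to match the actual deployment:

```python
import subprocess

def rudimentary_dump(out_file: str = "oasst-dump.sql") -> None:
    """Dump a message table to a plain SQL file with pg_dump.

    Connection details and table names are placeholders, not the
    project's actual configuration.
    """
    subprocess.run(
        [
            "pg_dump",
            "--host", "localhost",
            "--username", "postgres",
            "--dbname", "postgres",
            "--table", "message",   # placeholder table name
            "--data-only",
            "--file", out_file,
        ],
        check=True,
    )
```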
In recent weeks an export script has been developed which can export messages that are not flagged as spam, PII, etc. Features like exporting anonymised user/message IDs have also been added this week. This data is now starting to be used by a few people, and some samples have been made public in the OA model eval repo. I think we are not far from being able to release more fully.
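The anonymisation part can be sketched as a salted one-way hash of the original IDs; this is a simplified illustration of the general technique, not the actual export script:

```python
import hashlib

def anonymise_id(original_id: str, salt: str) -> str:
    """Map a real user/message ID to a stable pseudonymous ID.

    A salted SHA-256 keeps the mapping consistent within one export while
    making it hard to recover the original ID; the salt handling here is
    a simplification, not what the real export script does.
    """
    return hashlib.sha256((salt + original_id).encode("utf-8")).hexdigest()[:16]
```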
Which data should I write for you?
I have to echo this sentiment. I appreciate the desire to release a quality dataset, but I don't follow the logic of withholding it until it's been "cleaned".
Judging by how the dataset was collected in the first place, I can't imagine PII or (possibly) illegal content even making it in there. Even so, I feel like this could be solved with a keyword search.
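For instance, a simple keyword scan over an export would only take a few lines (the blocklist and record format are placeholders):

```python
BLOCKLIST = {"password", "social security", "credit card"}  # placeholder terms

def flag_suspicious(messages):
    """Return messages whose `text` contains any blocklisted keyword."""
    return [m for m in messages if any(k in m["text"].lower() for k in BLOCKLIST)]
```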
I'm not alone in this sentiment either, this is a fairly common question ATM. Why not just release the raw unfiltered dataset along with a curated version, and stick a disclaimer on it? I'm sure there are plenty of projects that could find uses for content identified as spam or poor quality. In any case, it would make a lot of us feel better and stop grumbling.
Maybe I'm being overly sensitive due to all the moral gatekeeping that's happened in the space lately, but this is a dataset for an open source project that many of us actually contributed to.
Clearly the data is in a workable state to the point that several models were able to be trained internally.
At the very least, estimates would be nice. The last update on the discord was from over two weeks ago, and the model has already been out for a little while now.
EDIT: I should also mention that if there's work that could be done to help export the unfiltered data, I'd be happy to help however I can, as some other people have also suggested in this issue.
Update: the plan is to release the initial model and dataset together on the 15th of April. So hang tight.
Closing this issue, as the version 1 dataset release was confirmed a while ago to be on the 15th of April, and nobody has shown interest in working on the automated process for regular releases.
@olliestanley can you make an official commitment to periodically publish newly collected data - under a commercial-friendly open source license - in the future? Even once every six months would be fine.
Before contributing to a project such as this, I'd like to have a guarantee that my contributions would always be available and not locked behind some payment wall, unacceptable licensing or anything of the sort.
It's not my decision to make, but we can definitely consider setting a regular cadence. Could you make a new issue suggesting this, with your idea for a cadence/commitment? This thread got quite derailed, so it would be good to have a clean one.