Gareth Davidson
So... I pulled this and installed it locally; there's loads of juicy stuff in there. I'll see how viable it is to snag some Q&A data sets out of it...
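A minimal sketch of what that extraction might look like, assuming the dump exposes conversation trees as JSON Lines with `role`/`text`/`replies` fields; the real schema may well differ:

```python
import json

# Sketch: walk a JSON export of message trees and pair each prompter
# message with its direct assistant replies. The file name and the
# "role"/"text"/"replies"/"prompt" field names are assumptions about
# the schema, not confirmed against the actual dump.

def collect_qa(node, pairs):
    for reply in node.get("replies", []):
        if node.get("role") == "prompter" and reply.get("role") == "assistant":
            pairs.append({"question": node["text"], "answer": reply["text"]})
        collect_qa(reply, pairs)  # recurse into deeper turns

pairs = []
with open("trees.jsonl") as f:
    for line in f:
        tree = json.loads(line)
        collect_qa(tree.get("prompt", tree), pairs)

print(f"extracted {len(pairs)} Q&A pairs")
```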
If you feel a prompt is too complex, or a bit like asymmetric warfare by the prompter, then simply skip it. If enough people skip a prompt it'll...
I think releasing the data in full, with only legally required redactions like PII, and with users and votes identified by a hash or something, would help. Then we have something...
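By "identified by hash" I mean something like the sketch below: a keyed hash of the user ID, with the key kept private so the pseudonyms can't be reversed by brute-forcing known usernames. The key value and record fields here are just illustrative:

```python
import hmac
import hashlib

# Sketch: pseudonymise user IDs before release with a keyed hash (HMAC).
# A plain unsalted hash would be brute-forceable from a list of known
# user IDs, so the key must stay private and never be published.
SECRET_KEY = b"replace-with-a-real-secret"  # illustrative placeholder

def pseudonym(user_id: str) -> str:
    return hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()[:16]

record = {"user": "gareth", "vote": 1}
released = {"user": pseudonym(record["user"]), "vote": record["vote"]}
print(released)  # the same user always maps to the same opaque token
```

That keeps votes linkable per user across the release without exposing who anyone is.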
I've updated with some feedback. I'll actually do a smallish one and iron out the process before mass-editing people's data issues; it might take a few days, so more feedback...
> New text seems a little marketing speak too and buzzy imo.

I agree. When it comes to writing docs I tend to think "it's not done when there's nothing...
I don't think the *assisted* suicide thing is that big of a risk or something we should block. If someone uses the thing to research their options, get referred to...
If you take NSFW content from the web, edit it, and then train models to think it's CSAM, wouldn't that be completely out of distribution? And wouldn't models trained...
I'll repeat what I said on Discord here: The risk with editing people's words and making them into paedophile content, then sharing it as a CSAM data set, is that...
> This is something I'm going to try out and see. @shahules786 can give more info about this.

Even if the original text doesn't flag as CSAM in our model,...
> If we build a system to flag CSAM, which metric should it be more careful about, false positives or false negatives?

If you were to pick one to optimise...
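In practice that choice comes down to where you set the classifier's decision threshold. A toy sketch of the trade-off (the labels and scores below are made up, not from any real system; scikit-learn is assumed):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy data: 1 = should be flagged, 0 = benign; scores are classifier
# confidences. All values here are invented for illustration.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

# Sweep the decision threshold: lowering it reduces false negatives
# (recall goes up) at the cost of more false positives (precision drops).
precision, recall, thresholds = precision_recall_curve(y_true, scores)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Optimising for recall (few false negatives) means picking the lowest threshold whose precision is still workable for whoever reviews the flags.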