deno_registry2

Moderation filters

Open lucacasonato opened this issue 3 years ago • 17 comments

We should automatically moderate the names of modules people are uploading. I think we can start with these three steps (ordered by priority):

  1. Add a list of reserved module names that cannot be registered automatically. Easiest would be a JSON file with an array of disallowed names. (@lucacasonato)
  2. Check any new module name against a list of 'bad' words. We need to find a list to use (https://www.cs.cmu.edu/~biglou/resources/bad-words.txt is not good as it blocks everyday words like color, queer, or africa). (up for grabs)
  3. Disallow any module names that have a Levenshtein distance of less than 3 to any other existing module name, bad word, or reserved module name. (up for grabs)
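
A minimal sketch of what steps 1 and 2 could look like in TypeScript. Everything here is an assumption made for illustration: the list contents, the `screenModuleName` name, and the choice to match whole underscore-separated segments instead of substrings (one way to avoid blocking everyday words that merely contain a listed word).

```typescript
// Hypothetical list contents; the issue does not specify either list.
const RESERVED_NAMES = new Set<string>(["std", "deno"]);
const BAD_WORDS = new Set<string>(["badword"]);

// Returns a rejection reason, or null when the name passes both checks.
function screenModuleName(name: string): string | null {
  const normalized = name.toLowerCase();

  // Step 1: exact match against the reserved names.
  if (RESERVED_NAMES.has(normalized)) return "reserved";

  // Step 2: compare each underscore-separated segment with the word list.
  // Substring matching would be stricter but produces more false positives.
  const segments = normalized.split("_");
  if (segments.some((segment) => BAD_WORDS.has(segment))) return "bad word";

  return null;
}
```

With these placeholder lists, `screenModuleName("std")` is rejected as reserved, while `screenModuleName("oak")` passes.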

lucacasonato avatar Aug 10 '20 16:08 lucacasonato

3. Disallow any module names that have a Levenshtein distance of less than 3 to any other existing module name, bad word, or reserved module name. (up for grabs)

Unless I'm misunderstanding, I'm not sure how this can possibly work. E.g. eslint and tslint, or any two dictionary words that happen to be a letter apart (https://listography.com/spamtastic/words/that_are_one_letter_apart), let alone 2. What do npm or cargo do about this?

nayeemrmn avatar Aug 10 '20 17:08 nayeemrmn

Unless I'm misunderstanding, I'm not sure how this can possibly work. E.g. eslint and tslint, or any two dictionary words that happen to be a letter apart (https://listography.com/spamtastic/words/that_are_one_letter_apart), let alone 3.

I am not locked into the exact distance (if 1 gives the desired results, we can do that). What we are trying to prevent is someone registering oak2 or oakk or 0ak, so that if you mistype or are not yet familiar with Deno modules, you do not accidentally install the wrong (possibly malicious) module. I don't want someone to publish color and someone else to publish colour. Things like that are so confusing.
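
For reference, a sketch of the distance check under discussion, written in TypeScript. The `tooClose` helper and its `minDistance` parameter are hypothetical names, not taken from the registry code. The distance function is the standard dynamic-programming Levenshtein algorithm.

```typescript
// Levenshtein distance: the minimum number of single-character
// insertions, deletions, or substitutions that turn `a` into `b`.
function levenshtein(a: string, b: string): number {
  // prev[j] is the distance between the processed prefix of `a`
  // and the first j characters of `b`.
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);

  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(
        Math.min(
          prev[j] + 1, // delete from a
          curr[j - 1] + 1, // insert into a
          prev[j - 1] + cost, // substitute (or keep on a match)
        ),
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical helper: existing names closer to the candidate than minDistance.
function tooClose(
  candidate: string,
  existing: readonly string[],
  minDistance: number,
): string[] {
  return existing.filter((name) => levenshtein(candidate, name) < minDistance);
}
```

Every pair named so far in this thread is at distance 1: oak/oak2, oak/oakk, oak/0ak, color/colour, and also eslint/tslint. A threshold of 1 or more therefore blocks the typosquats and the legitimate neighbour alike, which is the trade-off being debated.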

Yeah, this means that some module names are not available, but I think that cost is worth it.

What do npm or cargo do about this?

AFAIK npm does nothing about this (see https://www.npmjs.com/package/exxpress or https://www.npmjs.com/package/expres). For cargo I do not know.

lucacasonato avatar Aug 10 '20 18:08 lucacasonato

I don't think it's worth it. Especially with Deno, where you're more likely to get the correct URL by copy-pasting it from somewhere, the mistyping problem should be rare, and we can chalk the rest of it up to personal responsibility. There just isn't a nice rhyme or reason to what words are close together in distance -- weird names can rule out ubiquitous names just by being there first. And as I said, it's far too common for everyday words to be a letter apart.

There are better solutions. Have a dictionary for pairs like color and colour to make them explicitly mutually exclusive. Allow reputable modules to "claim" similar names (as they would buy similar domains). Use down-scoring based on name similarity to something well-known.
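
A rough sketch of the first suggestion only (the variant dictionary), in TypeScript. The dictionary entries, the function names, and the per-segment normalization are all assumptions for illustration. Name claiming and down-scoring are not shown.

```typescript
// Hypothetical variant dictionary: spelling variant -> canonical spelling.
const CANONICAL_SPELLING = new Map<string, string>([
  ["colour", "color"],
  ["serialise", "serialize"],
]);

// Rewrites each underscore-separated segment to its canonical spelling.
function canonicalName(name: string): string {
  return name
    .toLowerCase()
    .split("_")
    .map((segment) => CANONICAL_SPELLING.get(segment) ?? segment)
    .join("_");
}

// Returns the existing name that collides with the candidate once both
// are canonicalized, or undefined when there is no collision.
function findVariantConflict(
  candidate: string,
  existing: readonly string[],
): string | undefined {
  const target = canonicalName(candidate);
  for (const name of existing) {
    if (canonicalName(name) === target) return name;
  }
  return undefined;
}
```

Under this scheme colour collides with an existing color, while tslint does not collide with eslint, because only spellings listed in the dictionary are treated as equivalent.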

nayeemrmn avatar Aug 10 '20 19:08 nayeemrmn

Perhaps this is a common issue in npm, where you mistype a letter and get the wrong module, but the URL system does demand more attention from the user when choosing a library.

Oak2 might be a completely valid name to submit in my opinion.

Soremwar avatar Aug 10 '20 19:08 Soremwar

I can get started on the bad words filter 👌

wperron avatar Aug 14 '20 12:08 wperron

Found a couple of lists that we could use for the comparison:

  • https://gist.github.com/jamiew/1112488
  • https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en
  • http://www.bannedwordlist.com/lists/swearWords.txt

@lucacasonato what do you think?

wperron avatar Aug 14 '20 18:08 wperron

@wperron Thanks! Any of those work, I'm sure...

ry avatar Aug 14 '20 19:08 ry

Can't we just combine all three into one?

@wperron Do you think we should store it with the source code, or as a table in the database that we check against?

lucacasonato avatar Aug 14 '20 20:08 lucacasonato

I don't want to have the list just disappear from under us, so my plan was to copy the list into the project. Tbh, I don't know if creating a collection in Mongo just to store a couple of swear words is really worth it, plus putting it in the repository gives it a lot of visibility, we can link to the file in the README for example.

As for combining all three of them, yes of course we can 😛
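
A minimal sketch of combining the lists, in TypeScript. It assumes each source has already been downloaded and converted to plain text with one entry per line, which may not match the actual format of every list linked above. `mergeWordLists` is a hypothetical name.

```typescript
// Merge several word lists into one sorted, de-duplicated list.
// Assumes one entry per line in each source text.
function mergeWordLists(sources: readonly string[]): string[] {
  const seen = new Set<string>();
  for (const text of sources) {
    for (const line of text.split(/\r?\n/)) {
      // Normalize so that "Foo" and " foo " count as the same entry.
      const word = line.trim().toLowerCase();
      if (word !== "") seen.add(word);
    }
  }
  return Array.from(seen).sort();
}
```

The merged output can then be committed to the repository or loaded into a database collection, whichever storage option is chosen.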

wperron avatar Aug 14 '20 20:08 wperron

plus putting it in the repository gives it a lot of visibility

We might not want that. Getting around it is a lot easier then :-). A database collection makes it a lot harder to find which words are included.

lucacasonato avatar Aug 14 '20 21:08 lucacasonato

@lucacasonato do you have a list of reserved module names ready to go? I could include that check in #81 while I'm at it

wperron avatar Aug 20 '20 12:08 wperron

@wperron Reserved module names are now handled as unlisted modules without uploaded versions. Easier because we can store them in the DB that way.
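
To illustrate the idea, a sketch in TypeScript with an in-memory map standing in for the database. The record shape, field names, and function names are hypothetical and not the registry's actual schema. The point is that a reserved name is simply a pre-created record, so the ordinary "name already taken" check covers it.

```typescript
// Hypothetical record shape; not the registry's real schema.
interface ModuleRecord {
  name: string;
  unlisted: boolean;
  versions: string[];
}

// In-memory stand-in for the modules collection.
const modules = new Map<string, ModuleRecord>();

// Reserving a name creates an unlisted module with no uploaded versions.
function reserveName(name: string): void {
  modules.set(name, { name, unlisted: true, versions: [] });
}

// Registration fails when any record exists, reserved or real.
function registerModule(name: string): boolean {
  if (modules.has(name)) return false;
  modules.set(name, { name, unlisted: false, versions: [] });
  return true;
}

// Reserved names never show up in public listings.
function listedModules(): string[] {
  return Array.from(modules.values())
    .filter((record) => !record.unlisted)
    .map((record) => record.name);
}
```

After `reserveName("std")`, a later `registerModule("std")` returns false and std stays out of `listedModules()`.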

lucacasonato avatar Aug 20 '20 13:08 lucacasonato

Hi!

I'm making a package name validation library. https://github.com/TomokiMiyauci/is-valid-package-name/tree/beta/deno_land

Deno seems to validate module names against the contents of badwords.txt in S3. Is there a way for me to check the contents of badwords.txt?

TomokiMiyauci avatar Jun 05 '21 07:06 TomokiMiyauci

Deno seems to validate module names against the contents of badwords.txt in S3. Is there a way for me to check the contents of badwords.txt?

Not currently; badwords.txt is stored in a private S3 bucket with public access blocked: https://github.com/denoland/deno_registry2/blob/main/terraform/main.tf#L110-L134

wperron avatar Jun 07 '21 11:06 wperron

Yeah, I checked.

putting it in the repository gives it a lot of visibility, we can link to the file in the README for example.

Do you plan to release the file?

TomokiMiyauci avatar Jun 07 '21 14:06 TomokiMiyauci

Not at the moment; see Luca's answer above.

wperron avatar Jun 07 '21 14:06 wperron

@wperron Thank you for answering

TomokiMiyauci avatar Jun 07 '21 15:06 TomokiMiyauci