Newsletter being overrun by illicit repos

Open cdhagmann opened this issue 1 year ago • 11 comments

Screenshot 2024-11-04 at 9 44 32 AM

The number of sketchy repos (the first 13 in the Nov 1 newsletter are illicit) has gotten to the point that Gmail automatically flags the newsletter as spam and disables all links, even after I explicitly mark it as not spam.

cdhagmann avatar Nov 04 '24 14:11 cdhagmann

I have noticed that most of these repos contain only a .zip file and have far more tags than normal (most including the word free). I will hopefully have time later this week to look at the code and see if my skillset is enough to offer a PR.

cdhagmann avatar Nov 04 '24 14:11 cdhagmann

I've just come here to look at the same. I really like the newsletter, but it's being completely overrun, as you say. I have zero Ruby skills (I'm a Go dev), but I think it should be pretty straightforward to filter the majority out. Most seem to contain the word free and the current year. We could just have a regexp checking for that as a start.
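A minimal sketch of that regexp idea, for illustration only. The helper name `looks_like_spam?` and its `year:` parameter are hypothetical and are not part of the nightly codebase; the rule simply requires both the word free and the given year to appear.

```ruby
# Hypothetical helper (not from the nightly codebase): flags text that
# contains the standalone word "free" AND the given year.
FREE_WORD = /\bfree\b/i

def looks_like_spam?(text, year: Time.now.year)
  text = text.to_s
  # \b treats "_" as a word character, so "free_2024" would not match;
  # hyphen- and space-separated names do.
  text.match?(FREE_WORD) && text.match?(/\b#{year}\b/)
end

looks_like_spam?("Photoshop Free Download 2024", year: 2024)    # => true
looks_like_spam?("A free-software license checker", year: 2024) # => false (no year)
looks_like_spam?("Freedom 2024", year: 2024)                    # => false ("Freedom" is not "free")
```

Requiring both signals is narrower than blocking either word alone, which is one way to limit the number of legitimate repos caught by the filter.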

alexjbarnes avatar Nov 13 '24 19:11 alexjbarnes

It's a cat/mouse game. Here you can see a list of words we already consider malware:

https://github.com/thechangelog/nightly/blob/d1a9e73e1aafda4adfd5db50cb5db78b1cfd4a88/lib/core_ext/string.rb#L43-L48

The challenge with blocking repos with the words free or the current year in the title/description is how many legit repos will you also block doing that?

jerodsanto avatar Nov 13 '24 19:11 jerodsanto

Yes I agree, that is definitely a risk but I guess there comes a point where potentially blocking some legitimate repos may be better than having the list be 90% illegitimate repos?

alexjbarnes avatar Nov 13 '24 19:11 alexjbarnes

Interestingly, if you look at this morning's email, the illegitimate repos are now all 404 Not Found. I'm guessing they have all been removed by GitHub. Could we check that they still exist first, assuming the stats in BigQuery lag behind slightly?

alexjbarnes avatar Nov 13 '24 19:11 alexjbarnes

> Yes I agree, that is definitely a risk but I guess there comes a point where potentially blocking some legitimate repos may be better than having the list be 90% illegitimate repos?

For sure. But that goes back to the cat/mouse game. I've played it for a while, but it never ends. This is just the latest iteration for the mouse. A few days after I block free or download, the naming changes again...

> Interestingly, if you look at this morning's email, the illegitimate repos are now all 404 Not Found. I'm guessing they have all been removed by GitHub. Could we check that they still exist first, assuming the stats in BigQuery lag behind slightly?

GitHub often gets them removed by the next day, but it's rarely the case by the time we publish. Another idea I had, which I was hoping would be more foolproof, was to identify a set of repos that are spam and a set of repos that are not (given name, URL, and description only) and give them to an LLM, asking it to determine whether a given repo is spam/malware based on those two data sets. Unfortunately, in my testing this proved... inaccurate. (That was maybe a year ago, though, so maybe they've gotten better?)

There are other rules we could enforce, such as checking whether the repo only has one zip file, but that also requires more API calls, and this code is a bit ossified already, being a ten-year-old Ruby project.
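As a rough sketch of the one-zip-file rule, assuming the repo's top-level file names have already been fetched (that fetch is the extra API call mentioned above and is not shown). The method name `only_a_zip?` is hypothetical, and ignoring README files is an added assumption.

```ruby
# Hypothetical check (not from the nightly codebase): true when the only
# file besides an optional README is a single .zip archive.
def only_a_zip?(file_names)
  # Assumption: a README next to the archive should not count as real content.
  payload = file_names.reject { |name| name.downcase.start_with?("readme") }
  payload.size == 1 && File.extname(payload.first).downcase == ".zip"
end

only_a_zip?(["README.md", "Setup.zip"])           # => true
only_a_zip?(["README.md", "main.rb", "data.zip"]) # => false
```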

jerodsanto avatar Nov 13 '24 19:11 jerodsanto

Ah OK, yes, that's fair enough. I appreciate that must be frustrating. Would you be open to PRs, or does it seem like a lost cause at this point? It would be a shame, as I've discovered lots of good stuff through the newsletter.

One thought I had was whether it would be possible to add a lag: say the stats run 24 hours behind where they are now. Then we could rely on GitHub having removed the spam repos and simply check that each repo still exists. Happy to have a look if it could be of use.
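A sketch of the existence check, under the assumption that a removed repo answers with HTTP 404 at its URL, as described earlier in the thread. The names `still_exists?`, `drop_removed`, and `HEAD_STATUS` are hypothetical; the status lookup is injectable so the filter can be exercised without network access.

```ruby
require "net/http"
require "uri"

# Default lookup: sends a HEAD request and returns the integer HTTP status.
HEAD_STATUS = lambda do |url|
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.head(uri.request_uri).code.to_i
  end
end

# Hypothetical helper: treats anything other than 404 as "still there".
def still_exists?(url, fetch_status: HEAD_STATUS)
  fetch_status.call(url) != 404
end

# Keeps only the URLs that have not been removed.
def drop_removed(urls, fetch_status: HEAD_STATUS)
  urls.select { |url| still_exists?(url, fetch_status: fetch_status) }
end
```

In a delayed pipeline this would run over yesterday's candidates just before publishing; it costs one request per repo.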

alexjbarnes avatar Nov 13 '24 20:11 alexjbarnes

Could we just use more of a rolling-average approach? Do not show any repo that isn't at least 3 days old, but let all three days count toward making it into the new list.

EDIT: I'll learn to read some day. I like @alexjbarnes's idea and would be willing to look into implementing it.
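For comparison, the three-day rolling idea could look roughly like this. The hash shape (`:name`, `:created_on`, `:daily_stars`) and the method name `eligible_ranked` are assumptions made for the sketch, not the project's data model.

```ruby
require "date"

MIN_AGE_DAYS = 3

# repos: array of hashes with :name (String), :created_on (Date) and
# :daily_stars (Array<Integer>, oldest first). Hypothetical shape.
def eligible_ranked(repos, today: Date.today)
  repos
    .select { |repo| (today - repo[:created_on]).to_i >= MIN_AGE_DAYS }
    # Rank by stars summed over the most recent MIN_AGE_DAYS days.
    .sort_by { |repo| -repo[:daily_stars].last(MIN_AGE_DAYS).sum }
    .map { |repo| repo[:name] }
end
```

A repo created yesterday is held back regardless of its star count, which gives GitHub time to remove it before it can appear in the list.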

cdhagmann avatar Nov 13 '24 20:11 cdhagmann

I'm not considering it a lost cause, just kinda in the dumps about it.

Will definitely accept PRs. I've considered delaying the Top Starred Repositories – First Timers and Top New Repositories lists by a day, but that kinda defeats the purpose of the email, which is to:

> unearth the top new and top starred projects on GitHub before they blow up

For now I'll go ahead and block a few more keywords because it is getting ridiculous again. Specifically, I'm going to add 'free', 'download' and 'crypto' to the list of malware words. That will certainly exclude some legit repos, but it's probably a trade-off worth making at this point.

What would be totally cool is some kind of API (maybe a separate project?) that I could hit with a repo URL and it returns how likely it is to be spam/malware or not, maybe with a confidence score. I'd certainly integrate something like that...
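One possible shape for the scoring core of such a service, as a sketch only. The signal names and weights are illustrative assumptions and have not been tuned against real data; the word list uses the three keywords named above, and `spam_report` is a hypothetical name.

```ruby
# Illustrative weights only; they sum to 1.0 so the score stays in 0.0..1.0.
SIGNAL_WEIGHTS = {
  malware_word: 0.5,
  no_language: 0.3,
  single_zip: 0.2
}.freeze

MALWARE_WORDS = %w[free download crypto].freeze

# Returns a score plus the list of signals that fired.
def spam_report(name:, description:, language:, files: [])
  text = "#{name} #{description}".downcase
  signals = {
    malware_word: MALWARE_WORDS.any? { |word| text.include?(word) },
    no_language: language.to_s.strip.empty?,
    single_zip: files.size == 1 && files.first.downcase.end_with?(".zip")
  }
  score = signals.sum { |key, hit| hit ? SIGNAL_WEIGHTS[key] : 0.0 }
  { score: score.round(2), signals: signals.select { |_, hit| hit }.keys }
end
```

Returning the fired signals alongside the number lets a caller choose its own threshold and see why a repo was flagged.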

jerodsanto avatar Nov 14 '24 14:11 jerodsanto

I'll see what I can do re: confidence. Also, I have noticed that most of these repos do not register as having a programming language. Do you think we could have two pipelines, one where repos without a programming language are checked against a longer list of malware words?

cdhagmann avatar Nov 15 '24 13:11 cdhagmann

That's certainly a possibility. We already have a no_language? method on the Repo class, so the malware? method could call that first and branch from there. Currently I'm implementing malware? as a String method, but that could be moved to Repo pretty easily...
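A sketch of how that branch might look if `malware?` lived on the repo object. `SketchRepo` is a stand-in written for this example, not the project's actual `Repo` class, and the extra words for language-less repos are placeholders.

```ruby
BASE_WORDS = %w[free download crypto].freeze
# Placeholder extras; the real stricter list would need to be chosen with care.
EXTRA_WORDS_WITHOUT_LANGUAGE = %w[crack keygen].freeze

# Stand-in for the project's Repo class (hypothetical fields).
SketchRepo = Struct.new(:name, :description, :language, keyword_init: true) do
  def no_language?
    language.to_s.strip.empty?
  end

  # Repos with no detected language are checked against the longer list.
  def malware?
    words = no_language? ? BASE_WORDS + EXTRA_WORDS_WITHOUT_LANGUAGE : BASE_WORDS
    text = "#{name} #{description}".downcase
    words.any? { |word| text.include?(word) }
  end
end

SketchRepo.new(name: "tool", description: "keygen pack", language: nil).malware?  # => true
SketchRepo.new(name: "tool", description: "keygen pack", language: "Go").malware? # => false
```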

Side note: Last night's email was pretty clean after adding those additional strings:

https://nightly.changelog.com/2024/11/14

That Solana repo is probably trash, but other than that...

jerodsanto avatar Nov 15 '24 15:11 jerodsanto