russian-troll-tweets

Many entries tagged with language=Swedish are in fact in German

Open olofhagsand opened this issue 6 years ago • 9 comments

I went through the entries and found 1021 entries marked as language=Swedish. Looking in more detail, many of these are actually German, such as entry nr 322020 in IRAhandle_tweets_1.csv:

7.25000000000e+17,BERLINBOTE,Bernd Krömer: Amri-Ausschuss will früheren Innenstaatssekretär vernehmen https://t.co/qKqGZyBEma,Unknown,Swedish,9/8/2017 5:51,9/8/2017 5:51,2230,1779,22274,,German,0,0,NonEnglish

The tweet is definitely German, not Swedish, as all tweets from BERLINBOTE seem to be. Did you perhaps make the mistake of treating "ö" and "ä" as identifiers for Swedish?
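To illustrate why "ö" and "ä" alone cannot separate the two languages, here is a minimal sketch that parses the sample row and applies a rough character check. The column names in `ASSUMED_HEADER` are an assumption inferred from the field order of the row, and `umlaut_hint` is a hypothetical helper, not the method used to label the dataset. The check is crude: "ü" does occur in names and loanwords in Swedish text.

```python
import csv
import io

# ASSUMPTION: column names inferred from the field order of the sample row;
# the real CSV header may differ.
ASSUMED_HEADER = [
    "external_author_id", "author", "content", "region", "language",
    "publish_date", "harvested_date", "following", "followers", "updates",
    "post_type", "account_type", "new_june_2018", "retweet",
    "account_category",
]

# The row quoted in the issue (entry nr 322020 in IRAhandle_tweets_1.csv).
SAMPLE_ROW = (
    "7.25000000000e+17,BERLINBOTE,Bernd Krömer: Amri-Ausschuss will früheren "
    "Innenstaatssekretär vernehmen https://t.co/qKqGZyBEma,Unknown,Swedish,"
    "9/8/2017 5:51,9/8/2017 5:51,2230,1779,22274,,German,0,0,NonEnglish"
)

GERMAN_ONLY = set("üÜß")   # part of the German alphabet, not the Swedish one
SWEDISH_ONLY = set("åÅ")   # part of the Swedish alphabet, not the German one
# "ä" and "ö" belong to both alphabets, so they carry no signal here.


def parse_row(line):
    """Parse one CSV line into a dict keyed by the assumed header."""
    values = next(csv.reader(io.StringIO(line)))
    return dict(zip(ASSUMED_HEADER, values))


def umlaut_hint(text):
    """Return 'german', 'swedish', or 'ambiguous' from distinguishing letters."""
    chars = set(text)
    has_german = bool(chars & GERMAN_ONLY)
    has_swedish = bool(chars & SWEDISH_ONLY)
    if has_german and not has_swedish:
        return "german"
    if has_swedish and not has_german:
        return "swedish"
    return "ambiguous"


row = parse_row(SAMPLE_ROW)
print(row["author"], row["language"], umlaut_hint(row["content"]))
```

A word such as "Krömer" on its own comes out as ambiguous, which is the point: only letters outside the shared set give any hint.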

olofhagsand, Aug 01 '18 20:08

It doesn't look like very thorough research. I'm curious how much labor it takes to write 3 million tweets in 2 years. Besides being in several languages, the tweets don't appear to correlate with the election. They seem to be more correlated with normal internet use.

ericodavis, Aug 02 '18 10:08

@ericodavis I got info from one person working with commercial fake FB entries: each 120-word entry was paid $2, at a rate of about 10 entries per hour. These tweets are shorter. But extrapolating from those figures and freely speculating, and without automatic bots, this could yield a rate of ~100 tweets per person per day, so that 3M tweets could be produced by ~100 persons in two years at a cost of ~$100M. Again, this may be wildly off; I hope there are better estimates out there.
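The arithmetic behind this estimate can be sketched as follows, using only the figures quoted in the comment. Posting on every calendar day is an added assumption, and the function names are hypothetical. Under these inputs the piece-rate cost comes out at about $6M, so the ~$100M total would have to include costs the comment does not state.

```python
# Figures quoted in the comment above.
TOTAL_TWEETS = 3_000_000
TWEETS_PER_PERSON_PER_DAY = 100
YEARS = 2
PRICE_PER_ENTRY_USD = 2.0   # quoted price for a 120-word entry

# ASSUMPTION: writers post on every calendar day.
DAYS_PER_YEAR = 365


def persons_needed(total_tweets, per_person_per_day, days):
    """Number of full-time writers needed to reach the total in `days`."""
    return total_tweets / (per_person_per_day * days)


def piece_rate_cost(total_tweets, price_per_entry):
    """Total cost if every tweet is paid at the quoted per-entry price."""
    return total_tweets * price_per_entry


days = YEARS * DAYS_PER_YEAR
persons = persons_needed(TOTAL_TWEETS, TWEETS_PER_PERSON_PER_DAY, days)
cost = piece_rate_cost(TOTAL_TWEETS, PRICE_PER_ENTRY_USD)

print(f"writers needed: {persons:.0f}")
print(f"piece-rate cost: ${cost:,.0f}")
```

Since tweets are shorter than 120-word entries, the per-tweet price is if anything lower than $2, which makes the piece-rate figure an upper bound under this model.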

olofhagsand, Aug 02 '18 16:08

@olofhagsand This is a frequent problem that has to do with Twitter's language detection algorithm, particularly on short tweets. The same happens between Danish and Norwegian.

johannessweater, Aug 03 '18 07:08

@johannessweater OK thanks. So this is Twitter's own classification? I thought it may have been the research group's. After browsing the 2252 entries marked as "Norwegian", I can detect no Norwegian at all. At least maybe 50% of the Swedish entries were actually Swedish. So I conclude the language classification is useless.

olofhagsand, Aug 03 '18 09:08

@olofhagsand Language is part of the metadata that comes with tweets so I'm guessing that's where the language label comes from. But yikes, 50 percent is bad. In my experience it's usually more like 95 percent.

johannessweater, Aug 03 '18 09:08

@johannessweater Going through them, I verified 66 entries as Swedish out of 1021 marked as Swedish (out of 3M total). That corresponds to 6.5% correctly labelled as Swedish. There may be more entries in actual Swedish marked as other languages; I have not looked for them.

Even worse, among the 2252 marked as Norwegian, 581 marked as Finnish and 499 marked as Icelandic, I detected 0% correctness, i.e., none of these were actually in Norwegian, Finnish or Icelandic. But I detected one in Swedish that was marked as Norwegian (nr 2826690).

And BTW, for your interest, here are the 66 entries marked as Swedish that I confirmed to be actual Swedish:

95420 95696 98517 102129 102153 102594 102870 102909 104564 106679 109151 109931 111427 111749 112482 115190 116317 119412 121071 122736 123911 124397 124684 207390 651135 897811 904937 907197 907285 907428 907763 970049 970728 980633 1073220 1208471 1231385 1231797 1231853 1235321 1235424 1235561 1235613 1324426 1624583 1648180 2109106 2109525 2114772 2115070 2115306 2115722 2198212 2826196 2826215 2826494 2830468 2830606 2830993 2831009 2831134 2831184 2935489 2944234 2968211 2968230
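The percentages above are per-label precision: the share of entries carrying a tag that are really in that language. A minimal sketch using only the counts reported in this comment (the dictionary and function names are hypothetical):

```python
# (verified correct, total tagged) per language label, as reported above.
MANUAL_CHECK = {
    "Swedish": (66, 1021),
    "Norwegian": (0, 2252),
    "Finnish": (0, 581),
    "Icelandic": (0, 499),
}


def precision(correct, tagged):
    """Fraction of tagged entries that are really in the tagged language."""
    return correct / tagged


for label, (correct, tagged) in MANUAL_CHECK.items():
    print(f"{label}: {100 * precision(correct, tagged):.1f}% of {tagged}")

# Pooled over all four Nordic labels that were checked by hand.
total_correct = sum(c for c, _ in MANUAL_CHECK.values())
total_tagged = sum(t for _, t in MANUAL_CHECK.values())
print(f"pooled: {100 * precision(total_correct, total_tagged):.1f}%")
```

This measures precision only. Recall (Swedish tweets tagged as something else, such as nr 2826690) is not covered by these counts.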

olofhagsand, Aug 03 '18 10:08

This is how you do a Troll Army. I wonder when this will be sorted out.

ericodavis, Aug 06 '18 21:08

I suspect the language field is meaningless. For example, BERLINBOTE has tweets tagged in 26 different languages, but samples tagged Spanish, Vietnamese, and Polish are clearly in German. I suspect all of its tweets are in German. So either the language tagging done by Twitter is crap, or the language tag attached to the data is not the language tag inferred by Twitter.
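A check like this one can be sketched by collecting the distinct language tags per account. The rows below are hypothetical placeholders that reuse only the account name and the four tags mentioned in this thread. The `author` and `language` keys are assumed column names.

```python
from collections import defaultdict


def languages_per_author(rows):
    """Map each author to the set of language tags seen on its tweets."""
    tags = defaultdict(set)
    for row in rows:
        tags[row["author"]].add(row["language"])
    return dict(tags)


# HYPOTHETICAL rows for illustration; real rows would come from the CSV files.
rows = [
    {"author": "BERLINBOTE", "language": "Swedish"},
    {"author": "BERLINBOTE", "language": "Spanish"},
    {"author": "BERLINBOTE", "language": "Vietnamese"},
    {"author": "BERLINBOTE", "language": "Polish"},
    {"author": "BERLINBOTE", "language": "Polish"},
]

for author, tags in languages_per_author(rows).items():
    print(author, len(tags), sorted(tags))
```

An account with a large number of distinct tags, as reported for BERLINBOTE, is a candidate for manual sampling rather than proof of mislabelling on its own.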

jpallas, Aug 11 '18 05:08

@jpallas @olofhagsand Yeah, this doesn't seem to be Twitter's language tag. I checked some of these tweets against duplicate tweets I was able to find in my own database from the election, and the language fields don't match up.
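A comparison of this kind could look roughly like the sketch below, which matches tweets on author and text and reports differing language labels. Everything here is an assumption for illustration: the key names, the small name-to-code mapping, and the sample rows are hypothetical, and this is not the procedure actually used in the comment above.

```python
# ASSUMPTION: the dataset stores language names while the reference store
# uses two-letter codes; only the two languages needed here are mapped.
NAME_TO_CODE = {"German": "de", "Swedish": "sv"}


def label_mismatches(dataset_rows, reference_rows):
    """Return (author, dataset_label, reference_code) for disagreeing tweets."""
    reference = {
        (r["author"], r["content"]): r["lang"] for r in reference_rows
    }
    mismatches = []
    for row in dataset_rows:
        key = (row["author"], row["content"])
        if key not in reference:
            continue  # no duplicate found in the reference store
        expected = NAME_TO_CODE.get(row["language"])
        if expected != reference[key]:
            mismatches.append((row["author"], row["language"], reference[key]))
    return mismatches


# HYPOTHETICAL rows for illustration only.
dataset_rows = [
    {"author": "BERLINBOTE", "content": "tweet A", "language": "Swedish"},
    {"author": "BERLINBOTE", "content": "tweet B", "language": "German"},
]
reference_rows = [
    {"author": "BERLINBOTE", "content": "tweet A", "lang": "de"},
    {"author": "BERLINBOTE", "content": "tweet B", "lang": "de"},
]

print(label_mismatches(dataset_rows, reference_rows))
```

Matching on exact text is fragile (truncation, URL rewriting), so a real comparison would more likely join on a tweet identifier where one is available.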

johannessweater, Aug 11 '18 10:08