cmc-csci040 topic 7 lab tweet/word counts

The instructions for the lab say to download/use the files that start with 'master,' but the numbers given to check that our functions are working right correspond to the counts from the files that start with 'condensed.' Which set of files should we use?

Mar 15 '25 02:03 charlottegjordan

that happened to me too

Mar 16 '25 22:03 ShaeDelany

Hmm... this surprised me. But I just double-checked and you're right. The short answer is: it doesn't matter which set of files you use.

The long answer: The master* set of files includes the full output of the Twitter API, and the condensed* set of files is the output of the API preprocessed in a way that takes up less disk space. The full API output changes over time as Twitter adds/removes features. One feature that was added in 2017 is that the max length of tweets was extended from 140 to 280 characters. To accommodate this change, a new field full_text was added that contains all of the characters, and text is used just for the initial 140. Searching through the master* set of files using only the text field means that you will miss any tweets that Trump sent after that change if the keyword appears only in the last half of the tweet, past the first 140 characters. It seems that the preprocessing done in the condensed* set of files moves this extra information directly into the text field so that all the files can be processed by downstream tasks in a more consistent manner.
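
As a rough sketch of how you could guard against this in the master* files (not the lab's reference solution): prefer full_text when a tweet has it, and fall back to text otherwise. The field names full_text and text come from the API output described above. Everything else here is an assumption for illustration: the helper names tweet_body and count_tweets_containing, the made-up sample tweets, and the choice to match keywords case-insensitively.

```python
def tweet_body(tweet):
    """Return the complete text of a tweet dict.

    Prefers 'full_text' (present on longer, post-2017 tweets) and
    falls back to 'text'. Returns '' if neither field is usable.
    """
    return tweet.get("full_text") or tweet.get("text") or ""


def count_tweets_containing(tweets, keyword):
    """Count tweets whose complete text contains keyword (case-insensitive)."""
    keyword = keyword.lower()
    return sum(1 for tweet in tweets if keyword in tweet_body(tweet).lower())


# Hypothetical sample data, NOT taken from the lab files.
sample_tweets = [
    {"text": "short tweet about trade"},
    {
        "text": "a long tweet whose first part is here",
        "full_text": "a long tweet whose first part is here and mentions trade near the end",
    },
    {"text": "nothing relevant"},
]

# Naive count: only looks at 'text', so it misses the second tweet.
naive = sum(1 for t in sample_tweets if "trade" in t["text"].lower())

# Count that also checks 'full_text'.
fixed = count_tweets_containing(sample_tweets, "trade")

print(naive, fixed)  # 1 2
```

If your counts on the master* files come out lower than the numbers in the instructions, a difference like the one between naive and fixed above is the first thing to check.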

Sorry that I didn't notice this earlier and made you encounter this problem. But it's also a good lesson about data preparation and how very subtle problems can appear with seemingly minor changes in code. One of the big takeaways of this part of class is that creating plots is really easy in Python, but ensuring that you're actually plotting the right data is never easy.

Mar 17 '25 03:03 mikeizbicki