Text Mining YouTube Comment Data with Wordfish in R
The Programming Historian has received the following tutorial “Text Mining YouTube Comment Data with Wordfish in R” by @hawc2, @jantsen, and @nlgarlic. This lesson is now under review and is available here:
http://programminghistorian.github.io/ph-submissions/en/drafts/originals/text-mining-youtube-comments
Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.
I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I have already read through the lesson and provided feedback, to which the author has responded.
Members of the wider community are also invited to offer constructive feedback, which should be posted to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.
I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me.
Our dedicated Ombudsperson is Ian Milligan (http://programminghistorian.org/en/project-team). Please feel free to contact him at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudsperson will have no impact on the outcome of any peer review.
Anti-Harassment Policy
This is a statement of the Programming Historian’s principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.
The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or request clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. Thank you for helping us to create a safe space.
This is a very interesting tutorial that I think our audience will enjoy. There are two macro issues that need to be addressed before moving forward, as well as some suggestions for fixing small typos:
Macro Issues
- The code blocks throughout this tutorial need to contain comments. Right now, they are difficult to follow.
- When submitting the tutorial, it would be better if it were turned into a Markdown file first, so that those who are reviewing it can follow it more easily. The code should still be available to create the visualizations, and each visualization should be below its code. There is information on how to do that here: https://bookdown.org/yihui/rmarkdown/markdown-document.html.
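For instance, a minimal sketch of an .Rmd header that would knit to a Markdown document (the output options here are just one plausible configuration, not a requirement):

```yaml
---
title: "Text Mining YouTube Comment Data with Wordfish in R"
output:
  md_document:
    variant: gfm
    preserve_yaml: true
---
```

Knitting the file (or calling `rmarkdown::render()` on it) would then produce a .md file in which each visualization appears below the code chunk that generates it.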
Micro Issues
- Paragraph 1 in Introduction to YouTube Scraping and Analysis: remove the second sentence
- Paragraph 2 in Introduction to YouTube Scraping and Analysis: remove the third word ("also")
- Paragraph 3 in Introduction to YouTube Scraping and Analysis: remove "both" in the first sentence
- Paragraph 3 in Introduction to YouTube Scraping and Analysis: "with the formation of organizations such as the Association of Internet Researchers"
- Paragraph 5 in Introduction to YouTube Scraping and Analysis: "Through this tutorial, you will learn how to access the YouTube API, process and clean the video metadata, and analyze the comment threads for ideological scaling."
- Paragraphs 8 and 9 in Introduction to YouTube Scraping and Analysis: these two paragraphs are not needed
- Paragraph 2 in Scraping the YouTube API: "beware" should read "be aware"
- After Paragraph 3 in Configuring Your Code: a screenshot is needed here
- Final paragraph in Configuring Your Code: "on the Github repository" should read "in the GitHub repository"
- Last two paragraphs in Configuring Your Code: these should be combined
Please let me know what your timeline is on this and any questions you may have @hawc2, @jantsen, and @nlgarlic.
Thanks @nabsiddiqui. I made the minor edits you mentioned, except that paragraphs 8 and 9 seem worth keeping, perhaps as a footnote?
I also wasn't sure which screenshot we should insert after Paragraph 3 in Configuring Your Code.
We will update the code to include more comments as you ask, and once we get the knitting to work correctly with the .rmd file, we will update the GitHub repo with the proper formatting for the markdown file.
We aim to be done with our edits next week. I'll update you when the file is ready for review.
Sounds good @hawc2. Let me know if you need anything else on my end. And yes, paragraphs 8 and 9 can be kept as footnotes.
Hi @nabsiddiqui and @hawc2! Just checking in to see if you needed any help moving this lesson forward.
Hi - thanks for reaching out. We actually just met to go over the changes this week. We expect to have them ready in the next couple of weeks at the latest.
Nikki
Hey @svmelton. @hawc2 had requested some additional time to work on this in an email he sent to me. I should have probably communicated that in the issue tracker. But no, I think we are all good on moving the lesson forward as planned.
Hello all,
Please note that this lesson's .md file has been moved to a new location within our Submissions Repository. It is now found here: https://github.com/programminghistorian/ph-submissions/tree/gh-pages/en/drafts/originals
A consequence is that this lesson's preview link has changed. It is now: http://programminghistorian.github.io/ph-submissions/en/drafts/originals/youtube-scraping-wordfish-r
Please let me know if you encounter any difficulties or have any questions.
Very best, Anisa
@nabsiddiqui we now have a new topic for R for the menu; please include it as "r" in the topics in the lesson metadata. Thanks
Thank you, @jenniferisasi! I've added this to the YAML for you @nabsiddiqui.
@nabsiddiqui we’re excited to report we finished updating our YouTube scraping tutorial and it should now be ready for review: https://programminghistorian.github.io/ph-submissions/en/drafts/originals/youtube-scraping-wordfish-r
Apologies for delays - after we submitted this lesson earlier in the pandemic, we discovered that there were a few sustainability issues, especially involving some of the libraries we were using for wrangling text data into WordFish. We've switched to `quanteda`, and in the process, we condensed the lesson and hopefully simplified/clarified a few sections. We've made some other updates as well in order to streamline the code, including removing specific directions for setting up access to the YouTube API, since those directions seem to be changing regularly, and we can link to the Google page with the directions.
We also removed a few options for granular scraping, including a way to search for videos through the API. We intend to provide some of these alternatives on our Github page for those who’d like to explore further. Near the end, we’ve reduced the number and complexity of visualizations, but if reviewers think it needs more, we could build that section out more. Probably some steps in the newer version of the code need to be explicated further.
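In rough outline, the new pipeline looks something like the following. This is a hedged sketch with toy comments invented purely so the example runs; the lesson's actual code differs:

```r
# A hedged sketch of the quanteda-based Wordfish pipeline
library(quanteda)
library(quanteda.textmodels)

comments <- c(doc1 = "justice for george floyd now",
              doc2 = "support our police officers",
              doc3 = "defund the police now",
              doc4 = "police officers protect and serve")

corp  <- corpus(comments)                        # build a corpus of comments
dfmat <- dfm(tokens(corp, remove_punct = TRUE))  # tokenize and build a DFM

# Wordfish scales each document along a single latent (e.g. ideological) axis
tmod <- textmodel_wordfish(dfmat)
summary(tmod)
```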
We look forward to getting feedback on this lesson!
Dr. Heather Lang @hlang264 and Dr. Janna Joceli Omena @jannajoceli have graciously agreed to serve as reviewers. We are shooting for a late August response for now @hawc2.
Thanks @nabsiddiqui.
One update - YouTube has created a streamlined path for researchers looking to access their API: https://research.youtube/
This might make some aspects of our tutorial easier, and may require some updating. We'll investigate how this process works and update our tutorial accordingly in the fall after we receive reviewer feedback.
Tutorial review by Janna Joceli Omena
Title: "Text Mining YouTube Comment Data with Wordfish in R"
GitHub link: https://programminghistorian.github.io/ph-submissions/en/drafts/originals/youtube-scraping-wordfish-r
Editor: Nabeel Siddiqui
Overall evaluation
This tutorial uses a natural language processing algorithm (Wordfish) to conduct textual analysis of YouTube comment data. It presents a good overview of YouTube as a platform. However, it requires some improvement in clarifying the data collection method to the reader. Moreover, the tutorial can benefit from some work on reorganising the order of sections and, in some cases, renaming the headings and subheadings. Finally, bullet point lists, annotated screenshots, gifs and short videos are recommended as pedagogic tools to be considered in this tutorial.
Review statement
The tutorial review will follow a bullet-point format, proposing suggestions, providing feedback and raising questions to the authors.
Part I: Introduction to YouTube Scraping and Analysis
- Define web scraping and crawling, as the academic community still does not fully understand these data collection methods.
- This section would benefit from one or two paragraphs presenting how YT has been studied and reviewing existing YouTube-related tutorials. This would help the authors to situate the proposed tutorial, explain its relevance to the reader, and show how it differs from others.
- Is there a reference (i.e. a paper, GitHub repository, white paper) for the Wordfish algorithm? If so, I recommend that the authors include it in the text. Moreover, all tools or scripts in use or mentioned in the text also deserve a proper reference ;)
- Data collection methods refer to different technicalities; for example, web scraping, crawling and API calling have different features and functions to help scholars with the task of building a dataset. The authors mention that they have used the YouTube Data API to retrieve data from the platform. However, this section's title and subtitle use "scraping" as the method. So, what does this tutorial propose as the data collection method? Scraping the front-end interface of YT, or making calls to its API? From the former, one extracts data, while from the latter, one requests and retrieves data from an API. To help the reader understand and follow the tutorial and its data collection method, I recommend the authors make this clear. If it helps, I'm happy to share the pdf files of two-part guides offering an overview of the knowledge needed to collect data using APIs (see: https://dx.doi.org/10.4135/9781529611441 and https://dx.doi.org/10.4135/9781529611458).
- As for research ethics, AoIR provides good guidelines that the authors should consider including in the tutorial, i.e. Markham AN, Buchanan E (2012). Ethical decision-making and internet research: recommendations from the AoIR ethics working committee (version 2.0). Retrieved from: http://aoir.org/reports/ethics2.pdf; and Markham A (2017). Impact model for ethics: notes from a talk. Retrieved from: https://annettemarkham.com/2017/07/impact-model-ethics/. These would provide more concrete perspectives to the reader.
- The subsection "Introducing the Wordfish Text Mining Algorithm" could appear sooner in the text, as it explains the main objectives of the tutorial and what those interested in it need to do.
Part II: Scraping the YouTube API
- Please see my earlier comments and suggestions to reconsider the data collection method named in this section's title. Keep in mind that one makes API calls to request and retrieve data (not to scrape data; for that, a web scraper would do the job) ;)
- Maybe provide a table showing the API quota limits for comments based on the YT Data API (see the illustrative sketch after this list)? This could be a valuable source for the reader.
- Bullet points and a short title can help when explaining step-by-step procedures, for example:
  How to create YT credentials?
  1. First, xxxxxxxx
  2. Then, xxxxxxxxx
  3. Finally, xxxxxxxx
  Screenshots, gifs or short videos are often super helpful for these tasks.
- Suggestion: maybe rephrase the subtitle "Making a list of videos" to something like "How to create a YouTube comments dataset?" Also, it would help to provide a visual protocol summarizing all possibilities for this type of dataset building (i.e. video or channel ids and keywords as entry points for retrieving comments), while using the video comments as a practical example. This visual protocol should also include the requirements for using predictive modelling.
- The code chunk that combines video metadata with the comment text and comment metadata, while renaming some columns for clarity, is a nice proposal :)
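By way of illustration only, such a quota table might look like this (figures drawn from Google's YouTube Data API v3 documentation at the time of writing; they change, so please verify before publishing):

| Operation | Approximate quota cost |
| --- | --- |
| Default daily quota | 10,000 units |
| `videos.list` | 1 unit per call |
| `commentThreads.list` | 1 unit per call |
| `comments.list` | 1 unit per call |
| `search.list` | 100 units per call |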
Part III: Optimizing YouTube Comment Data For Wordfish
- Recommendation: use "Now that the comments are retrieved" rather than "scraped".
- Beyond explaining how Wordfish models work (great job here!), I recommend the authors provide a concrete and short example, so the reader can also perceive how it works in practical terms. This would help one to envision the methodological potential of Wordfish.
- I wonder if Wordfish models read mentions (@name), hashtags, links, and emojis. These relevant and valuable objects and actions would bring more context and richness to the textual analysis. Instead, the analysis automatically ignores YouTube usage practices in comments by removing mentions, hashtags, links and emojis (see the sketch after this list). I want to invite the authors to reflect on this matter.
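For concreteness, here is a minimal sketch of the kind of removal being questioned, using `stringr`; the `comments` vector and the regex patterns are illustrative assumptions, not the lesson's code:

```r
# A minimal sketch of stripping platform-specific features before modeling
library(stringr)

comments <- c("Great video @user1! #justice https://example.com 🔥")

cleaned <- comments |>
  str_remove_all("https?://\\S+") |>  # links
  str_remove_all("[@#]\\w+") |>       # mentions and hashtags
  str_remove_all("\\p{So}") |>        # pictographic symbols such as emoji
  str_squish()                        # tidy leftover whitespace
```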
Part IV: Modeling YouTube Comments in R with WordFish
- The opening section brings a detailed explanation of WordFish, comparing this model with topic modelling. It is an excellent subsection because it provides further information about the model. However, it should appear earlier, when the authors introduce WordFish. At this point, the reader is expecting to run the model and check the outputs of WordFish for YT comments, learning how to read and interpret them.
- I wonder if the authors could suggest an alternative visualisation to analyse the top words. Perhaps one that would avoid overlapping the words and facilitate interpretation.
Overall, I think the tutorial is very useful and well put together. I agree with many of Dr. Omena’s suggestions regarding clarity and usability. If following the tutorial one piece at a time, a user is likely to successfully gather and visualize this data. However, I think an alternate organization that begins with a clear justification for this method, overviews the functionality of each tool, and then ends in a more clear-cut set of directions would be useful. I also think more or different examples (see comments below) would add to the clarity of the tutorial and help researchers know when and how to employ these methods. I think most of my suggestions or comments come back to that idea: a tutorial is most useful when a user knows when and why to deploy it, not just how. I think more clarity in that regard would be useful.
Opening: It would be useful if there were a justification for why a person would need to use R or to scrape/crawl data. I think there is less need to justify YouTube as a site of research and more need to demonstrate to the audience what kinds of questions/problems this method of scraping and visualizing data might address. If the audience is scholars who don't know how to use these kinds of methods, they'll need to be able to make the connection between this method and their overarching goals.
P35-36: This section is a bit confusing. It's unclear how many words/comments would be needed for a successful model; an example would be useful here. Maybe link to a video whose comments model successfully, or to a linked CSV that demonstrates suitable comments.
P38: There is guidance here about what is “better”, but what is “better” is determined by a researcher’s purpose and dataset. The use of “better” here is somewhat confusing to me because I am not sure what the goal is. If the goal is to model differential data points, then this makes more sense, but I think I’m confused about the purpose of the model.
P54: If a person is doing the tutorial as they are reading it, I think they’ll do ok with keeping up with the steps. However, it would be challenging to retrace steps or review steps quickly because steps are so embedded. It might be nice for there to be a clearer set of step-by-step instructions to revisit so that a person isn’t looking for signal phrases like “Now that you’ve finalized your list of videos, gathered the metadata and comments for each video, and optimized your data for Wordfish through filtering, you are ready to clean and wrangle the data!”
P62-63: It would be nice to get a clearer sense of which “approaches” are described here and when/why a person might choose one over another.
P87: While I understand that a user can manipulate the visualizations created in WordFish, these models don't feel especially useful to me because they are illegible. I think demonstrating some simplified visualizations might be useful and/or providing alternate formats where the visualization is more usable.
Thanks @jannajoceli and @hlang264 for your helpful feedback! @nabsiddiqui is there anything you were going to add that we should take into consideration before starting to revise?
@hawc2 I was just going to summarize what the other reviewers said, but I think you can go ahead and start revising based on these suggestions. If there is anything you think isn't particularly relevant, we can discuss that later if you don't want to put it into the revision.
hey all, just to say we've almost finished revising this lesson. I'm going to do some last tweaks and tests of the code next week, and then I'll ping you all when it should be ready for a final review.
A note to say that the lesson markdown file has been renamed, and the lesson slug adjusted accordingly:
- Preview: http://programminghistorian.github.io/ph-submissions/en/drafts/originals/text-mining-youtube-comments
- .md file: /drafts/originals/text-mining-youtube-comments.md
- images: /images/text-mining-youtube-comments
- assets: /assets/text-mining-youtube-comments
I have also:
- Renamed the /images directory (to match the new lesson slug)
- Renamed + resized the images (to a maximum of 800 pixels on the longest edge)
- Added alt-text placeholder text
- Adjusted the liquid syntax so that the image file names are correct
- Adjusted numbering of endnotes (they were not in consecutive order)
- Moved endnotes to end of document (as is our convention)
Thank you @anisa-hawes! @nabsiddiqui, I think this draft should be ready for final review by the peer reviewers.
I'm curious to hear how it goes when you test the code; it seems to be working on our end, but things can get especially wonky with the first part about accessing the API.
@nlgarlic and I are available to make further revisions and updates, and are happy to chat about any of our decisions here. We did include information on setting up an account to access the YouTube API, but we are still leaning a bunch on the Google documentation because it keeps changing.
Thanks everyone for your patience with the time it took us to update this lesson, and we look forward to next steps!
Hey @hawc2. It will likely be about another two weeks until I get to this, but I have placed a reminder in my calendar to come back to it soon.
@nabsiddiqui any updates on this lesson's timeline from here?
Hi there,
Thanks for your patience with this. I've just had a chance to review the changes, and I think the final draft is a great contribution. Thanks for your work on this--I'm excited to see how folks use it. I think this is ready to move on to the next stage.
Best wishes,
Heather
Hello @nabsiddiqui @hawc2 and everyone. Thanks for this lesson; it would be great to see it published, especially since people who, like me, can no longer easily collect Twitter data are looking for ways to collect other social media data. Thus, if I may, I would like to ask how the code could be adapted to collect data from the live chat comments sometimes available on the right side of the video (and not from the comments that appear under a video). I have not found many resources out there explaining how to do this. Thanks a lot!
@nabsiddiqui I'll meet with my co-authors this week to discuss final revisions we will make to the lesson. We are aiming to complete revisions for this lesson within the next few weeks. Please let us know if there's anything additional we should take into consideration at this time. Partly what we aim to do is make sure nothing has changed in the YouTube API that could affect the lesson now.
@spapastamkou thanks for your encouraging thoughts! Agreed that it's valuable to offer tutorials on other social media platforms less restrictive than X/Twitter. I'll chat with my co-authors about your question regarding live chat comments; I remember us looking into this at one point and thinking it seemed doable but outside the scope of this lesson. We could at least add some brief info about that in the lesson to point people in the right direction.
thanks @hawc2 !
I looked through the wording, etc. and I think we can now close out this portion of the review process. So, I think we are ready to start working on next steps @anisa-hawes and @hawc2
@nabsiddiqui @anisa-hawes just a heads up, we are making final edits on the lesson now. We've done some thinking and made a few changes to the part of the project requiring access to the Google API that I think will make the lesson much more sustainable.
Once this round of edits is done, I'll hand it over to @anisa-hawes for a final look over. @anisa-hawes I think it will be necessary for you to take on the ME role in this case, doing one last read through of the lesson for quality control, and giving us a last round of edits before it is sent to copyeditor for preparation to be published. Let's aim to publish this lesson in early 2024?
Hey Charlotte, I thought Anisa was going to do a read-through and give us feedback first? It's ok if she wants to do it after copyedits, but I just want to flag that since I am an author on this, I can't do the final review as ME, so I was hoping Anisa could provide us feedback for one last round of review. I also have a few edits I need to make, which I can make as soon as today if you are about to go into copyedits. Let me know if I can still do that or if I should wait. My recommendation in the future would be to double-check with authors that they are done editing before bringing things into copyedits, as my last communication with Anisa via Slack had not suggested this was ready for copyedits quite yet. Best, Alex
On Fri, Feb 16, 2024, 5:42 AM charlottejmc @.***> wrote:
Hello @hawc2, @jantsen, and @nlgarlic,
This lesson is now with me for copyediting. I aim to complete the work by ~Friday 08 March.
Please note that you won't have direct access to make further edits to your files during this Phase.
Any further revisions can be discussed with your editor @nabsiddiqui, after copyedits are complete.
Thank you for your understanding.
Hi @hawc2, yes, Anisa will be adding her feedback shortly!
I was a little hasty and initially posted a comment + opened a branch to start copyediting, but I quickly deleted both as I realised I wanted to wait for Anisa's comment before I began my work. I imagine the email updates from GitHub might not have reflected that on your side.
There's no rush for you to make edits just now.
My apologies for the confusion!
Hello @hawc2,
Thank you again for the opportunity to read this co-authored lesson `text-mining-youtube-comments`. I think it is excellent. I can see that you've made some substantial revisions (https://github.com/programminghistorian/ph-submissions/commit/c4d044d0103461c93497ce066fadb2e10a779d42 and https://github.com/programminghistorian/ph-submissions/commit/bea9140a7fc2bb014a4555f6e22fdd2dd5c89461) since I shared my feedback with you by email.
Mapping my initial feedback against the current draft here on GitHub, I thought it might be useful to set out my remaining suggestions as optional tasks, or ideas for you to think through and accept or reject. I'll check off the suggestions/points which I think you have already resolved.
Overall, I suggested that the lesson structure might be revised so that the table of contents is simplified. At the moment, the sub-sections are very granular and, from my point of view as a reader, this made navigation quite confusing. I think the following revisions could enable you to provide readers with a broader overview of the lesson as a whole, as well as clear signposts into specific parts. It seems to me that the high-level chapter headings could be:
Introduction #comprising your overview of the method + ethical considerations
Data Collection #comprising set up + install + getting started with the tools to download metadata and comments
Cleaning and Modeling #comprising removing stopwords, applying filters to clean your data, modeling the columns
Analysis
Visualisation
Conclusions
Endnotes
Anyway, these suggestions are grouped by section headings as they are (rather than paragraph/line number, because everything has shifted since I worked through this).
Introduction
- [x] Add direct links to sections (using the syntax `[mention of section](#section-title)`) and add definitions of technical terms
- [x] Summarise the three lesson outcomes here
YouTube and Discourse Analysis
- [x] Make this the opening of your lesson overview, rather than a stand-alone section.
- [x] Why are YouTube comments an interesting source of research data? You could consider asking this question more explicitly.
Focus on establishing three things: (1) How can this source of [YouTube] comments data be useful in research? (2) Which methods for collecting/modelling/analysing/visualising the data have been chosen in this lesson, and (3) why?
- [x] Add "initially": YouTube was initially associated…
- [x] Edit out the initial clause which compares YouTube to Twitch
- [x] Stick to the term “video-sharing platform” throughout the lesson. I think ‘sharing’ successfully encompasses the activities uploading, viewing, commenting.
- [x] Remove ‘run the gamut’ and other phrases which would be unlikely to be familiar to non-native English speakers
- [x] Remove paragraph that is specific to YouTube’s current user interface
- [ ] After “recent scholarship”, link to a selected bibliography of articles to recommend to readers
- [x] Move the sentence “YouTube video comments represent a unique body of text, or a "corpus" of discourse, describing how viewers receive and perceive politically charged messages, often from moving image media.” upwards, especially as an answer to the question (1) How can this source of data [YouTube] comments be useful in research?
- [x] Remove inverted commas from "corpus". I don’t think this needs inverted commas or quotation marks because you explain it with ‘body of text’.
- [x] Merge paragraphs 7 and 4 (as they seem to repeat each other) - (Completely understand if you are inclined to leave this as is)
- [x] Add “ future”: These comments often frame future viewers’ encounter with the video content, influencing their interpretations, and inviting participation in the discourse.
- [x] I think it is particularly interesting and critically relevant to specify that comments appended to videos can influence future viewers’ encounters. The other interesting chronological factor, is that users’ encounters can be days, months, and years apart. This means that dialogue in comments may be an immediate back-and-forth between individuals, but also can involve extended hiatus and reactivation of discussion between a different group of participants.
Learning Outcomes
- [x] Move the three-stage outline of the lesson’s structure to the beginning of the lesson
Data Collection
Ethical Considerations for Social Media Analysis
- [x] Shorten this section / or integrate it within your introduction
- [x] Pair some of the open questions with practical steps a researcher would take in their work in response
- [x] Reflect on one or two actions you took in this case, which were directly informed by the ethical questions you outline here. For example:
- [x] Did you seek content creators’ permissions and how?
- [ ] Did your research group include researchers from the communities represented by this dataset?
- [x] When pointing readers to resources, weave in a sentence or two that draws upon your experiences in this case
NOTE: ## Installing R and RStudio has been moved down below # Set Up your Coding Environment, instead of in the new section # Data Collection
Accessing YouTube Data
- [x] Reword and reorder as per the suggestion
- [x] Move “All you need is […]” downwards in the lesson so that it is part of the specific instruction on locating IDs. I’m noting here that there doesn’t appear to be a section titled Keyword Searching, so as things are currently, I wouldn’t know where to go as a reader.
- [x] Add section on Keyword Searching?
Video Selection
- [x] Reword and reorder as per the suggestion
- [x] Clarify whether ~6 videos is considered an optimal number of videos to begin with (provided they have the specified range of ~2000+ comments), or if researchers using this method might also experiment with sets of ~10 or more videos
Downloading Comments and Metadata
- [x] Make each paragraph slightly clearer to surface direct instructions
- [ ] Add a screenshot of the YouTube Data Tools interface.
- [x] Or alternatively (perhaps preferably), a brief orientation for readers. For instance, an explanation of each of the fields in the Parameters section
- [x] Answer the questions:
- [x] Can I only enter one video ID at a time, or can I make a list? (separated by commas?)
- [x] What do you recommend I enter into the Limit to: field?
- [x] Move p.35 (“For ethical purposes [...]”) up to the introductory section which covers ethical research recommendations
- [x] Briefly outline suggestions for each of the Output options
- [x] Last four paragraphs (from “You have three choices [...]”) reword and reorder as per the suggestion
- [x] Confirm whether ID numbers are always 11 characters long
Set Up your Coding Environment
Install R and RStudio
- [x] Reword as per the suggestion
- [x] Write a simple instruction for the reader to download the script and load it into RStudio Desktop instead of pointing to PH assets
- [x] Move the sentences about YouTube Data Tools to the specific section about YouTube Data Tools.
Install R Libraries
- [x] Explain whether creating a new R script will be based upon/adapted from the R script provided
- [x] List the necessary packages to install as bullet pointed items inside a grey box. (If you like this idea, @charlottejmc and I can take care of formatting it for you during typesetting)
Import Data
- [x] Reword as per the suggestion
- [x] Add colons
- [x] “Next, load in files containing video data:”
- [x] “Now, pivot this data so it is organized by row rather than column:”
- [x] “Finally, run the following code to join the video and comment data:”
- [x] Add a link to “Alternatively, if you would like to utilize our sample data, you can download it from the Github repository.”
- [ ] Share a template displaying the required column names + order (in raw Markdown) for readers who “choose to use a YouTube comment dataset downloaded with a tool other than YouTube Data Tools”. (If you let us know the column names + order, @charlottejmc and I could create this in raw Markdown for readers).
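For what it's worth, a hypothetical sketch of the import-pivot-join flow described in this section; the file names, column names, and pivot shape are all assumptions, not the lesson's actual code:

```r
# A sketch of reading, reshaping, and joining comment and video data
library(readr)
library(tidyr)
library(dplyr)

comments <- read_csv("comments_example.csv")    # one row per comment

# video metadata exported as attribute/value pairs (an assumed shape)
video_info <- read_csv("videoinfo_example.csv")
video_row  <- pivot_wider(video_info,
                          names_from = attribute,
                          values_from = value)

# join comment and video data on a shared video ID column
all_data <- left_join(comments, video_row, by = "videoId")
```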
Data Labeling
- [x] Reword as per suggestion
- [x] Add the link to “add a [partisan indicator](LINK to definition)”
Pre-processing and Cleaning
- [x] Reword as per suggestion
Remove Stopwords and Punctuation
- [x] Explain the use of the unusual stopwords words "bronstein", "derrick" and "camry". These are such unusual words, it makes me wonder why they wouldn’t provide any meaningful information? (Perhaps I am missing something?)
- [x] "Using the `stringr` package [...]" reword as per suggestion
- [x] Move code block up to follow the sentence "in the following code"
- [x] "Note you can also clean the data" reword as per suggestion
- [x] Link to the specific lesson section which contains direct instructions for using `quanteda` ("at a later stage")
- [x] Rephrase and add colon: "To export, use the `write_csv` function below:"
- [x] Add "for analysis" ("This data can now be transformed into a Wordfish-friendly format for analysis.")
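By way of illustration, the cleaning described in this section might look roughly like the following in `quanteda`; the `comment_text` vector is invented, and the custom stopwords are the corpus-specific terms discussed above:

```r
# A hedged sketch of stopword, punctuation, and custom-term removal
library(quanteda)

comment_text <- c("Bronstein said the Camry stopped.",
                  "Justice will not wait!")

toks <- tokens(comment_text, remove_punct = TRUE, remove_url = TRUE) |>
  tokens_tolower() |>
  tokens_remove(stopwords("en")) |>
  tokens_remove(c("bronstein", "derrick", "camry"))  # corpus-specific noise
```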
Wordfish
- [x] Edit out the two initial paragraphs. These are orientation paragraphs, but I think what you have is very clear if you revise/simplify the chapter headings.
Interpretation
- [x] “Secondly, Wordfish identifies which specific [...]” reworded as per suggestion
- [ ] Still need to add link to definition
- [x] "The Wordfish model lends itself to two different kinds of visualizations, one which depicts the scaling of documents, and the other the scaling of words. This lesson provides code for both visualizations [in the Visualisation section below](LINK to section)."
- [x] I think I was suggesting moving an earlier sentence to here, or merging these two paragraphs together to edit out the repetition and clarify which kinds of visualizations you are going to demonstrate. Something like:
The Wordfish model lends itself to two different kinds of visualizations, one which depicts the scaling of documents, and the other the scaling of words. The below code will create 'word level' visualizations which display how terminology is dispersed across the corpus object. Our project uses custom visualizations, drawing from Wordfish's underlying statistics and utilizing `ggplot2`. To produce the first type of visualization, run the following code and produce a plot of all unique comment words within the corpus:
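(Not the authors' code, but a hedged illustration of what such a word-level plot could look like; the toy DFM stands in for the lesson's comment data, and the plot aesthetics are assumptions:)

```r
# Illustrative word-level Wordfish plot with ggplot2
library(quanteda)
library(quanteda.textmodels)
library(ggplot2)

dfmat <- dfm(tokens(c("justice for floyd now", "support our police officers",
                      "defund the police now", "police officers protect us")))
tmod <- textmodel_wordfish(dfmat)

word_scores <- data.frame(word = tmod$features,
                          beta = tmod$beta,  # word position on the latent scale
                          psi  = tmod$psi)   # word fixed effect (frequency)

ggplot(word_scores, aes(x = beta, y = psi, label = word)) +
  geom_text(size = 3) +
  labs(x = "Estimated word position (beta)",
       y = "Word fixed effect (psi)")
```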
Latent Meaning
- [x] “Since YouTube comments are short, you may find some specific examples helpful. [Can you point to examples here?]” - HAS BEEN DELETED, SO SUGGESTIONS DON’T APPLY ANYMORE
Document Feature Matrices (DFM)
- [x] Edit out the orientation sentence which begins this paragraph. It seems to me that these paragraphs are still about Analysis.
- [x] Delete the sentence: 'These models do not take into account any information about word order', which is expressed in the previous paragraph.
- [x] “Bag-of-words modelling can be problematic [...]” and “The key differences between Wordfish scaling [...]”: reword as per suggestion
- [x] Move reference to “the code below” nearer to the actual code block
Create a Corpus in R
- [x] Use DFM only (capitalised)
- [x] At “introduced above”, link back to the section that explains partisan indicators
- [x] Stick to the term partisanship "indicator" OR "variable" - HAS CHOSEN "INDICATOR"
- [x] Clarify whether a ‘document’ = a single comment or plural comments
- [ ] Link to definitions of “stemmed” and “lemmatised”
- [x] “At an earlier stage of this lesson”, link to that particular section
- [x] Clarify “remove comments with minimal data” – Was this where we filtered out comments with fewer than 10 words?
- [x] Clarify which are "the two options" for algorithms the reader can choose. However, the algorithms work slightly differently, so you should test which works best for you and your data - and there's no harm in using both. Which are the two options you advise readers to test?
- [x] What kinds of dataset might be better suited to which option?
Select Comments
- [x] Remove two repetitive and unclear paragraphs (which appeared before this subheading)
- [x] Clarify paragraph and explicitly state the columns that should be included
- [ ] If we choose to share a template displaying the required column names + order, link back to that here
Build Corpus Object
- [x] Add colon “Execute the following code to build your corpus:”
Data Transformation
- [x] Use DFM only (capitalised)
- [x] “Next, we will tokenize [...]” Reword as per suggestion
Data Optimization
- [x] “Now, we will optimize the corpus [...]” reword as per suggestion
Verification
- [x] (From deleted section ### Verify Top 25 Words) reword as per suggestion
Build Wordfish Model
- [x] Move the sentence “The following code creates a Wordfish model based on the corpus of unique comments you have assembled” so that it directly precedes the code block
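(Again purely illustrative: a self-contained sketch of the model-building step; the tiny data frame and its `partisan` column are invented assumptions, not the lesson's data:)

```r
# A hedged sketch of fitting Wordfish and inspecting document positions
library(quanteda)
library(quanteda.textmodels)

df <- data.frame(
  text = c("justice for floyd now", "support our police officers",
           "defund the police now", "police officers protect us"),
  partisan = c("left", "right", "left", "right")
)

corp  <- corpus(df, text_field = "text")   # `partisan` becomes a docvar
dfmat <- dfm(tokens(corp, remove_punct = TRUE))

tmod <- textmodel_wordfish(dfmat)

# document-level positions (theta) alongside the partisan indicator
data.frame(doc = docnames(dfmat),
           theta = tmod$theta,
           partisan = docvars(dfmat, "partisan"))
```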
Unique Words
- [x] Add colon “produce a plot of all unique comment words within the corpus:”
- [ ] Figure 3. adjust scale + text colour in this figure
- [x] Could add a figure which zooms in on a particular area of the word arc
- [x] From “On the left, "knee" and "neck" [...] although it is risky to read too much into any single finding.” – very context-specific, link to a contemporary news article in a web archive which explains what happened to George Floyd
Removing Outliers
- [x] “We’ve circled in red the words above that stand out in the first visualization”: refer specifically to Figure 3, and add alt-text in Figure 3 to define the words that have been circled
- [x] Add colon to “this code re-runs the Wordfish model and visualizations.”
- [x] “[...] the partisan indicator described above” – link back to section
- [x] Stick to “partisan indicator”