
Lesson proposal: Interrogating a National Narrative with Recurrent Neural Networks (PH/JISC/TNA)

Open tiagosousagarcia opened this issue 3 years ago • 44 comments

The Programming Historian has received the following proposal for a lesson on 'Interrogating a National Narrative with Recurrent Neural Networks' by @ChantalMB. The proposed learning outcomes of the lesson are:

  • Understanding how general-purpose neural networks like GPT-2 can be applied to the distant study of large corpora in a way that can help guide further close readings, while also acknowledging the technical and ethical flaws that come with using large-scale language models
  • Creating a workflow for performing large-scale computational analysis that works for the individual learner, by advancing their technical knowledge of the machine learning software and hardware required to perform these kinds of tasks

In order to promote speedy publication of this important topic, we have agreed to a submission date of no later than 24/01/2022. The author(s) agree to contact the editor in advance if they need to revise the deadline.

If the lesson is not submitted by 24/01/2022, the editor will attempt to contact the author(s). If they do not receive an update, this ticket will be closed. The ticket can be reopened at a future date at the request of the author(s).

The main editorial contact for this lesson is @tiagosousagarcia.

Our dedicated Ombudsperson is Ian Milligan (http://programminghistorian.org/en/project-team). Please feel free to contact him at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudsperson will have no impact on the outcome of any peer review.

tiagosousagarcia avatar Nov 02 '21 15:11 tiagosousagarcia

@svmelton and I discussed potential editors for this article. Sarah will seek an editor from the EN team when the article arrives. I note that @ChantalMB has emailed to report a technical issue with completing the submission.

drjwbaker avatar Jan 25 '22 15:01 drjwbaker

This lesson has now been submitted and staged here https://programminghistorian.github.io/ph-submissions/en/lessons/interrogating-national-narrative-gpt and is ready for technical review prior to peer review. Many thanks to @ChantalMB for submitting on time!

drjwbaker avatar Jan 26 '22 10:01 drjwbaker

The link for the staged submission has been updated and is now here: https://programminghistorian.github.io/ph-submissions/en/drafts/originals/interrogating-national-narrative-gpt

I'm starting the technical review now and will come back with any comments, corrections, and suggestions shortly.

tiagosousagarcia avatar Feb 17 '22 10:02 tiagosousagarcia

@ChantalMB, congratulations on an excellent tutorial -- this is a challenging and exciting topic (and something very close to my current work), and you've tackled it beautifully.

Apologies for the delay in moving your article forward -- there were a few personal reasons that kept me away from this for a while.

I've made an initial technical review (757f056), making small changes as I went along. A summary of these changes:

Changes made

  • l. 30 -- changed single to double quotation marks (consistency) (and in other places, silently)
  • l. 30 -- added wiki link to machine learning
  • l. 30 -- added wiki link to artificial intelligence
  • l. 30 -- ie -> i.e.
  • l. 32 -- changed single to double quotation marks (consistency)
  • l. 36 -- "As the first half implies, " -> deleted (clarity)
  • l. 36 -- changed single to double quotation marks (consistency)
  • l. 38 -- "official" -> "officially"
  • l. 60 -- "ex" -> "e.g."
  • l. 64 -- added wiki link to multi-core processor
  • l. 68 -- "propriety" -> "proprietary"
  • l. 85 -- added link to conda environment
  • l. 183 -- "outline" -> "outlines"
  • l. 319 -- capitalised and italicised PH

Additionally, I would suggest you consider the following before we send your article for peer review:

Suggestions

  • l. 42 -- add a local copy of the dataset to PH (can be done now, if no further changes to the dataset)
  • l. 42 -- add references to other PH tutorials on relevant methods (i.e., web-scraping, data cleaning, for example)
  • l. 44 -- explain what is meant by 'prefix functionality'
  • l. 46 -- requirements should probably appear earlier in the lesson (after overview?)
  • l. 52 -- "you may use an online service that offers cloud-based GPU computing" -- add a note to explain that a few examples will be discussed in more detail in a section below
  • ll. 85-95 -- should the virtual environment have a more descriptive name than "gpt2"? I.e., users who have dabbled before (or may wish to dabble again) would benefit from a clearer label
  • l. 155 -- add link to more documentation on GPT-2's different options
  • ll. 185-188 -- add a few notes explaining why you chose those specific default values, more detail on units (i.e., are they all in steps?), and more detail on 'learning_rate' (for reference, a rough sketch of a training call with these parameters follows this list)
  • l. 197 -- Google Colab took 37 minutes for me; I wonder whether it would be worth being a little more general here (i.e., execution times will vary) and just offering a minimum time (e.g., at least 20 min)
  • l. 258 -- link to example output text is missing
  • ll. 252-302 -- really like this section. It presents a good example of how to use a generative language model to interrogate a media narrative. A few more things that I would like to see addressed here are a rough hit rate for useful generated text (i.e., how many generations until something interesting came along) and your process for determining what makes a particular generation interesting. I understand that this is a thorny and complex issue that falls slightly outside the scope of the article, but it might be useful to acknowledge some of these complexities here. Another thing that could be addressed here (or elsewhere) is a hint at other scholarly uses for AI-generated text.
  • ll. 303-322 -- another excellent section. A couple of notes here: 1) it might be useful to point towards other text-generation AIs, some of which attempt to at least address some of the concerns around OpenAI's practices and base model (Eleuther, for example); 2) the last paragraph (l. 321) reads more like a conclusion than an ethical discussion. I would probably separate it out and expand it slightly with this in mind.
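
For reference on the hyperparameter point above, here is a rough sketch of what a fine-tuning and generation call might look like in aitextgen (the library the lesson is built on). The values, file name, and prompt below are placeholders for illustration rather than the lesson's own settings, and the parameter names are aitextgen's -- the lesson may well present them differently:

```python
from aitextgen import aitextgen

# Load the smallest GPT-2 model (124M parameters); to_gpu moves it onto a CUDA device if one is available
ai = aitextgen(tf_gpt2="124M", to_gpu=True)

# Fine-tune on a single plain-text file -- all values here are illustrative, not the lesson's defaults
ai.train(
    "articles.txt",       # one concatenated .txt file of training data
    num_steps=3000,       # total number of training steps
    batch_size=1,         # larger batches need more GPU memory
    learning_rate=1e-3,   # size of each gradient-descent update
    generate_every=1000,  # print sample output every N steps to monitor progress
    save_every=1000,      # checkpoint the model every N steps
)

# Generate text conditioned on a prompt once training is done (the prompt here is hypothetical)
ai.generate(n=5, prompt="YOUR PROMPT HERE", max_length=256, temperature=0.7)
```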

Thank you for all your hard work!

tiagosousagarcia avatar Feb 17 '22 12:02 tiagosousagarcia

@ChantalMB Would you be able to get these small changes done in the next couple of weeks? If so, that'll give us some time to assign an editor (pinging @svmelton) before getting it out to peer review.

drjwbaker avatar Feb 17 '22 16:02 drjwbaker

Thanks, @drjwbaker! @jrladd will serve as an editor for this piece.

svmelton avatar Feb 17 '22 19:02 svmelton

Thanks @tiagosousagarcia for the initial technical review! @drjwbaker I should be able to fully review + get these changes done over the course of next week!

ChantalMB avatar Feb 18 '22 05:02 ChantalMB

(I am co-assigning myself here so that I can shadow the editorial process).

anisa-hawes avatar Feb 18 '22 11:02 anisa-hawes

@jrladd so pleased to have you as editor on this. Note that this article is part of a special series for which we have funding. As a result, @tiagosousagarcia and I will be offering additional support. For example, in addition to doing the technical edit, Tiago has identified potential peer reviewers. So do write to us https://programminghistorian.org/en/project-team when you are ready!

drjwbaker avatar Feb 18 '22 11:02 drjwbaker

@tiagosousagarcia (or anyone who may have the answer) In adding references to other PH tutorials, I've discovered that the "Intro to Beautiful Soup" tutorial has been retired. Is it acceptable to link to a retired tutorial? This is actually the tutorial I used to learn web scraping with Python, so it is the most accurate in terms of what I'm mentioning in my tutorial!

ChantalMB avatar Mar 09 '22 02:03 ChantalMB

@ChantalMB -- in principle, I would avoid linking to retired tutorials. However, taking into account two things: 1) the lesson was retired because the example used was no longer available, rather than because the technology itself has been superseded, and 2) I don't think (I may be wrong) we have another entry-level Beautiful Soup tutorial available, it might be ok in this case (as long as the link is contextualised). I may be very wrong though -- @drjwbaker, @jrladd, do we have a strict policy about this?

tiagosousagarcia avatar Mar 09 '22 15:03 tiagosousagarcia

If we are sending people to a retired lesson to do it, we shouldn't do that. If it is merely to reference a point there or to understand a principle, it is fine. Retired articles are retired because they no longer work (and can no longer be made to work) rather than because there is anything terrible about what they were trying to achieve when active.

drjwbaker avatar Mar 09 '22 16:03 drjwbaker

Hi @ChantalMB -- I wonder if we could get a sense of when you expect to have these revisions ready?

tiagosousagarcia avatar Mar 14 '22 16:03 tiagosousagarcia

@tiagosousagarcia Revised tutorial is ready now, actually! Just wrapped up everything today -- apologies for the delay, got knocked out by a cold for a week!

I also never got notified by GitHub that you and @drjwbaker had responded to my question, so thank you! I ended up linking the Beautiful Soup tutorial, but then also added a link to the "Automated Downloading with Wget" tutorial as an alternative resource for specific instruction since wget can also be used to download web pages.

I did quite a large edit re: the suggestions for lines 185-188 because to expand on learning rate meant that I also had to explain gradient descent, so now my tutorial includes diagrams. Should I be sending the revised article + attached images by email or attached to a comment in this ticket?

Similarly, you stated that a local copy of my data could be made; if possible, I'd like to do that for the training data and also the output text (the missing link at line 258).

Thanks for your help in advance!

ChantalMB avatar Mar 15 '22 00:03 ChantalMB

Great news @ChantalMB! Hope you are feeling better now. The easiest option is probably to send me the corrected md + additional files via email; I'll add them here and link the commit to the discussion. You can also do the changes via a pull request, but you would still need to send me the additional files separately.

tiagosousagarcia avatar Mar 15 '22 08:03 tiagosousagarcia

@tiagosousagarcia Just sent everything your way via email!

ChantalMB avatar Mar 15 '22 17:03 ChantalMB

@programminghistorian/technical-team or @anisa-hawes, I wonder if anyone could give me a hand here: since the location of the drafts has been changed, it seems that the link for local datasets is broken in the preview -- is there any trick to referencing it that I'm missing? Currently I have it as /assets/[LESSON-SLUG]/[FILE-NAME].EXT -- I know other lessons also suffer from this problem (at least #416)

tiagosousagarcia avatar Mar 16 '22 09:03 tiagosousagarcia

Hello @tiagosousagarcia. Hmmm. This is strange... Let me take a look... When we made changes to the directories where the lesson .md files are saved, we didn't make any changes to the images or assets directories. The URL format should indeed be:

/assets/lesson-slug-here/asset-file-name.ext

anisa-hawes avatar Mar 16 '22 15:03 anisa-hawes

Ah, so when the lesson is moved over to Jekyll for publication, we update any /assets or other internal links so that they are 'relative' links. Until then, we need to use full links, i.e., https://github.com/programminghistorian/ph-submissions/blob/gh-pages/assets/interrogating-national-narrative-gpt/articles.txt.

anisa-hawes avatar Mar 16 '22 16:03 anisa-hawes

Thanks so much for these revisions, @ChantalMB. Just a quick note to officially introduce myself--I'm glad to be working with you all on this! We've reached out to potential reviewers, and once we know more I'll be back in touch about next steps.

jrladd avatar Mar 18 '22 18:03 jrladd

Thanks very much for your review, @lorellav! Just so you know, @ChantalMB, our typical policy is to wait for both reviews before starting revisions. So you can check out the review (#473) for now, and I'll be back in touch once there's a second review in.

jrladd avatar Mar 29 '22 11:03 jrladd

Hi @lorellav and @ChantalMB! Just to keep everything in one place, I'm copying @lorellav's review here. I'm also closing the other issue, and once we're ready to discuss this and the next review, that can all happen on the same thread.

Thanks for the opportunity to review this tutorial, which I found topical and well-explained. Please find here below my comments and very few suggestions for improvement.

The author addresses a consistent model reader throughout the lesson, who ranges from absolute beginner to advanced beginner, not just in using GPT-2 but in coding with Python and using the command line. While this is certainly helpful, I think in reality it is unlikely that absolute beginners would be able to use this tutorial, for example being able to create a virtual environment or comfortably manage the different parameters. My suggestion would be to choose a more intermediate model user, to avoid frustrating beginners and absolute beginners and, in general, to keep the tutorial more honest.
The author does a good job explaining concepts and terms; however, a few notions are taken for granted (e.g., language models) and are not really explained. Although many people may be more or less familiar with the notion of what a language model is, I believe many of the assumptions that go into the making of these models often remain unclear, and are in fact overshadowed by the idea that because these models are truly huge, they must be very representative and therefore very reliable. I appreciated the ethical considerations section, which indeed highlights the limitations of this methodology and these models, including the biases baked into them. I think, however, that the section would greatly benefit from including some further details about the implications, particularly for historians or researchers working with historical texts/humanities. For example, what were the original intentions behind the creation of these models? What are the risks of using them for humanistic enquiry?

The data section could be expanded a bit more, particularly because the author hints at the possibility of using one's own data by following the tutorial. I did use the tutorial with my own data, and here's what I found:

  • As it is, the model expects one .txt file. This means that those who have data in a different format need to prepare their data to fit this format. This may not necessarily be a problem for a Python user who has worked with language data before, but this consideration is somewhat assumed. My suggestion is to include a few lines explaining this clearly, and to point to resources (even in PH) that could help users get their data ready as the model expects it.
  • I used both Google Colab and Jupyter, as both options are given to the user. Mounting a drive and accessing data and files in Colab is not super intuitive and requires the user to find additional/external resources, including tutorials and forums, to work out how to do it. Perhaps a word of warning could be included here.
  • In Jupyter, I found the process of setting up the environment and downloading the packages very lengthy; again, a word of warning might help the user allocate enough time to run the tutorial.
  • I tried to tweak the parameters to achieve better performance. When changing n_steps and batch_size, I ran into an index problem. This is not mentioned in the tutorial at all, as the possibility of changing the parameters is described as unproblematic. So perhaps mentioning that such changes may result in errors, and explaining why and how to solve them (like the memory problem), could help.
  • FYI, in Jupyter, even after installing all the packages successfully, I got the error: CUDA is not installed.

In terms of payoff, the tutorial uses GPT-2 for language production, but it could be helpful to at least mention the other tasks it is used for (for instance translation; and since GPT-2 is essentially trained on English, this is a major problem of digital language injustice). Regarding the logical sequence of steps, I think mentioning earlier in the tutorial that GPT-Neo can also be an option would be beneficial to the user, perhaps with a reference to the ethical section.

Finally, in reference to the environmental impact of training models, it could be helpful to refer the reader to recent work done towards reducing it (see for instance https://aclanthology.org/2021.findings-acl.74/), because currently the impression is that such a big cost for the environment is inevitable.

Thanks again for your important contribution and I hope you'll find my suggestions helpful.

jrladd avatar Mar 31 '22 18:03 jrladd

@kmcdono2 agreed to be the second reviewer for this tutorial; we can expect her comments by the 13th of May. Thank you everyone!

tiagosousagarcia avatar Apr 11 '22 06:04 tiagosousagarcia

Hello everyone! Looking forward to reviewing the lesson shortly!

kmcdono2 avatar Apr 11 '22 16:04 kmcdono2

Quick update from me: should have this done Tuesday, May 17!

kmcdono2 avatar May 15 '22 19:05 kmcdono2

Thank you @kmcdono2, and thank you for updating us!

tiagosousagarcia avatar May 16 '22 07:05 tiagosousagarcia

Hi @ChantalMB, @lorellav, and everyone! Just a quick note that I'll be away from email/internet until June 14, but you'll all be in very good hands until then. Feel free to continue to leave comments/questions here in the meantime.

jrladd avatar May 20 '22 13:05 jrladd

Hi all - apologies for the delay. Here is my review!

Audience

Does the author address a consistent model reader throughout the lesson?

  • Yes

Are some concepts or steps over-explained while others are under-explained?

  • Worth explaining the language bias of the model (e.g. the Bender rule) at the top of the lesson, not just in para 79.
  • Explain more concretely in para 9 what it means to ask the model questions. E.g. input is X, output is Y.
  • Fix this sentence in para 9: "By doing this, we can use this “computationally creative” technique can be used to uncover potential trends in the media coverage of this historical turning point, and consider further how this may be applied to other forms of historical research based on large-scale text-based data."
  • Fix this sentence in para 38: "To calculate the minimum number of steps your model should have divided the number of tokens in your dataset by 1024" (I assume it should be "divide" and not "divided"; a sketch of that calculation, as I read it, follows this list.)
  • Fix verb in this sentence in para 75 - "As exemplified in this analysis, when studying the generated text, seek points of repetition among responses to the same prompt– why might these be the points which the model had clung to?"
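
Since the para 38 formula comes up here, this is a minimal sketch of that minimum-steps calculation as I read it, assuming the Hugging Face GPT-2 tokenizer (the file name is a placeholder):

```python
from transformers import GPT2Tokenizer

# Count GPT-2 tokens in the training file, then divide by the model's
# 1024-token context window to estimate a minimum number of training steps
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

with open("articles.txt", encoding="utf-8") as f:
    text = f.read()

num_tokens = len(tokenizer.encode(text))  # expect a warning that the text exceeds 1024 tokens; that's fine for counting
min_steps = num_tokens // 1024
print(f"{num_tokens} tokens -> at least {min_steps} training steps")
```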

Does the audience seem to match at least vaguely with other Programming Historian lessons? How is it new?

  • Yes

Getting Ready

What software / programming languages are required?

  • Python/Jupyter notebooks.

  • Para 18: Worth explaining to people that if they already have conda/Anaconda, they don't also need to install miniconda.

  • Worth also reminding people to update these packages if it has been a while since they used them.

What prerequisite skills are needed?

  • Basic python, notebook skills, understanding virtual environments.

What familiarity or experience is needed?

What data are needed? Is the dataset readily available?

  • Text data is available via GitHub as a .txt file.

Skimmability

Are there clearly defined learning objectives or sets of skills to be learned listed near the top of the lesson?

  • I suggest breaking down the first paragraph in the Overview into more digestible text accompanied by a list of basic learning objectives. Some of the sentences are very dense!
  • I would say that a basic objective of this lesson is building trust in predicted results among historians. This is not to be underestimated! See my comments below about needing to frame this a bit more, to help the lesson user along in being convinced that generated text is indeed useful.

Are there useful secondary skills to be gained / practiced from the lesson?

  • Using virtual environments, using jupyter notebooks, using kaggle/colab

Do screenshots and other diagrams illustrate crucial steps / points of the lesson?

  • Screenshots/instructions for how to determine if the code can be run locally would be helpful for people new to working with large language models (circa para 14).
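
One minimal way to do that check, assuming PyTorch is already installed (aitextgen depends on it), could be something like:

```python
import torch

# Report whether PyTorch can see a CUDA-capable GPU and how much memory it has;
# if the fallback message prints, local fine-tuning will run on the (much slower) CPU
if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    memory_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU available: {name} ({memory_gb:.1f} GB)")
else:
    print("No CUDA GPU detected; consider Colab, Kaggle, or another cloud GPU service")
```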

Do sections and section headings provide clear signage to the reader?

  • Add subheading for the Gradient Descent Algorithm explanation (beginning para 42)

Payoff

Does the tutorial suggest why the explained tools or techniques are useful in a general way?

  • It would be a good idea to introduce aitextgen: e.g. who made it, why, and what are alternatives. Basically, tell people about the tool that is the basis of the lesson a bit more. At the very least, include a link to the library (https://github.com/minimaxir/aitextgen).
  • More generally, I think there is scope in the beginning of the lesson to define "computationally creative" and provide some references to work that uses methods like these. E.g. at least cite https://computationalcreativity.net/home/ to point people to what the phrase is describing.
  • Furthermore, in the "Using Generated Text as an Interrogative Tool" section, I think there needs to be a subsection that sets up why generated text is an ethical/meaningful part of a historian's toolkit. In this section there are a lot of fascinating references to how we can examine the data, but it assumes the reader is on board with the idea that "for historical inquiry [it] is not necessarily just how coherent the outputted text is, but its ability to create patterns from the inputted text that we as humans may be incapable of detecting...". So effectively, I'm asking the authors to prepare the lesson user to get on board with the idea that what the model produces is useful for historical research. (I'm not arguing that it isn't, just that it would be really useful for this not to be taken as a given.) While there is some discussion of this embedded in para 84, I feel this should be foregrounded and more extensive. The goal of para 84 is to get the reader to think about the perils of the potential for abuse of GPT-2, but there is a broader problem of utility when historians tend to be interested in things people actually said.
  • Finally, there are parts of the same section ("Using Generated Text as an Interrogative Tool") that ask the reader to be on board with the idea that the machine is being "convincing", "insightful", etc. I would suggest that the authors explain why they are using verbs like this and what the implications are in terms of how we rely on these results to think about a collection of text.

Does the tutorial suggest how a reader could apply the concepts (if not concrete steps) of the lesson to their own work?

  • Yes, though there could be more detail on the best way to prepare data. Para 8 mentions "it is important to perform some amount of cleaning so that your output is not hindered by stray stylings", and the earlier details in the para point to PH lessons for such tasks, but there isn't actually a clear statement about removing any extraneous tags, html, etc. in structured data. I would be as clear as possible about this and clarify that there should be 1 input file that concatenates text from many documents, not 1 directory with 1 "document" per file.
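
To make the "one input file" point concrete, something along these lines would do, assuming a folder of plain-text documents (the paths and file names are placeholders):

```python
from pathlib import Path

# Combine every .txt file in a folder into the single training file the model expects,
# separating documents with blank lines; adjust the folder name and glob pattern to your own data
source_dir = Path("my_documents")
texts = [p.read_text(encoding="utf-8") for p in sorted(source_dir.glob("*.txt"))]

Path("training_data.txt").write_text("\n\n".join(texts), encoding="utf-8")
```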

Workflow

Should a long lesson be divided into smaller lessons?

  • No, the length is fine.

Are there logical stopping points throughout the lesson?

  • Not really, because the actual code is quite short. It's fine as is.

If datasets are required, are they available to download at various points throughout the lesson (or different versions of them as the tutorial may require)?

  • Yes

Sustainability

Are all software versions and dependencies listed in the submission? Are these assets the most recent versions? If the lesson uses older software versions, does the author note why?

  • No, this could be improved. In particular, how sustainable is https://github.com/minimaxir/aitextgen?

If you have expertise in the specific methodology or tool(s) for the lesson, is the methodology generally up-to-date?

  • Perhaps worth engaging with/citing https://www.amacad.org/publication/non-human-words-gpt-3-philosophical-laboratory
  • The discussion of language models and ethics at the end is great.
  • See comment about adding reference in para 85.

What are the data sources for the submission? Are they included in a way that does not rely heavily on third-party hosting?

  • Input data can continue to be made available on GitHub (22 MB file).

What kinds of other external links does the submission use?

  • Wikidata for definitions, or other explanatory articles such as content from Towards Data Science

Are these current or are there other, more recent or appropriate, resources that could be linked to?

  • Add links/references to "computational creativity" community/scholarship. It's a bit weird that this phrase is in quotations, but there is nothing in the lesson to situate it in the scholarship.
  • Add references to scholarship on semantic change and issues with using modern language models on historical text (para 85)

Integration

Does the lesson build upon an existing lesson and explain how?

  • No, but I imagine it might tie into other JISC lessons that are currently in the review pipeline.

Does the lesson tie into existing lessons and have appropriate links?

  • Yes: the Jupyter Notebook, Python Intro & Install, Regex, Wget

kmcdono2 avatar May 27 '22 11:05 kmcdono2

Many thanks for the review @kmcdono2!

@ChantalMB, as you will have seen above, @jrladd will be off grid for a couple of weeks. Our policy is for the managing editor to summarise and guide the author on what the next steps will be, so I would hold off on any major changes until @jrladd is back and has had a chance to digest the reviews.

However, in the meantime, there are a few line-level edits suggested by both reviewers that you could have a look at. Thanks again to all!

tiagosousagarcia avatar May 30 '22 16:05 tiagosousagarcia

Thanks for the ping @tiagosousagarcia, will work through those small edits in the meantime. And thanks to all the reviewers for your thoughtful feedback; it will be immensely helpful in the process of improving and finalizing this tutorial 🎉

ChantalMB avatar May 31 '22 17:05 ChantalMB