Lesson proposal: Clustering and Visualising Documents using Word Embeddings (PH/JISC/TNA)
The Programming Historian has received the following proposal for a lesson on 'Clustering and Visualising Documents using Word Embeddings' by @jreades and @jenniewilliams. The proposed learning outcomes of the lesson are:
- The ability to generate word embeddings from a large corpus.
- The ability to use dimensionality reduction and clustering techniques for visualisation and analysis purposes.
- The ability to use these steps to find and explore groups of similar documents within a large data set.
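For orientation, these outcomes correspond to a pipeline along the following lines. This is a minimal sketch only, assuming the gensim, umap-learn, and scikit-learn libraries; the toy corpus, parameter values, and variable names are illustrative and not taken from the lesson itself.

```python
# Sketch only: embeddings -> document vectors -> reduction -> clusters.
import numpy as np
from gensim.models import Word2Vec
from umap import UMAP
from sklearn.cluster import AgglomerativeClustering

docs = [
    ["medieval", "trade", "history"],
    ["economic", "history", "of", "medieval", "europe"],
    ["history", "of", "trade", "routes"],
    ["social", "history", "of", "london"],
    ["medieval", "european", "economy"],
    ["particle", "physics", "experiment"],
    ["quantum", "physics", "thesis"],
    ["experimental", "particle", "detector"],
    ["quantum", "field", "theory"],
    ["physics", "of", "semiconductors"],
]  # ten tiny stand-in 'documents'

# 1. Train word embeddings on the (toy) corpus
model = Word2Vec(sentences=docs, vector_size=50, min_count=1, seed=42)

# 2. Represent each document as the mean of its word vectors
doc_vecs = np.array([model.wv[d].mean(axis=0) for d in docs])

# 3. Reduce the 50 dimensions to 2 for visualisation and analysis
coords = UMAP(n_components=2, n_neighbors=5, random_state=42).fit_transform(doc_vecs)

# 4. Find groups of similar documents
labels = AgglomerativeClustering(n_clusters=2).fit_predict(coords)
print(labels)
```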
In order to promote speedy publication of this lesson on an important topic, we have agreed to a submission date of no later than April 2022. The author(s) agree to contact the editor in advance if they need to revise the deadline.
If the lesson is not submitted by April 2022, the editor will attempt to contact the author(s). If they do not receive an update, this ticket will be closed. The ticket can be reopened at a future date at the request of the author(s).
The main editorial contact for this lesson is @tiagosousagarcia.
Our dedicated ombudsperson is Ian Milligan (http://programminghistorian.org/en/project-team). Please feel free to contact him at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudsperson will have no impact on the outcome of any peer review.
@hawc2 has offered to edit this piece.
Hi @jreades and @jenniewilliams, I look forward to reading your submission. Please let me know if you have any questions in the meantime. Feel free to email me or post questions on this ticket.
Hi sorry -- between strikes, childcare, and general... aaaaaaaargh... I'm behind where I'd hoped to be with this! I do have a perfectly serviceable draft of the core explanatory part (what are Word Embeddings, etc.) and I have separate code that I've already used in other analysis of the same data so I know the path to completion...
However, is it helpful for me to share this early draft, or would you prefer to see only a full submission? Having open review creates more opportunities to shape the work as it develops rather than afterwards, but it could also be confusing/unhelpful. Let me know!
If helpful then I can: 1) share access to the GitHub repo where I'm writing the draft (so that we don't pollute this 'timeline'); 2) attach a draft to this thread; 3) submit a draft but recognise that it will need to be versioned later (via a pull request or similar).
Best,
Jon
If you'd prefer me to look at an early draft and give you feedback over email, I can, but otherwise I'd say let's go ahead and get a rough draft uploaded as a markdown file to the PH-Submissions repo with previews working; I can give you a first round of feedback in this ticket thread, and then you can finalize for submission to peer review. Does that sound good?
It’s ok, I’ll have a go at finishing a proper draft before sending anything over — think I was feeling guilty that I’d been radio-silent for so long and wanted to have a “See, I have done work on this!” moment. ;-)
Jon
I assume the first draft is submitted as an attachment to this issue... so here goes!
The article is a README from our private repo (will make a public version prior to publication): README.md
Images are here:
@jreades, I'll try to get the lesson set up, and I'll email you with more specific questions/issues with the files. More soon!
@hawc2 -- I can set up the lesson later today, if you haven't had a chance
This was my bad — in discussing the submission with Alex it became obvious to me that I’d deviated a long way from a format that worked as a standalone tutorial. I’ve just this morning sent a substantially rewritten version that I hope will work a lot better: you can copy+paste the code ‘as is’ from the Markdown document to create a new notebook, but I can also supply a standalone notebook that is ready to run as well. We’ve discussed whether or not to split the tutorial at the point where there is a shift from word embeddings to dimensionality reduction, but until Alex has had a chance to have a look it’s TBD.
Apologies again for making this such a protracted, difficult process — as I said to Alex: I’d not realised the extent to which my approach to writing has been deeply reshaped by the academic article format.
Jon
no worries @jreades, it's all part of the process. I hadn't seen any development here, which is why I was asking if there was something I could do -- but if @hawc2 has the matter in hand, then we are all good (though the offer to help if needed still stands)
It's looking good now, here is a preview link to the lesson: https://programminghistorian.github.io/ph-submissions/en/drafts/originals/clustering-visualizing-word-embeddings
@jreades can you let me know if you see anything basic in the markdown rendering that might be incorrect?
I'll follow up with some preliminary feedback on the lesson itself in the coming week. Once you finish that round of edits, I'll work on sending this out for peer review.
Thanks also @jreades for putting together a GitHub repo that will be linked in the lesson. The repo will include the Python code in a Jupyter notebook, runnable in Google Colab for testing purposes.
Definitely some rendering issues (around some of the maths especially) and a few typos that I’ve just spotted now (naturally).
If you can give me editing access I’ll get this tidied up today.
Jon
I'm giving you write access now. Can you also post the GitHub repo and Colab notebook here for reference?
I’ve updated the tutorial Markdown file with links to the public GitHub repo and Colab notebook, fixed the minor typos, and corrected one substantive content issue. I have committed this back to:
https://github.com/programminghistorian/ph-submissions/blob/gh-pages/en/drafts/originals/clustering-visualizing-word-embeddings.md
Jon
This is looking like a very solid first draft. My main feedback is pretty general, so I’ll hold off from giving you specific line edits, and just ask for some broad revisions before we send out for review.
My main observation is that this is quite a difficult lesson, and more work will be required to translate terminology for beginner audiences, to signpost where the lesson is going, and to onboard the reader to each phase of the methodology. It will be helpful for you to do some basic revisions in this direction before I send it out to reviewers, so they don’t need to worry as much about how this lesson caters to its audience.
My only other concern is that this lesson is very long. Lessons usually don’t go over 8,000 words. I’d rather not see it bulge into a two part lesson, although that is a possible solution. For now, I’d encourage you to focus on the difficult task of editing this draft for both clarity and length, making it ideally more concise and more concrete at the same time.
As an example of clarifying your language for introductory steps, in your first Learning Outcome you say: “we use a selection of nearly 50,000 records relating to U.K. PhD completions.” Right off the bat, you should use language that more clearly identifies what kind of data your tutorial works with. What kind of records are these? As an American, I’m not sure what “records relating to U.K. PhD completions” would look like, nor why someone would do word embedding analysis on this type of data. I would’ve expected “a corpus of doctoral dissertations” as the main dataset. In this vein, in Paragraph 9, where you introduce this dataset in more detail, it’s still not clear what “textual data” you will be analyzing within the “metadata” about dissertations. I have to admit that the section on the Case Study gets so technical and detailed about the metadata that I lost the main thread: what is the text you are going to model?
The part where you explain word embeddings and compare them to other text mining algorithms also requires more revision. In the learning outcomes, the tutorial jumps right into ‘dimensionality reduction’ and ‘hierarchical clustering,’ but maybe a preliminary learning outcome should be something about teaching the reader why these methods are appropriate next steps once you’ve created a word embedding model, in order to pursue a research question about the dataset. Putting it in these less technical terms will help readers understand how the algorithmic processes relate to broader scholarly work.
The subsequent paragraphs do a good job of distinguishing PCA, LDA, and TF-IDF from WEs, but they do assume that the reader knows something about what all these have in common. In these opening paragraphs, try to find more ways to spell this out, in terms of approaches like predictive modeling and latent meaning. For example, this clause doesn’t really clarify what TF-IDF is, so its comparison with WEs remains a bit vague: “The benefit of simple frequency-based analyses such as TF/IDF is that they are readily intelligible and fairly easy to calculate...” What seems essential to highlight here is the type of meaning WEs give us insight into that the other approaches overlook. There’s some explanation in the Word Embedding section (beginning Paragraph 39) that helpfully explains why dimensionality reduction is necessary; a brief version of this could be included early on in the tutorial to explain why the tutorial leads the reader through this specific series of steps. Similarly, under Prerequisites, you explain how this lesson differs from the Scikit-Learn Clustering lesson, but you don’t really explain first what the two lessons have in common. A lot of these comparison examples are useful for clarifying what your lesson on word embeddings does, but ideally they’d all occur in one section, and focus mostly on clarifying what word embedding analysis can show about the text.
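To make the contrast concrete for readers of this thread, the frequency-based baseline being quoted looks roughly like this in code. This is a hedged sketch using scikit-learn's TfidfVectorizer on a made-up two-document corpus, not a snippet from the lesson; the point is that TF-IDF only sees literal token overlap.

```python
# TF-IDF: readily intelligible and easy to calculate, but purely lexical.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["a thesis on medieval history",
          "a dissertation about particle physics"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# One column per surface token: 'thesis' and 'dissertation' share nothing,
# which is exactly the latent similarity that word embeddings can recover.
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```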
In this context, the Word Embedding section, in particular paragraphs 40-44, jumps very quickly from the mathematical to the semantic. Could you spend more time here explaining the analogical nature of word embedding models and vector relationships?
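For reference, the analogical property in question can be demonstrated in a few lines. This sketch assumes gensim's downloader and its pretrained 'glove-wiki-gigaword-50' vectors rather than the lesson's own model:

```python
# The classic vector-arithmetic analogy: king - man + woman ≈ queen.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")  # small pretrained GloVe model

# Nearest neighbour of (king - man + woman); the expected top hit is 'queen'.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```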
The Sample Output section similarly jumps right into the weeds. Could you have a little more introductory info here about the outputs, and how this is a useful sample for elucidating some key points?
A couple of your Tables take up a lot of real estate. Could they be condensed? Table 3, for example, could be just one row. Table 5 is also long.
Paragraph 57 - I agree this is a good break point. You can remove the signpost you put here for review. I think the next section on Words to Documents is very useful, and could be contextualized a bit in terms of Word2Vec and Doc2Vec, or how this method differs from those. I see why it’s useful to get into Manifold Learning, but if t-SNE/UMAP is the main point, you should get to that sooner. I kinda got lost in this section. Generally I think you go into too much behind-the-scenes background detail about alternative options, and not enough info on the specific thing you are teaching. Try to offload some of the secondary comparisons with other methods to footnotes.
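For what it's worth, the Word2Vec/Doc2Vec contrast can be stated very compactly. The sketch below uses gensim with invented toy data; the averaged-vector route is my reading of the lesson's approach rather than its actual code.

```python
# Two routes from word embeddings to document vectors.
import numpy as np
from gensim.models import Word2Vec, Doc2Vec
from gensim.models.doc2vec import TaggedDocument

docs = [["medieval", "trade", "history"],
        ["particle", "physics", "thesis"]]

# Route 1 (Word2Vec): average each document's word vectors after training.
w2v = Word2Vec(sentences=docs, vector_size=50, min_count=1)
avg_vecs = np.array([w2v.wv[d].mean(axis=0) for d in docs])

# Route 2 (Doc2Vec): learn one vector per document during training itself.
tagged = [TaggedDocument(words=d, tags=[i]) for i, d in enumerate(docs)]
d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=20)
doc_vecs = np.array([d2v.dv[i] for i in range(len(docs))])
```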
The Visualization section seems like a good place to conclude. Right now that Figure isn’t rendering in Markdown. But the Visualizing and Clustering sections ideally would have been foreshadowed earlier in the lesson. The current versions of these sections could be condensed to focus on concluding the lesson with some first steps in these visualization directions. What about this section is really essential to this lesson? Are all the validation and related steps necessary, or could they be included as supplemental material in your GitHub repo and Colab notebook for more advanced users? There’s a bunch, like the Confusion Matrix, that just seems so dense and complicated that you’d have to do a lot more work to justify its inclusion in a proof-of-concept word embedding methodology. Since that would take up more space, I’m inclined to think a bunch of it can be removed.
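For context, the validation device under discussion amounts to cross-tabulating known categories against cluster assignments. A toy scikit-learn sketch with invented labels, just to show what the step is doing:

```python
# A confusion matrix compares known labels with cluster assignments.
from sklearn.metrics import confusion_matrix

known = [0, 0, 0, 1, 1, 2]     # e.g. recorded subject areas (invented)
clusters = [0, 0, 1, 1, 1, 2]  # e.g. cluster labels (invented)
print(confusion_matrix(known, clusters))
```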
If you can take a shot at edits along these lines in the next couple of weeks, then after one round of revision it’ll be ready to send out for review. Let me know if you have any questions.
I've nearly finished -- I just need to review the final bits of analysis in light of the edits above, but I have been able to prune the tutorial down to about 9,800 words. I've fixed issues with maths rendering (GitHub doesn't actually render this directly in Markdown) and generally tried to tidy up.
Done. I've gone the whole way through and yanked as much as I think we can while preserving the overall intention of the submission. I'm sure there's more that could be done but I'm not able to see it at this point. The commit is in. The only thing I wasn't sure about is the images: I can see that they eventually go into an images/<tutorial_name>/ folder, but figured you'd want to do this yourselves.
Let me know if you need anything else or have any further comments/ideas before sending out for review. As you can see, your initial comments prompted a major rethink and I hope you'll think we've done a good job acting on them.
Quick note to self: clarify that Euclidean distance works well with UMAP in this case because the abstracts don't vary enormously in length; this means that the magnitude of the averaged document vector isn't an issue. Cosine would probably be a better choice where there was significant variation in the length of the documents.
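To pin that down with a toy example (mine, not the lesson's): two vectors pointing the same way but differing in magnitude are far apart in Euclidean terms yet identical in cosine terms, which is why magnitude differences arising from varying document lengths would matter.

```python
# Direction vs magnitude: why cosine is safer when document lengths vary.
import numpy as np
from scipy.spatial.distance import cosine, euclidean

a = np.array([1.0, 2.0, 3.0])
b = 2 * a  # same direction, twice the magnitude

print(euclidean(a, b))  # ~3.74: penalises the magnitude difference
print(cosine(a, b))     # 0.0: direction only, magnitude ignored
# In umap-learn this is the metric parameter:
# UMAP(metric="euclidean") vs UMAP(metric="cosine").
```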
I've now fixed this. This is ready for a review... I hope!
@jreades regarding the images, let's make sure those are all rendering correctly. I put them all in the directory: https://github.com/programminghistorian/ph-submissions/tree/gh-pages/images/clustering-visualizing-word-embeddings
Can you make sure your markdown file has each embedded in the appropriate place with alt-text? You can see information on naming the image files and inserting them into the markdown here: https://programminghistorian.org/en/author-guidelines
Once the lesson is rendering correctly, I'll do one last skim and send it out for peer review. Thanks so much for your thorough edits
Done. <fingers crossed I’ve done it right>
So images don't need the whole directory link, just the name of the image file. You should be able to look at this preview to know when everything looks right: https://programminghistorian.github.io/ph-submissions/en/drafts/originals/clustering-visualizing-word-embeddings
I edited the final image in your lesson to show you what it should look like; the last image now renders correctly. I'd rather you finalize the rest since you know how it should look. Once you think it looks right, I'll send this out for review. I don't have any other immediate feedback, but after we get reviewer feedback, I'll synthesize it and add any remaining thoughts I have for further revision.
I’m definitely missing something here: I’ve removed the full path and followed the convention used in the other tutorials that I peeked at (image name only, no other path info) but I still can’t get the images to display even though they appear to me to be in the right place for the includes to work. I don’t know if the GitHub pages are only rebuilt intermittently or, more likely, if I’m still mucking up something in the placement of the images/code… but I’m stuck.
I’m sure the images are ‘fine’ in the sense that if you can get them working we can sort out any issues they might present during the review stage. I’m not too worried about minor look-and-feel issues since the reviewers will presumably also comment on these if they notice anything wrong.
Jon
@jreades and @hawc2, there's apparently something wrong with the preview in the submissions repo which means that relative paths don't work; we need the full path to the image for the preview to display it, according to what @anisa-hawes told me here
I'll go through the file and correct the paths, give me 5 mins
ignore what I just said, clearly not the problem, I'll revert the changes made in 86a1130
after ed3d398 the images are now displaying (to me, at least), though I'm not sure what was wrong with @jreades' earlier attempt then: https://programminghistorian.github.io/ph-submissions/en/drafts/originals/clustering-visualizing-word-embeddings
Lesson looks great to me! The site can take a bit to rebuild, @jreades, hopefully that was it. Feel free to keep tweaking, but seems to me you got it all right. I'll start the search for our peer reviewers!
Just checking there’s nothing needed on our end? I assume it’s a case of trying to find reviewers alongside the rest of the PH workload and you’ll let us know when things are ready, but just in case…
Yep, you're good. Finding reviewers may take a month or two, especially at this time of the year. You can wait until you hear back from both reviewers and I make some overarching comments; then you'll have a chance to do comprehensive revisions.