
script to deduplicate question uids

Open emilia-friedberg opened this issue 9 months ago • 0 comments

WHAT

Add a script that:

  • identifies pairs of questions with (case-insensitive) identical uids in the LMS database
  • identifies, for each pair, which question has fewer responses in the CMS
  • creates duplicates of all responses for the identified questions in the CMS, under a new uid
  • creates duplicates of the questions themselves in the LMS, with the new uid we established in the CMS
  • replaces the old question key with the new question key for any activities in the LMS
  • archives the old questions
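The first two steps above can be sketched in plain Ruby. This is only an illustration, not the code in this PR: the uids and response counts are made-up in-memory data, and `duplicate_uid_groups` and `uids_to_replace` are hypothetical helper names. The real script would read questions from the LMS and response counts from the CMS.

```ruby
# Group uids that are identical once case is ignored,
# keeping only groups where a real collision exists.
def duplicate_uid_groups(uids)
  uids.group_by(&:downcase).values.select { |group| group.size > 1 }
end

# For each colliding group, pick the uid with the fewest responses.
# That is the question that would get a new uid.
# `response_counts` is a hypothetical Hash of uid => response count.
def uids_to_replace(uids, response_counts)
  duplicate_uid_groups(uids).map do |group|
    group.min_by { |uid| response_counts.fetch(uid, 0) }
  end
end

# Made-up sample data for illustration only.
uids = ["abC123", "ABc123", "xyz789", "XYZ789", "unique1"]
response_counts = {
  "abC123" => 500, "ABc123" => 20,
  "xyz789" => 3,   "XYZ789" => 900
}

to_replace = uids_to_replace(uids, response_counts)
```

With this sample data, `to_replace` holds the smaller question from each pair, and the untouched question keeps its original uid.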

WHY

This issue was causing a confusing experience for curriculum team members, who were seeing responses for two different questions on each affected question in Grammar. It ultimately led to students getting stuck, because a curriculum team member who thought they were just cleaning up old data deleted optimal answers.

HOW

I more or less followed the steps in the RFC, with a couple of deviations, because I hadn't realized just how many responses would be impacted if we updated both questions in each duplicated pair (almost 6 million). Instead, I opted to replace only the question in each pair that has fewer responses in the database, along with its responses. Some of these are still very large, but in total this script will create just over 1 million responses.

I also decided to create duplicate responses, rather than update the existing ones, so that we can go back to the old ones more easily if something goes wrong. That does mean we'll want to go back and delete the old responses at some point after this script has been run, because otherwise, for the half of the questions we aren't replacing, curriculum team members would still see irrelevant responses (though nothing bad would happen if they deleted them). Because we create duplicates instead of updating the old responses, we also don't need to put all the questions in alpha before we start this process, so we no longer run the risk of interrupting the student experience.

Most of my time working on this was spent trying to figure out how to make the response creation not take a million years. insert_all is very useful in this case!
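For reviewers less familiar with the approach, here is a rough sketch of batched copying with `insert_all`. It is not the script itself: the `question_uid` and `id` keys, the batch size, and the `duplicate_responses` helper are assumptions for illustration. `FakeResponseModel` stands in for an ActiveRecord model (Rails 6+) so the snippet runs without a database.

```ruby
require "securerandom"

# Copy response attribute hashes onto a new question uid and write them
# in batches. `response_model` is anything that responds to
# `insert_all(rows)`; in a Rails app that would be an ActiveRecord model.
def duplicate_responses(responses, new_uid:, response_model:, batch_size: 1_000)
  responses.each_slice(batch_size) do |batch|
    rows = batch.map do |attrs|
      # Drop the primary key so the database assigns a fresh one,
      # then point the copy at the new question uid.
      attrs.reject { |key, _| key == :id }.merge(question_uid: new_uid)
    end
    response_model.insert_all(rows) # one multi-row INSERT per batch
  end
end

# Minimal fake that records each batch, for illustration only.
class FakeResponseModel
  attr_reader :calls

  def initialize
    @calls = []
  end

  def insert_all(rows)
    @calls << rows
  end
end

old_responses = (1..5).map { |i| { id: i, question_uid: "ABc123", text: "answer #{i}" } }
new_uid = SecureRandom.uuid
fake_model = FakeResponseModel.new
duplicate_responses(old_responses, new_uid: new_uid, response_model: fake_model, batch_size: 2)
```

The speedup comes from issuing one INSERT per batch instead of one per row. The tradeoff is that `insert_all` skips ActiveRecord validations and callbacks, so the copied rows need to be valid as they are.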

Screenshots

(If applicable. Also, please censor any sensitive data)

Notion Card Links

https://www.notion.so/quill/Deduplicate-Question-UIDs-in-Grammar-aa23d0b06e15447e9c77abd6ede5f71e?pvs=4

What have you done to QA this feature?

Run the script on my local database

| PR Checklist | Your Answer |
| --- | --- |
| Have you added and/or updated tests? | N/A |
| Have you deployed to Staging? | NO - saving staging testing for after this has passed code review because it's annoying, though possible, to reverse |
| Self-Review: Have you done an initial self-review of the code below on Github? | YES |

emilia-friedberg · May 09 '24 18:05