edx-platform
edx-platform copied to clipboard
feat: add feature flag to enable legacy md5 hash for anonymous user id
Description
PR #26198 changed the hash algorithm to generate an anonymous user id from MD5 to SHAKE-128.
However, there are Open edX operators who maintain backward compatibility of anonymous user IDs after past rotations of their Django secret key. For them, altering the hashing algorithm was a breaking change that made their analytics inconsistent.
This PR, introduces a feature flag, to enable Open edX operators to switch back to the legacy MD5 algorithm if they so require.
Supporting information
Link to other information about the change, such as Jira issues, GitHub issues, or Discourse discussions. Be sure to check they are publicly readable, or if not, repeat the information here.
Testing instructions
- Checkout this branch in devstack
- Browse any course.
- Open LMS shell and retrieve the anonymous user id for the user/course using
from common.djangoapps.student.models import AnonymousUserId
anonymous_user_ids = AnonymousUserId.objects.filter(user=8).filter(course_id='course-v1:edX+DemoX+Demo_Course').order_by('-id')
- Note this anonymous id down
- Delete the anonymous user id
anonymous_user_ids[0].delete()
- Open
/edx/etc/lms.yml
- Under
Features
add the new Flag
FEATURES:
ENABLE_LEGACY_MD5_HASH_FOR_ANONYMOUS_USER_ID: true
- Rerun steps 2 to 4.
- Verify the anonymous user ids are different
Deadline
"None" if there's no rush, or provide a specific date or event (and reason) if there is one.
Other information
Private Ref: BB-6533
Thanks for the pull request, @kaustavb12! Please note that it may take us up to several weeks or months to complete a review and merge your PR.
Feel free to add as much of the following information to the ticket as you can:
- supporting documentation
- Open edX discussion forum threads
- timeline information ("this must be merged by XX date", and why that is)
- partner information ("this is a course on edx.org")
- any other information that can help Product understand the context for the PR
All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here.
Please let us know once your PR is ready for our review and all tests are green.
@natabene, this is ready for your review.
@kaustavb12 Thank you for your contribution.
@kaustavb12 @Agrendalath I don't really understand the need for this change. As I understand, all past anonymous IDs are stored in a database table and won't be changed, even though new ones use a new algorithm. So why can't the operator here load the IDs from the database table for their analytics instead of depending on the details of the hash algorithm?
@bradenmacdonald, this operator is using a custom analytics system that is integrated with multiple external services. As far as I know, some of these services were using direct user IDs in the past. Therefore, the operator calculates MD5 hashes directly in their database and stores them to maintain the integrity of all existing data. There was a key rotation in the past that changed the anonymous user IDs, so they also have lots of legacy code to make these IDs backward compatible with all possible versions.
Hashes are calculated in the database, but database engines are not adapting new hashing functions too fast. For example, SHA-2 was accepted in 2002 (FIPS PUB 180-2), but it appeared in MySQL in 2010 (v5.5.5), and PostgreSQL didn't fully support its variants until 2018. As SHA-3 (FIPS PUB 202) was published in August 2015, it's not likely to see it fully supported soon. Therefore, it would be necessary to pre-calculate these hashes, instead of preparing them directly in the DB query. We don't have details about how exactly these analytics work, though, as we're not managing them, nor even have any access to the code.
@Agrendalath Thanks, though I am still still having trouble understanding why there's any need to calculate hashes within a DMBS at all. Can't there just be a table in every database/analytics system that needs one, which contains the map of real_id, anyonmous_id for all anonymous ids that have ever been used? i.e. use JOIN/lookups instead of hashing/calculations.
In other words, the root problem here is that the external systems were integrated in a way that was too highly coupled to the internal details of how Open edX calculates its anonymous IDs. So it seems like the right fix is to un-couple them from that, rather than add complexity to Open edX by supporting two hash mechanisms. However, I won't block the PR on it - I see it's only a couple lines of code in the end.
I share Braden's concerns -- this seems like a really unusual use-case, and I'd want to have a clearer picture of why this can't be solved with table lookups or a sync job before I'd be comfortable adding a toggle that enables an old hash algorithm.
(It also sounds like the analytics system might need to have a copy of all current and former SECRET_KEY
values in order to operate, which seems like it could be a security concern.)
Thanks @timmc-edx.
We are discussing this change internally as well as with our client to see if the approach can be modified based on suggestions from you and @bradenmacdonald. I'll keep you posted on the same.
@timmc-edx
Based on your and @bradenmacdonald's suggestions, we had a discussion with our client on the approach to follow here. And it was decided that it would be best for the analytics system to pull the required data from DB rather than calculating the hashes using the SECRET_KEY
.
Therefore we are closing this PR. Thanks a lot to you and @bradenmacdonald for your reviews and suggestions.
cc. @Agrendalath
@kaustavb12 Even though your pull request wasn’t merged, please take a moment to answer a two question survey so we can improve your experience in the future.