edx-platform feat: add feature flag to enable legacy md5 hash for anonymous user id

Description

PR #26198 changed the hash algorithm to generate an anonymous user id from MD5 to SHAKE-128.

However, there are Open edX operators who maintain backward compatibility of anonymous user IDs after past rotations of their Django secret key. For them, altering the hashing algorithm was a breaking change that made their analytics inconsistent.

This PR, introduces a feature flag, to enable Open edX operators to switch back to the legacy MD5 algorithm if they so require.

Supporting information

Link to other information about the change, such as Jira issues, GitHub issues, or Discourse discussions. Be sure to check they are publicly readable, or if not, repeat the information here.

Testing instructions

Checkout this branch in devstack
Browse any course.
Open LMS shell and retrieve the anonymous user id for the user/course using

from common.djangoapps.student.models import AnonymousUserId
anonymous_user_ids = AnonymousUserId.objects.filter(user=8).filter(course_id='course-v1:edX+DemoX+Demo_Course').order_by('-id')

Note this anonymous id down
Delete the anonymous user id

anonymous_user_ids[0].delete()

Open /edx/etc/lms.yml
Under Features add the new Flag

FEATURES:
    ENABLE_LEGACY_MD5_HASH_FOR_ANONYMOUS_USER_ID: true

Rerun steps 2 to 4.
Verify the anonymous user ids are different

Deadline

"None" if there's no rush, or provide a specific date or event (and reason) if there is one.

Other information

Private Ref: BB-6533

Aug 08 '22 15:08 kaustavb12

Thanks for the pull request, @kaustavb12! Please note that it may take us up to several weeks or months to complete a review and merge your PR.

Feel free to add as much of the following information to the ticket as you can:

supporting documentation
Open edX discussion forum threads
timeline information ("this must be merged by XX date", and why that is)
partner information ("this is a course on edx.org")
any other information that can help Product understand the context for the PR

All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here.

Please let us know once your PR is ready for our review and all tests are green.

Aug 08 '22 15:08 openedx-webhooks

@natabene, this is ready for your review.

Aug 09 '22 17:08 Agrendalath

@kaustavb12 Thank you for your contribution.

Aug 09 '22 17:08 natabene

@kaustavb12 @Agrendalath I don't really understand the need for this change. As I understand, all past anonymous IDs are stored in a database table and won't be changed, even though new ones use a new algorithm. So why can't the operator here load the IDs from the database table for their analytics instead of depending on the details of the hash algorithm?

Oct 21 '22 18:10 bradenmacdonald

@bradenmacdonald, this operator is using a custom analytics system that is integrated with multiple external services. As far as I know, some of these services were using direct user IDs in the past. Therefore, the operator calculates MD5 hashes directly in their database and stores them to maintain the integrity of all existing data. There was a key rotation in the past that changed the anonymous user IDs, so they also have lots of legacy code to make these IDs backward compatible with all possible versions.

Hashes are calculated in the database, but database engines are not adapting new hashing functions too fast. For example, SHA-2 was accepted in 2002 (FIPS PUB 180-2), but it appeared in MySQL in 2010 (v5.5.5), and PostgreSQL didn't fully support its variants until 2018. As SHA-3 (FIPS PUB 202) was published in August 2015, it's not likely to see it fully supported soon. Therefore, it would be necessary to pre-calculate these hashes, instead of preparing them directly in the DB query. We don't have details about how exactly these analytics work, though, as we're not managing them, nor even have any access to the code.

Oct 24 '22 11:10 Agrendalath

@Agrendalath Thanks, though I am still still having trouble understanding why there's any need to calculate hashes within a DMBS at all. Can't there just be a table in every database/analytics system that needs one, which contains the map of real_id, anyonmous_id for all anonymous ids that have ever been used? i.e. use JOIN/lookups instead of hashing/calculations.

In other words, the root problem here is that the external systems were integrated in a way that was too highly coupled to the internal details of how Open edX calculates its anonymous IDs. So it seems like the right fix is to un-couple them from that, rather than add complexity to Open edX by supporting two hash mechanisms. However, I won't block the PR on it - I see it's only a couple lines of code in the end.

Oct 24 '22 17:10 bradenmacdonald

I share Braden's concerns -- this seems like a really unusual use-case, and I'd want to have a clearer picture of why this can't be solved with table lookups or a sync job before I'd be comfortable adding a toggle that enables an old hash algorithm.

(It also sounds like the analytics system might need to have a copy of all current and former SECRET_KEY values in order to operate, which seems like it could be a security concern.)

Oct 26 '22 01:10 timmc-edx

Thanks @timmc-edx.

We are discussing this change internally as well as with our client to see if the approach can be modified based on suggestions from you and @bradenmacdonald. I'll keep you posted on the same.

Nov 04 '22 04:11 kaustavb12

@timmc-edx

Based on your and @bradenmacdonald's suggestions, we had a discussion with our client on the approach to follow here. And it was decided that it would be best for the analytics system to pull the required data from DB rather than calculating the hashes using the SECRET_KEY.

Therefore we are closing this PR. Thanks a lot to you and @bradenmacdonald for your reviews and suggestions.

cc. @Agrendalath

Nov 17 '22 13:11 kaustavb12

@kaustavb12 Even though your pull request wasn’t merged, please take a moment to answer a two question survey so we can improve your experience in the future.

Nov 17 '22 13:11 openedx-webhooks

edx-platform edx-platform copied to clipboard

feat: add feature flag to enable legacy md5 hash for anonymous user id

Description

Supporting information

Testing instructions

Deadline

Other information

edx-platform
edx-platform copied to clipboard