Add Waffle flags to roll out extracted XBlocks
Once a block in https://github.com/openedx/xblocks-contrib is ready, we will need to safely roll it out in edx-platform. To do that safely, we need to be able to toggle between the old and new block quickly without putting course content or user state at risk. Unfortunately, I do not think we could easily toggle between blocks in setup.py.
However, what we can do (without touching setup.py at all) is define a Waffle flag for each block, which toggles between the built-in class and the extracted class. Let's use the WordCloud block as an example:
-
Define a stub WordCloudBlock class in xblocks-contrib.
- Include a class constant attribute:
is_extracted = True. More on this later.
- Include a class constant attribute:
-
Create a new module
xmodule/toggles.py:- Define a Waffle flag for the extractable block, for example:
# .. toggle_name: USE_EXTRACTED_WORD_CLOUD_BLOCK # .. toggle_description: ... # .. toggle_warning: This ls not production ready until <link to extraction GitHub issue> is closed # .. toggle_use_cases: temporary # .. etc. USE_EXTRACTED_WORD_CLOUD_BLOCK = WaffleFlag('xmodule.use_extracted_block.word_cloud', __name__) -
In the builtin XBlock definition module, for example
xmodule/word_cloud_block.py:- At the top of the module, import the extracted block (even if it still a work-in-progress):
from xblocks_contrib import WordCloudBlock as _ExtractedWordCloudBlock - Change
class WordCloudBlocktoclass _BuiltInWordCloudBlock - In
_BuiltInWordCloudBlock, add a new constant class attribute:is_extracted = False. More on this later. - At the end of the file, conditionally set the class to either the extracted or the builtin block class:
WordCloudBlock = ( _ExractedWordCloudBLock if USE_EXTRACTED_WORD_CLOUD_BLOCK.is_enabled() else _BuiltInWordCloudBlock )
- At the top of the module, import the extracted block (even if it still a work-in-progress):
Do this for all component XBlocks:
- PollBlock
- WordCloudBlock
- AnnotatableBlock
- LtiBlock
- HtmlBlock
- DiscussionBlock
- ProblemBLock
- VideoBlock
Lastly, make use of the is_extracted class attribute you added earlier by adding a custom monitoring attribute to the render_xblock view, which will show up in site operator's monitoring systems and allow them to isolate any errors and warnings which come from extracted blocks:
if (is_extracted := getattr(XBlock.load_class(usage_key.block_type), 'is_extracted') is not None:
set_custom_attribute('block_extracted', is_extracted)
Once all of that is merged, we can toggle between each builtin block and its extracted counterpart in development. Also, edX.org can toggle it in production when they are ready, giving them the ability to test this ahead of time.
Only once an extracted block is proven to be stable, we can truly replace its built-in block:
- in a new xblocks-contrib major version: add the entry-point to setup.py
- in a single edx-platform PR: upgrade xblocks-contrib, remove the flag, remove the built-in block definition.
FYI @farhan and @feanil . Once an extracted block is ready to go, I think this is how we should roll it out. Let me know if you disagree.
@kdmccormick thanks for brainstorming the solution, it seems interesting to me, some points:
- What entry point name of Extracted WordCloud XBlock are we going to install into
edx-platform? Currently I have kept the entry point same. a. If both entry points will be same xblock load class will give AmbigousPluginError. b. I think we can keep different entry point of extracted xblock at start and change it to original while removing_BuiltInWordCloudBlock. It will require pushing a new version ofxblocks-contribon pypi. - In persuit of minimal
edx-platformI think we shouldn't ship the non-basic xblocks out of the box. Clients could install it later. Or we can think on these lines further.
@farhan
- My idea is that there is still just one entrypoint:
The point of the waffle flag is to swap between two different values of"word_cloud = xmodule.word_cloud_block:WordCloudBlock",xmodule.word_cloud_block:WordCloudBlock. When the flag is off, it gets assigned to_BuiltInWordCloudBlock. When the flag is on, it gets assigned to_ExtractedWordCloudBlock. The benefit is that site operators can switch back and forth between the old and new implementations for their existing content without redeploying their site, helping them minimize their MTTR (mean time to recovery) when they identify an issue with the new blocks. - We should absolutely pursue a minimal edx-platform but I do not think that means we have to ship all builtin blocks at once. ProblemBlock and VideoBlock will each take a much longer time than the other builtin blocks. Under release early release often philosophy, we should get the simpler blocks out to production as soon as possible, for several reasons:
- A tighter feedback loop: If there is a flaw in how we are extracting xblocks that can only be uncovered in a prod environment, we want to learn that as soon as possible.
- Isolated releases: When there is a bug, it will be easier to identify if we are incorporating a single new block rather than a whole suite of new blocks.
- Demonstrating value: If you can provide some new features or architectural benefit, it better to provide that as soon as possible, rather than waiting to deliver the value all at once when everything is done.
- Maintenance: Putting code in production forces it to be in working order. It is easy to let dead or half-baked code languish when it is never being used on anyone's site.
- Team morale: Shipping to production feels good! We should do it whenever we can.
Is the idea here that there is safety from data races because the waffle flag is loaded in the same transaction as the xblock DB accesses? (Are there any caches we need to worry about?)
@timmc-edx
Is the idea here that there is safety from data races because the waffle flag is loaded in the same transaction as the xblock DB accesses?
Nope, I hadn't thought about data races until you mentioned them. I don't think we need to worry about them, since the data model and data access patterns should be unchanged between the built-in and extracted block implementations. Same with the caches-- if we have done our work right, it will just be a different class accessing the same caches in the same exact way.
FYI @farhan @robrap -- I had originally said that we need CourseWaffleFlags so that individual course runs could switch back and forth between the blocks. I was wrong about that: the set-up I describe would globally toggle between the extracted block and the builtin block, so we should just use regular WaffleFlags.
Toggling between block classes per course run would be significantly more complex to implement and, in my opinion, would introduce operational risks in itself. Standard WaffleFlags would still permit operators to use user-by-user rollout or percentage-based rollout, so I think they are sufficient.
@kdmccormick: That makes sense. Agreed that percent roll-out is what is most important. Could you also add one (or more) custom attributes that would be useful for a dashboard to show which is being used? See https://github.com/openedx/edx-django-utils/issues/328.
@robrap as we discussed, I've added a note for the team to set a custom attribute. The key will be block_extracted and the values will be True, False, or None (if N/A). Let me know if any of that seems off.
Instead of a special value like None, I think the attribute name (or key) should not be added if not needed. Does that makes sense?
@kdmccormick:
- Repeating my last comment with different words: the custom attribute
block_extractedshould only have values of True/False, and should not be added at all where not applicable. - We noted in the meeting the existing 2 xblock-related custom attributes that would also be useful for observability. What were these again?
- We also discussed a process for overridden XBlocks that may need to reuse your waffle flag in order to re-implement the same conditional (and possibly the new custom attribute).
- Will you eventually turn all of this into a document for rollout? Could this apply to other orgs that might have overridden XBlocks?
FYI: @nsprenkle: Regarding our overridden XBlocks.
@robrap
- Yep, understood, I had True-False-None because it was one less line of code, but since it sounds like True-False-N/A is actually better for monitoring, I've updated the ticket to reflect that.
block_typeandusage_key- and 4. Yep, when first flag is ready to be used, I'll put together a doc. The XBlock overrides feature is brand new (landed last month) but it's possible that orgs will start using it as soon as Sumac is released.
Thanks @kdmccormick.
- Yep, understood, I had True-False-None because it was one less line of code, but since it sounds like True-False-N/A is actually better for monitoring, I've updated the ticket to reflect that.
I hadn't realized you had an implementation and that the former was simpler. For monitoring, really anything that works is fine. Feel free to do whichever, and I can verify that it works as planned (e.g. we can easily tell the difference between the 3 cases, or by default, at least 2 of the cases).
@robrap We've had to adjust the acceptance criteria here a bit. Rather than using Waffle flags, we'll be using Django settings. The reason for the change is that Waffle flags require Model access, which Django does not make available until after all INSTALLED_APPS are loaded; but, loading INSTALLED_APPS causes the xmodule/ modules to be imported, which invoked the Waffle flags, thus creating a circular dependency. Django settings, on the other hand, do not have this limitation; they are available to use very early in the Django initialization process.
From earlier:
Standard WaffleFlags would still permit operators to use user-by-user rollout or percentage-based rollout, so I think they are sufficient.
Last message:
Rather than using Waffle flags, we'll be using Django settings.
Although this may be for good reasons, I'm simply clarifying the consequence that user-by-user or percentage-based rollout will no longer be available, and this will now be an all-or-nothing configuration (presumably per block?).
That said - I stepped into this conversation: 1) to provide some monitoring tips, and 2) because I'm a busybody that gets myself involved in other people's business. :) I'm going to drop from this ticket, in favor of someone at 2U who owns this. I will make sure someone is aware of this. Thank you @kdmccormick. (Of course if there are questions that require my specific expertise, please feel free to ping me.)
Although this may be for good reasons, I'm simply clarifying the consequence that user-by-user or percentage-based rollout will no longer be available, and this will now be an all-or-nothing configuration (presumably per block?).
Yes, exactly right.
Final inform: TNL is generally tracking this and not seeing any issues. Thanks again @kdmccormick, and good luck.