BREAKING FEAT: introduce word-level converter
Description
This PR introduces a new base class called WordLevelConverter, which simplifies the creation of word-level converters by providing a reusable foundation that standardizes word selection for transformation and reduces code duplication across similar converters.
The key benefit is that one only needs to implement the specific word transformation logic (convert_word_async) while the base class handles word selection, iteration, and final result.
Word selection strategies/modes
The base class supports various word selection modes through the select_word_indices util function:
- all - convert/transform every word in the prompt (default)
- keywords - only specific words provided in a keywords list
- random - a random subset of words based on a specified percentage
- regex - words that match a regular expression pattern
- custom - only specifically chosen word indices
List of refactored prompt converters
The following converters have been refactored to use the new base class:
BinaryConverterCharSwapGeneratorEmojiConverterLeetspeakConverterROT13ConverterStringJoinConverterTextToHexConverterUnicodeReplacementConverter
Note: I'm not sure if all the prompt converters that I've refactored should be word-level based, or if there are other converters that haven't been refactored that would benefit from this base class.
Related: https://github.com/Azure/PyRIT/pull/818#issuecomment-2749465002
Tests and Documentation
Updated docs and tests
Thanks for the reviews! I'll make the requested changes and let you know once theyβre all ready (might be a few days π)
Zalgo is merged now so you can add it as well π
@paulinek13 there are still two open comment threads as far as I can see. Let me know if I should elaborate on anything!
@romanlutz Thanks!
Yes, I'm aware of the remaining threads. I haven't resolved them yet because I'm just not quite satisfied with my original approach to initializing the word-level converters π
After giving it some more thought, I believe it might be cleaner to move the selection configuration into separate methods, example:
converter = CharSwapConverter(max_iterations=1).select_random(proportion=0.5)
This seems to reduce repeating things in both docs and code, and could offer some other maintainability benefits as well (like making it easier to add a new converter based on WordLevelConverter and not having to copy-paste the docstring part with the args every time).
So this should address the following: https://github.com/Azure/PyRIT/pull/847#discussion_r2048863743 and https://github.com/Azure/PyRIT/pull/847#discussion_r2048875151
Seems like a nicer pattern overall. I have to say I like it π No mode_kwargs, just methods allowing to change the selection of words:
class WordLevelConverter(PromptConverter):
# ...
@final
def select_keywords(self: T, keywords: List[str]) -> T:
"""Configure the converter to only process words matching specific keywords."""
self._selection_mode = "keywords"
self._selection_keywords = keywords
return self
# ...
I hope it makes sense. I'll push this change shortly for you to review (if it won't be good we can always revert it π)
BTW, This PRβs taking a bit more time than planned, so thanks for bearing with me π
Edit: The related commit: use special methods insetad of kwargs for word selection configuration
The PR is touching a lot of pieces, so it taking a little bit is by no means your fault π Sorry for taking a while to get back to you. It's been a busy few weeks with our Build conference this week.
Regarding the update, I'd rather have all the parameters set in the constructor. It's just what we do everywhere else and keeps it consistent. The duplicated documentation (docstring) is expected and mainly an issue caused by the documentation generation process. We can't really solve that here.
Maybe the way to get rid of mode_kwargs is to put all the args (keywords, regex, etc.) explicitly (instead of as kwargs that are not explicitly mentioned) and thereby avoid the kwargs one?
def __init__(self, mode: str = "all", **mode_kwargs):
would become
def __init__(
self,
*,
indices: Optional[List[int]] = None,
keywords: Optional[List[str]] = None,
proportion: Optional[float] = None,
regex: Optional[Union[str, re.Pattern]] = None,
all is implicit if you don't specify any of the others. So you'd have to check that at most one of these is provided, but that's doable.
Wdyt?
Thanks for the feedback and no worries about the delay -- I hope the conference went well π
Regarding the update, I'd rather have all the parameters set in the constructor. It's just what we do everywhere else and keeps it consistent. The duplicated documentation (docstring) is expected and mainly an issue caused by the documentation generation process. We can't really solve that here.
It makes sense for me now and I agree with it (consistency with the rest of the codebase). I'll implement this approach and add validation to ensure only one selection mode is specified at a time. Just a heads up that I might be a bit slower with updates over the next week or two, but I'll get this change pushed soon!
No problem! Thank you for thinking through ways to make it intuitive. If it wasn't for consistency I might agree that your suggestion is preferable π
The conference went great, thank you π
@romanlutz
Hi, I've made the changes. I've also updated this PR with main and resolved the conflicts :)
@paulinek13 fantastic! I'll review today and will let you know if anything is still missing. Last time, it looked like it's essentially ready, though.