PyRIT BREAKING FEAT: introduce word-level converter

Description

This PR introduces a new base class called WordLevelConverter, which simplifies the creation of word-level converters by providing a reusable foundation that standardizes word selection for transformation and reduces code duplication across similar converters.

The key benefit is that one only needs to implement the specific word transformation logic (convert_word_async) while the base class handles word selection, iteration, and final result.

Word selection strategies/modes

The base class supports various word selection modes through the select_word_indices util function:

all - convert/transform every word in the prompt (default)
keywords - only specific words provided in a keywords list
random - a random subset of words based on a specified percentage
regex - words that match a regular expression pattern
custom - only specifically chosen word indices

List of refactored prompt converters

The following converters have been refactored to use the new base class:

BinaryConverter
CharSwapGenerator
EmojiConverter
LeetspeakConverter
ROT13Converter
StringJoinConverter
TextToHexConverter
UnicodeReplacementConverter

Note: I'm not sure if all the prompt converters that I've refactored should be word-level based, or if there are other converters that haven't been refactored that would benefit from this base class.

Related: https://github.com/Azure/PyRIT/pull/818#issuecomment-2749465002

Tests and Documentation

Updated docs and tests

Mar 31 '25 15:03 paulinek13

Thanks for the reviews! I'll make the requested changes and let you know once they’re all ready (might be a few days 😃)

Apr 17 '25 16:04 paulinek13

Zalgo is merged now so you can add it as well 🙂

Apr 27 '25 23:04 romanlutz

@paulinek13 there are still two open comment threads as far as I can see. Let me know if I should elaborate on anything!

Apr 29 '25 02:04 romanlutz

@romanlutz Thanks!

Yes, I'm aware of the remaining threads. I haven't resolved them yet because I'm just not quite satisfied with my original approach to initializing the word-level converters 🙂

After giving it some more thought, I believe it might be cleaner to move the selection configuration into separate methods, example:

converter = CharSwapConverter(max_iterations=1).select_random(proportion=0.5)

This seems to reduce repeating things in both docs and code, and could offer some other maintainability benefits as well (like making it easier to add a new converter based on WordLevelConverter and not having to copy-paste the docstring part with the args every time).

So this should address the following: https://github.com/Azure/PyRIT/pull/847#discussion_r2048863743 and https://github.com/Azure/PyRIT/pull/847#discussion_r2048875151

Seems like a nicer pattern overall. I have to say I like it 😄 No mode_kwargs, just methods allowing to change the selection of words:

class WordLevelConverter(PromptConverter):
	# ...
    @final
    def select_keywords(self: T, keywords: List[str]) -> T:
        """Configure the converter to only process words matching specific keywords."""
        self._selection_mode = "keywords"
        self._selection_keywords = keywords
        return self
	# ...

I hope it makes sense. I'll push this change shortly for you to review (if it won't be good we can always revert it 😃)

BTW, This PR’s taking a bit more time than planned, so thanks for bearing with me 😅

Apr 29 '25 08:04 paulinek13

The PR is touching a lot of pieces, so it taking a little bit is by no means your fault 🙂 Sorry for taking a while to get back to you. It's been a busy few weeks with our Build conference this week.

Regarding the update, I'd rather have all the parameters set in the constructor. It's just what we do everywhere else and keeps it consistent. The duplicated documentation (docstring) is expected and mainly an issue caused by the documentation generation process. We can't really solve that here.

Maybe the way to get rid of mode_kwargs is to put all the args (keywords, regex, etc.) explicitly (instead of as kwargs that are not explicitly mentioned) and thereby avoid the kwargs one?

def __init__(self, mode: str = "all", **mode_kwargs):

would become

def __init__(
    self,
    *,
    indices: Optional[List[int]] = None,
    keywords: Optional[List[str]] = None,
    proportion: Optional[float] = None,
    regex: Optional[Union[str, re.Pattern]] = None,

all is implicit if you don't specify any of the others. So you'd have to check that at most one of these is provided, but that's doable.

Wdyt?

May 23 '25 18:05 romanlutz

Thanks for the feedback and no worries about the delay -- I hope the conference went well 😃

Regarding the update, I'd rather have all the parameters set in the constructor. It's just what we do everywhere else and keeps it consistent. The duplicated documentation (docstring) is expected and mainly an issue caused by the documentation generation process. We can't really solve that here.

It makes sense for me now and I agree with it (consistency with the rest of the codebase). I'll implement this approach and add validation to ensure only one selection mode is specified at a time. Just a heads up that I might be a bit slower with updates over the next week or two, but I'll get this change pushed soon!

May 30 '25 20:05 paulinek13

No problem! Thank you for thinking through ways to make it intuitive. If it wasn't for consistency I might agree that your suggestion is preferable 🙂

The conference went great, thank you 🙂

May 30 '25 21:05 romanlutz

@romanlutz Hi, I've made the changes. I've also updated this PR with main and resolved the conflicts :)

Jun 08 '25 15:06 paulinek13

@paulinek13 fantastic! I'll review today and will let you know if anything is still missing. Last time, it looked like it's essentially ready, though.

Jun 08 '25 15:06 romanlutz