Add Regex Modifier

Open shuoyangd opened this issue 9 months ago • 1 comment

Description

Add a modifier that performs regex replacements.

Usage

regex_params = [
    {"pattern": "ö", "repl": "o"},
    {
        "pattern": "[^ !$%',-.0123456789;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz/:]",
        "repl": "",
    },
]

modifier = RegexModifier(regex_params)
# Step 1: replace "ö" with "o":
#   "Nein, es ist möglich🙃 " -> "Nein, es ist moglich🙃 "
# Step 2: remove anything other than alphanumeric characters and punctuation:
#   "Nein, es ist moglich🙃 " -> "Nein, es ist moglich "
# Step 3: remove extra spaces:
#   "Nein, es ist moglich " -> "Nein, es ist moglich"
output = modifier.modify_document("Nein, es ist möglich🙃 ")  # returns "Nein, es ist moglich"
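
For readers who want to see the behavior without the PR branch checked out, here is a minimal standalone sketch of what the usage above describes. It is not the implementation in this PR: `SimpleRegexModifier` is a hypothetical stand-in name, and the final `strip()` is an assumption made only so the result matches the "remove extra spaces" step in the comments. The rules are applied in list order with the standard library's `re` module.

```python
import re
from typing import Dict, List


class SimpleRegexModifier:
    """Hypothetical stand-in for the proposed RegexModifier (not the PR's code)."""

    def __init__(self, regex_params: List[Dict[str, str]]):
        # Compile each pattern once so repeated calls do not re-parse it.
        self._rules = [
            (re.compile(params["pattern"]), params["repl"])
            for params in regex_params
        ]

    def modify_document(self, text: str) -> str:
        # Apply the replacements in the order they were given.
        for pattern, repl in self._rules:
            text = pattern.sub(repl, text)
        # Assumption: leading/trailing whitespace is trimmed at the end,
        # which is what the usage example's last step implies.
        return text.strip()


regex_params = [
    {"pattern": "ö", "repl": "o"},
    {
        "pattern": "[^ !$%',-.0123456789;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz/:]",
        "repl": "",
    },
]

modifier = SimpleRegexModifier(regex_params)
output = modifier.modify_document("Nein, es ist möglich🙃 ")
print(output)  # prints "Nein, es ist moglich"
```

Order matters here: if the second rule ran first, "ö" would be deleted rather than replaced, giving "mglich". Note also that `,-.` inside the character class is a range from "," to ".", which happens to cover exactly ",", "-", and ".".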

Checklist

  • [x] I am familiar with the Contributing Guide.
  • [x] New or Existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

shuoyangd Feb 24 '25 17:02

Oh, could you also update the API docs at docs/user-guide/api/modifiers.rst?

ryantwolf Mar 06 '25 17:03

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

github-actions[bot] Jul 24 '25 02:07

This PR was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] Aug 01 '25 02:08