NeMo-Curator
Add Regex Modifier
Description
Add a `RegexModifier` that applies a list of regex replacements to a document, one after another in the order given.
Usage
regex_params = [
    {"pattern": "ö", "repl": "o"},
    {
        "pattern": "[^ !$%',-.0123456789;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz/:]",
        "repl": "",
    },
]
modifier = RegexModifier(regex_params)
# "Nein, es ist möglich🙃 " -> replace "ö" with "o" ->
# "Nein, es ist moglich🙃 " -> remove every character outside the allowed set (letters, digits, spaces, listed punctuation) ->
# "Nein, es ist moglich " -> remove extra spaces ->
# "Nein, es ist moglich"
output = modifier.modify_document("Nein, es ist möglich🙃 ")  # returns "Nein, es ist moglich"
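For reviewers who want to try the behavior without the PR branch, here is a minimal standalone sketch of the chained replacement described in the usage comments. It is not the PR's implementation: `SketchRegexModifier` is a hypothetical name, and the final `strip()` is an assumption made to match the "remove extra spaces" step. The second pattern is rewritten with explicit ranges and an escaped hyphen; it should match the same characters as the pattern above, where `,-.` is a range covering the comma, hyphen, and period.

```python
import re
from typing import Dict, List


class SketchRegexModifier:
    """Hypothetical stand-in for the PR's RegexModifier (not the actual implementation)."""

    def __init__(self, regex_params: List[Dict[str, str]]) -> None:
        # Compile each pattern once so repeated calls avoid recompilation.
        self._rules = [(re.compile(p["pattern"]), p["repl"]) for p in regex_params]

    def modify_document(self, text: str) -> str:
        # Apply the replacements in the order they were given.
        for pattern, repl in self._rules:
            text = pattern.sub(repl, text)
        # Assumption: surrounding whitespace is stripped at the end,
        # mirroring the "remove extra spaces" step in the usage comments.
        return text.strip()


regex_params = [
    {"pattern": "ö", "repl": "o"},
    # Same allowed set as the usage example, with ranges spelled out.
    {"pattern": r"[^ !$%',\-.0-9;?A-Za-z/:]", "repl": ""},
]
sketch = SketchRegexModifier(regex_params)
result = sketch.modify_document("Nein, es ist möglich🙃 ")
print(result)
```

Because the rules run in order, the "ö" replacement has to come before the allowlist pattern; otherwise "ö" would be deleted rather than mapped to "o".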
Checklist
- [x] I am familiar with the Contributing Guide.
- [x] New or Existing tests cover these changes.
- [x] The documentation is up to date with these changes.
Oh, could you also update the API docs at docs/user-guide/api/modifiers.rst?
This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.
This PR was closed because it has been inactive for 7 days since being marked as stale.