enso Add data cleanse component

Pull Request Description

Add new cleanse and text_cleanse components

Important Notes

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

[x] The documentation has been updated, if necessary.
[x] Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
[x] All code follows the Scala, Java, TypeScript, and Rust style guides. In case you are using a language not listed above, follow the Rust style guide.
[x] Unit tests have been written where possible.

May 07 '24 15:05 AdRiley

would we perhaps want an input.replace "\d" " " cleanse too? (like Duplicate_Whitespace, except always replacing with spaces - for example, 'foo bar\t\t baz')

although it probably wouldn't be too late to add that only when someone actually has a use case for it...

May 16 '24 09:05 somebody1234

> would we perhaps want an input.replace "\d" " " cleanse too? (like Duplicate_Whitespace, except always replacing with spaces - for example, 'foo bar\t\t baz')
>
> although it probably wouldn't be too late to add that only when someone actually has a use case for it...

So operator78170.replace (regex "\d") ' ' exists today. But do you mean the ability to use our named regexes in replace?

That is an interesting idea...

May 16 '24 09:05 AdRiley

>> would we perhaps want an input.replace "\d" " " cleanse too? (like Duplicate_Whitespace, except always replacing with spaces - for example, 'foo bar\t\t baz') although it probably wouldn't be too late to add that only when someone actually has a use case for it...
>
> So operator78170.replace (regex "\d") ' ' exists today. But do you mean the ability to use our named regexes in replace?
>
> That is an interesting idea...

I understood it as the ability for the regex to replace the number not with an empty "", but with a single space " ". Not sure if that was what you meant, @somebody1234?

But I'm writing because this also struck a chord with me - I was thinking that with this method, when cleaning e.g. [a,b,c] of all non-letters, I will get abc. Often that is what I want. But it feels to me that I may also want to get a b c. For example, for language processing tasks, if I want to do some naive cleanup of punctuation before tokenization, I probably want foo:bar, baz... Hmm? to become foo bar baz Hmm, so that I can then split it on " " to get all the tokens. (Of course, whether foo:bar should become foo bar or actually foobar is highly use-case dependent.)

But essentially this raises the question of whether we should be able to control if the cleansing should "preserve separation between words". I.e. by default we replace everything with "", but we could have an alternative mode where we replace everything with " " and then remove duplicate whitespace as the last step, to normalize all separators to a single space.
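A minimal sketch of the two modes described above, written in Python rather than Enso purely for illustration. The function name `cleanse_non_letters` and the `preserve_separation` flag are hypothetical and not part of this PR. Treating only ASCII letters as "letters" and trimming the ends of the result are both assumptions.

```python
import re


def cleanse_non_letters(text: str, preserve_separation: bool = False) -> str:
    """Remove every non-letter from text (hypothetical helper, not the PR's API)."""
    if not preserve_separation:
        # Default mode: replace each non-letter with the empty string.
        return re.sub(r"[^A-Za-z]", "", text)
    # Alternative mode: replace each non-letter with a space, then collapse
    # runs of whitespace into one space as the last step.
    spaced = re.sub(r"[^A-Za-z]", " ", text)
    # Trimming the ends is an assumption; the discussion does not specify it.
    return re.sub(r"\s+", " ", spaced).strip()


print(cleanse_non_letters("[a,b,c]"))                            # abc
print(cleanse_non_letters("[a,b,c]", preserve_separation=True))  # a b c
print(cleanse_non_letters("foo:bar, baz... Hmm?", preserve_separation=True).split(" "))
# ['foo', 'bar', 'baz', 'Hmm']
```

With the flag set, the result can be split on " " directly, which is the tokenization use case mentioned above.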

But it feels like this is complicating this rather simple tool, so maybe that is not really what we want at this stage for this component. Just throwing ideas around.

May 16 '24 09:05 radeusgd

whoops my bad, i meant "\s+" " " :sweat_smile:
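For reference, the corrected replacement can be sketched in Python (not Enso) as below. The helper name `normalize_whitespace` is hypothetical. It turns every run of whitespace, including tabs, into a single space.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Replace each run of whitespace with one space (hypothetical helper)."""
    return re.sub(r"\s+", " ", text)


print(normalize_whitespace("foo bar\t\t baz"))  # foo bar baz
```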

May 16 '24 10:05 somebody1234
