data-prep-kit [Feature] Implement functionality to check for grammar, punctuation, spelling errors in a given text

Open Bytes-Explorer opened this issue 1 year ago • 7 comments

Search before asking

[X] I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

Implement a new feature to detect and correct grammar, punctuation, and spelling errors in a given text. This functionality should work on every row of a parquet file, where every row contains one document. The output should be True or False, added as an output column, together with the spans of the text where errors are detected and the corrected text.

This can be added as a new transform for text/NLP data. The code quality module can serve as a reference for how filters have been applied to code data.
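As a rough illustration of the requested output shape (a boolean flag, the error spans, and the corrected text per row), here is a minimal sketch. It is not the transform itself and makes several assumptions: rows are plain dicts rather than a parquet table, the text column is named `contents`, the output column names `has_errors`, `error_spans` and `corrected_text` are hypothetical, and the checker is a toy that only knows a few misspellings and repeated words. A real implementation would plug in a proper grammar and spelling library and read the rows from parquet.

```python
import re

# Toy lookup of misspellings (hypothetical); a real transform would call a
# spelling/grammar library instead of this table.
MISSPELLINGS = {"teh": "the", "recieve": "receive", "grammer": "grammar"}

WORD_RE = re.compile(r"[A-Za-z]+")
# A word followed by whitespace and the same word again, e.g. "the the".
REPEAT_RE = re.compile(r"\b(\w+)(\s+)\1\b", re.IGNORECASE)


def check_text(text):
    """Return (has_errors, spans, corrected_text) for one document."""
    edits = []  # (start, end, replacement)
    for m in WORD_RE.finditer(text):
        fix = MISSPELLINGS.get(m.group().lower())
        if fix is not None:
            edits.append((m.start(), m.end(), fix))
    for m in REPEAT_RE.finditer(text):
        # Flag the whitespace plus the repeated word, and delete both.
        edits.append((m.start(2), m.end(), ""))

    # Keep only non-overlapping edits, scanning left to right.
    edits.sort()
    kept, last_end = [], 0
    for start, end, fix in edits:
        if start >= last_end:
            kept.append((start, end, fix))
            last_end = end

    # Rebuild the text with the kept edits applied.
    parts, pos = [], 0
    for start, end, fix in kept:
        parts.append(text[pos:start])
        parts.append(fix)
        pos = end
    parts.append(text[pos:])

    spans = [(start, end) for start, end, _ in kept]
    return bool(kept), spans, "".join(parts)


def annotate_rows(rows, column="contents"):
    """Add the three hypothetical output columns to every row."""
    out = []
    for row in rows:
        has_errors, spans, corrected = check_text(row[column])
        out.append({**row, "has_errors": has_errors,
                    "error_spans": spans, "corrected_text": corrected})
    return out


if __name__ == "__main__":
    docs = [{"contents": "I recieve teh teh mail."},
            {"contents": "All good here."}]
    for row in annotate_rows(docs):
        print(row["has_errors"], row["error_spans"], row["corrected_text"])
```

Spans are character offsets into the original text, so they stay valid even when the corrected text has a different length.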

Are you willing to submit a PR?

[ ] Yes I am willing to submit a PR!

Jul 04 '24 14:07 Bytes-Explorer

@Bytes-Explorer I would like to work on this card. It will take the usual two weeks.

Aug 06 '24 13:08 SowmyaLR

Sounds good!

Aug 07 '24 04:08 Bytes-Explorer

@SowmyaLR Please don't hesitate to reach out if you run into any issues related to the framework as you build this transform. How familiar are you with the PDF2Parquet transform? It creates a row in a parquet file from the Markdown sections of the document.

Aug 20 '24 12:08 touma-I

Hi @touma-I, I have done some research on grammar correction and plan to start this transform this week. Thank you for the information about PDF2Parquet. I need to look into it and will get back to you with further questions.

Aug 20 '24 13:08 SowmyaLR

Hi @Bytes-Explorer @touma-I, I have a few questions about this task:

1. Can the input for this transform contain emojis, tables, or other non-alphabetical characters?
2. Can the input be in any language (for example, French, Hindi, Tamil)?
3. Will each doc in the parquet table maintain the structure of the original document? (That is, will each doc carry metadata saying that it belongs to paragraph 1, the next paragraph, and so on?)

Aug 22 '24 07:08 SowmyaLR

@Bytes-Explorer @touma-I any updates on the above questions?

Aug 27 '24 04:08 SowmyaLR

Sorry, missed this @SowmyaLR

1. Yes, we can clean for those issues too.
2. We need the solution to support English at a minimum, but it would also be nice to be multilingual.
3. We have two ways of ingesting documents right now, PDF and HTML. You can try out both of them with sample files to get an understanding of what the structure of the parquet file will be.

Aug 27 '24 06:08 Bytes-Explorer
