
[Feature] Implement functionality to check for grammar, punctuation, spelling errors in a given text

Open Bytes-Explorer opened this issue 1 year ago • 7 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

Implement a new feature to detect and correct grammar, punctuation, and spelling errors in a given text. This functionality should work on every row of a parquet file, where every row contains one document. The output should be a True/False flag indicating whether errors were found, added as an output column, along with the spans of the text where errors were detected and the corrected text.

This can be added as a new transform for text/NLP data. The code quality module can serve as a reference for how filters have been applied to code data.
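
To make the requested output shape concrete, here is a minimal sketch, not an actual data-prep-kit transform. It assumes each row is a dict with a text column named `contents`, and it stands in for a real grammar/spelling checker with a tiny hypothetical lookup table (`KNOWN_MISSPELLINGS`). The function names and the output column names `has_errors`, `error_spans`, and `corrected_text` are illustrative assumptions, not part of the project.

```python
import re

# Hypothetical toy lookup; a real transform would call a grammar/spell-check library.
KNOWN_MISSPELLINGS = {"teh": "the", "recieve": "receive", "paragrah": "paragraph"}

WORD_RE = re.compile(r"[A-Za-z]+")


def check_text(text):
    """Return the error flag, error spans, and corrected text for one document."""
    spans = []   # (start, end) character offsets into the ORIGINAL text
    pieces = []  # fragments of the corrected text, joined at the end
    last = 0
    for match in WORD_RE.finditer(text):
        fix = KNOWN_MISSPELLINGS.get(match.group().lower())
        if fix is None:
            continue
        spans.append((match.start(), match.end()))
        pieces.append(text[last:match.start()])
        pieces.append(fix)  # note: this toy version does not preserve capitalization
        last = match.end()
    pieces.append(text[last:])
    return {
        "has_errors": bool(spans),
        "error_spans": spans,
        "corrected_text": "".join(pieces),
    }


def annotate_rows(rows, text_column="contents"):
    """Add the three output columns to every row (one document per row)."""
    return [{**row, **check_text(row[text_column])} for row in rows]


rows = [{"contents": "I will recieve teh file."}, {"contents": "All good here."}]
annotated = annotate_rows(rows)
```

In an actual transform the rows would come from the parquet table and the three new fields would be appended to it as columns; the per-document logic would stay the same.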

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

Bytes-Explorer avatar Jul 04 '24 14:07 Bytes-Explorer

@Bytes-Explorer I would like to work on this card. Will take the usual time of two weeks for this card.

SowmyaLR avatar Aug 06 '24 13:08 SowmyaLR

Sounds good!

Bytes-Explorer avatar Aug 07 '24 04:08 Bytes-Explorer

@SowmyaLR Please don't hesitate to reach out if you run into any issues related to the framework as you build this transform. How familiar are you with the PDF2Parquet transform? It creates a row in a parquet file from each Markdown section of the document.

touma-I avatar Aug 20 '24 12:08 touma-I

Hi @touma-I, I have done some research on grammar correction and plan to start on this transform this week. Thank you for the information about the PDF2Parquet transform. I will look into it and get back to you with any further questions.

SowmyaLR avatar Aug 20 '24 13:08 SowmyaLR

Hi @Bytes-Explorer @touma-I, I have a few questions about this task:

  1. Can the input for this transform contain emojis, tables, and other non-alphabetical characters?
  2. Can the input be in any language (for example, French, Hindi, or Tamil)?
  3. Will each doc in the parquet table maintain the structure of the original document? (For example, will each doc carry metadata saying that it belongs to paragraph 1, the next paragraph, and so on?)

SowmyaLR avatar Aug 22 '24 07:08 SowmyaLR

@Bytes-Explorer @touma-I any updates on the above questions?

SowmyaLR avatar Aug 27 '24 04:08 SowmyaLR

Sorry, missed this @SowmyaLR

  1. Yes, we can clean for those issues too.
  2. We need the solution to support the English language at a minimum, but it would also be nice to support multiple languages.
  3. We have two ways of ingesting documents right now, PDF and HTML. You can try out both of them with sample files to get an understanding of what the structure of the parquet file will be.

Bytes-Explorer avatar Aug 27 '24 06:08 Bytes-Explorer