dataprep
Feature Proposal: A unified API design for enumerable text types in the clean module
Summary
A unified API design for enumerable text types. Here, enumerable text types are data types whose valid values can be derived from a knowledge base, such as country.
Design-level Explanation Actions
- [X] Investigate prior art solutions for cleaning text data.
- [X] Decide which data cleaning operations to support.
- [X] Decide how to specify operations (e.g. function parameters vs. piping methods)
Design-level Explanation
The function signatures proposed below apply to any data type that belongs to the enumerable text category. Note that xxx stands for the name of the data type.
For the implementation of each type, we expose only the cleaning function and the validation function to users.
from typing import Any, Optional, Union

import dask.dataframe as dd
import pandas as pd


def clean_xxx(
    df: Union[pd.DataFrame, dd.DataFrame],
    col: str,
    input_format: str = "auto",
    output_format: str = ...,  # placeholder: the default value is not yet specified
    kb_path: str = "default",
    inplace: bool = False,
    errors: str = 'coerce',
    progress: bool = True,
) -> pd.DataFrame:
    """
    Clean xxx type data in a DataFrame column.

    Parameters
    ----------
    df
        A pandas or Dask DataFrame containing the data to be cleaned.
    col
        The name of the column containing data of xxx type.
    input_format
        The input format of the enumerable text strings.
        If input_format = 'auto', the clean function detects the input format automatically.
        Otherwise, input_format must be a format specified by the user. Note that each
        type accepts a different set of strings for input_format.
        (default: "auto")
    output_format
        The output format of the standardized enumerable text strings.
        (default: )
    kb_path
        The path to a user-specified knowledge base. At the current stage, it must be
        in the user's local directory and follow the format we propose. By default,
        the knowledge base that DataPrep provides is used.
        (default: "default")
    inplace
        If True, delete the column containing the data that was cleaned.
        Otherwise, keep the original column.
        (default: False)
    errors
        How to handle parsing errors.
        - 'coerce': invalid parsing will be set to NaN.
        - 'ignore': invalid parsing will return the input.
        - 'raise': invalid parsing will raise an exception.
        (default: 'coerce')
    progress
        If True, display a progress bar.
        (default: True)

    Examples
    --------
    Clean a column of xxx data.

    >>> df = pd.DataFrame({content of DataFrame})
    >>> clean_xxx(df, 'col_name')
           col_name
    0  cleaned_val0
    1  cleaned_val1
    """
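
To make the `errors` options concrete, here is a minimal sketch of how the three modes could behave for a single cell. It is not the DataPrep implementation: the `TOY_KB` dictionary and the helper name `clean_value` are hypothetical, and a plain dictionary lookup stands in for whatever matching logic each type ends up using.

```python
import math
from typing import Dict, Union

# Hypothetical knowledge base: known spellings (lowercased) mapped to a standard form.
TOY_KB: Dict[str, str] = {
    "canada": "Canada",
    "ca": "Canada",
    "united states": "United States",
    "us": "United States",
}


def clean_value(val: str, errors: str = "coerce") -> Union[str, float]:
    """Standardize one value, following the proposed `errors` semantics."""
    if errors not in ("coerce", "ignore", "raise"):
        raise ValueError(f"unknown errors option: {errors!r}")
    key = val.strip().lower()
    if key in TOY_KB:
        return TOY_KB[key]
    if errors == "raise":
        raise ValueError(f"unable to parse value {val!r}")
    if errors == "ignore":
        return val  # hand the input back unchanged
    return math.nan  # "coerce": invalid values become NaN


cleaned = [clean_value(v) for v in [" CA ", "us", "Atlantis"]]
```

With the default `errors="coerce"`, the first two entries are standardized and the unknown value becomes NaN; `errors="ignore"` would return `"Atlantis"` as-is, and `errors="raise"` would raise a `ValueError`.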
def validate_xxx(
    df: Union[str, pd.Series, dd.Series, pd.DataFrame, dd.DataFrame],
    col: str,
    input_format: str = "auto",
    kb_path: str = "default"
) -> Union[bool, pd.Series, pd.DataFrame]:
    """
    Validate xxx type data in a DataFrame column. For each cell, return True or False.

    Parameters
    ----------
    df
        A string, a pandas or Dask Series, or a pandas or Dask DataFrame containing
        the data to be validated.
    col
        The name of the column to be validated.
    input_format
        The input format of the enumerable text strings.
        If input_format = 'auto', the validation function detects the input format automatically.
        Otherwise, input_format must be a format specified by the user. Note that each
        type accepts a different set of strings for input_format.
        (default: "auto")
    kb_path
        The path to a user-specified knowledge base. At the current stage, it must be
        in the user's local directory and follow the format we propose. By default,
        the knowledge base that DataPrep provides is used.
        (default: "default")
    """
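
The return type of `validate_xxx` depends on the input: a single string yields one boolean, while a column yields one boolean per cell. A minimal sketch of that dispatch follows, using plain Python sequences in place of pandas or Dask objects. The names `KNOWN_VALUES`, `validate_value`, and `validate_values` are hypothetical.

```python
from typing import List, Sequence, Union

# Hypothetical set of valid values, standing in for a loaded knowledge base.
KNOWN_VALUES = {"canada", "ca", "united states", "us"}


def validate_value(val: object) -> bool:
    """Return True if a single cell holds a known value, False otherwise."""
    return isinstance(val, str) and val.strip().lower() in KNOWN_VALUES


def validate_values(data: Union[str, Sequence[object]]) -> Union[bool, List[bool]]:
    """Validate one string, or every element of a sequence of cells."""
    if isinstance(data, str):
        return validate_value(data)
    return [validate_value(v) for v in data]


single = validate_values("Canada")                  # one bool
column = validate_values(["CA", "Atlantis", None])  # one bool per cell
```

Non-string cells such as `None` are treated as invalid here rather than raising; that is a choice made for this sketch, not something the proposal specifies.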
def _format(
    val: Optional[str],
    output_format: str,
) -> Optional[str]:
    """
    Reformat an enumerable text string into the requested output format.

    Parameters
    ----------
    val
        The value of the enumerable text string.
    output_format
        The output format of the standardized enumerable text string.
        (default: )
    """
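
One way to picture `_format` is as a lookup from any known representation of an entity to the representation named by `output_format`. The sketch below assumes a knowledge base made of records with one field per format. The record layout, the format names `"name"` and `"alpha-2"`, and the helper `format_value` are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical records: each entity lists its representation in every format.
RECORDS = [
    {"name": "Canada", "alpha-2": "CA"},
    {"name": "United States", "alpha-2": "US"},
]

# Index every representation (lowercased) back to its full record.
LOOKUP: Dict[str, Dict[str, str]] = {
    rep.lower(): rec for rec in RECORDS for rep in rec.values()
}


def format_value(val: Optional[str], output_format: str) -> Optional[str]:
    """Convert a known value to `output_format`; return None if it is unknown."""
    if val is None:
        return None
    record = LOOKUP.get(val.strip().lower())
    if record is None:
        return None
    if output_format not in record:
        raise ValueError(f"unsupported output format: {output_format!r}")
    return record[output_format]


full_name = format_value("ca", "name")
short_code = format_value("United States", "alpha-2")
```

Because every representation is indexed, the same function converts in both directions, which matches the idea of `input_format = "auto"` for inputs that are unambiguous.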
def _load_kb(
    kb_path: str = "default",
) -> Any:
    """
    Load the knowledge base from the specified path.

    Parameters
    ----------
    kb_path
        The path to a user-specified knowledge base. At the current stage, it must be
        in the user's local directory and follow the format we propose. By default,
        the knowledge base that DataPrep provides is used.
        (default: "default")
    """
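
The proposal does not fix the on-disk format of a knowledge base. Purely as an illustration, the sketch below assumes a local CSV file with a header row and treats the string `"default"` as a request for a built-in table. The names `DEFAULT_KB` and `load_kb`, and the CSV layout, are assumptions rather than the proposed format.

```python
import csv
import os
import tempfile
from typing import Dict, List

# Hypothetical built-in knowledge base, used when kb_path == "default".
DEFAULT_KB: List[Dict[str, str]] = [{"name": "Canada", "alpha-2": "CA"}]


def load_kb(kb_path: str = "default") -> List[Dict[str, str]]:
    """Return the built-in records, or read records from a local CSV file."""
    if kb_path == "default":
        return DEFAULT_KB
    with open(kb_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Usage: write a small custom knowledge base to a temporary file and load it.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "custom_kb.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "alpha-2"])
        writer.writeheader()
        writer.writerow({"name": "France", "alpha-2": "FR"})
    custom = load_kb(path)
```

Keeping the loader separate from the cleaning and validation functions means both can share one `kb_path` parameter and one parsing routine.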
Implementation-level Actions
- [ ] Implement built-in cleaning functions.
- [ ] Implement cleaning pipeline.
- [ ] Add support for user-defined functions.
Additional Tasks
- [ ] This task is put into the correct pipeline (Development Backlog or In Progress).
- [ ] The label of this task is set correctly.
- [ ] The issue is assigned to the correct person.
- [ ] The issue is linked to the related Epic.
- [ ] The documentation is changed accordingly.
- [ ] Tests are added accordingly.