dataprep
Feature Proposal: A unified API design for unenumerable text types in the clean module
Summary
A unified API design for unenumerable text types. Here, an unenumerable text type is a data type in which some parts are drawn from enumerable sets of values and the remaining parts are unenumerable (free-form) strings. A URL is an example.
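To make the definition concrete, here is a minimal sketch that separates a URL into the two kinds of parts using only the standard library. Which components count as enumerable is an assumption made for illustration (only the scheme is treated as enumerable here), and `describe_url` is a hypothetical helper, not part of the proposed API.

```python
from urllib.parse import urlsplit

# Assumption for illustration: the proposal does not fix which URL
# components are enumerable, so this grouping is hypothetical.
ENUMERABLE_PARTS = ("scheme",)
UNENUMERABLE_PARTS = ("host", "path", "query")


def describe_url(url: str) -> dict:
    """Group the components of a URL into enumerable and unenumerable parts."""
    parts = urlsplit(url)
    values = {
        "scheme": parts.scheme,
        "host": parts.hostname or "",
        "path": parts.path,
        "query": parts.query,
    }
    return {
        "enumerable": {name: values[name] for name in ENUMERABLE_PARTS},
        "unenumerable": {name: values[name] for name in UNENUMERABLE_PARTS},
    }


print(describe_url("https://www.example.com/search?q=dataprep"))
```

The scheme can be checked against a finite list of known values, while the host, path and query can only be checked structurally.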
Design-level Explanation Actions
- [X] Investigate prior art solutions for cleaning text data.
- [X] Decide which data cleaning operations to support.
- [X] Decide how to specify operations (e.g. function parameters vs. piping methods)
Design-level Explanation
Proposed function signatures for any data type that belongs to the unenumerable text types. Note that xxx
stands for the name of the data type.
For each type, only the cleaning function and the validation function are exposed to users.
from typing import Optional, Union

import dask.dataframe as dd
import pandas as pd


def clean_xxx(
    df: Union[pd.DataFrame, dd.DataFrame],
    col: str,
    output_format: str,
    kb_pathes: Union[str, dict] = "default",
    split: bool = False,
    inplace: bool = False,
    errors: str = 'coerce',
    progress: bool = True,
) -> pd.DataFrame:
    """
    Clean xxx type data in a DataFrame column.

    Parameters
    ----------
    df
        A pandas or Dask DataFrame containing the data to be cleaned.
    col
        The name of the column containing data of xxx type.
    output_format
        The output format of the standardized enumerable strings.
        (default: )
    kb_pathes
        The path of the user-specified knowledge base used by the functions
        for the enumerable parts. At the current stage, it must be in the
        user's local directory and follow the format we propose. By default,
        the knowledge base built into DataPrep is used.
        (default: "default")
    split
        If True, each component derived from the xxx string will be put into
        its own column.
        For example, if split = True for URL, then there will be several columns:
        - scheme
        - host
        - domain
        - ......
        (default: False)
    inplace
        If True, delete the column containing the data that was cleaned.
        Otherwise, keep the original column.
        (default: False)
    errors
        How to handle parsing errors.
        - 'coerce': invalid parsing will be set to NaN.
        - 'ignore': invalid parsing will return the input.
        - 'raise': invalid parsing will raise an exception.
        (default: 'coerce')
    progress
        If True, display a progress bar.
        (default: True)

    Examples
    --------
    Clean a column of xxx data.

    >>> df = pd.DataFrame({content of DataFrame})
    >>> clean_xxx(df, 'col_name')
           col_name
    0  cleaned_val0
    1  cleaned_val1
    """
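The three `errors` policies described in the docstring can be sketched without pandas as follows. This is a dependency-free illustration under stated assumptions, not the DataPrep implementation: `apply_with_errors` and `toy_parse` are hypothetical names, and the toy parser only lowercases the scheme of a string that contains `://`.

```python
import math
from typing import Callable, List


def toy_parse(val: str) -> str:
    """Hypothetical parser: lowercase the scheme, fail if no '://' is present."""
    scheme, sep, rest = val.strip().partition("://")
    if not sep or not scheme or not rest:
        raise ValueError(f"cannot parse {val!r}")
    return f"{scheme.lower()}://{rest}"


def apply_with_errors(
    values: List[str],
    parse: Callable[[str], str],
    errors: str = "coerce",
) -> list:
    """Apply `parse` to each value, handling failures per the `errors` policy."""
    if errors not in {"coerce", "ignore", "raise"}:
        raise ValueError(f"unknown errors policy: {errors!r}")
    cleaned = []
    for val in values:
        try:
            cleaned.append(parse(val))
        except ValueError:
            if errors == "raise":
                raise  # propagate the parsing failure to the caller
            # 'coerce' replaces the bad value with NaN, 'ignore' keeps the input
            cleaned.append(math.nan if errors == "coerce" else val)
    return cleaned


print(apply_with_errors(["HTTP://example.com", "not a url"], toy_parse, "ignore"))
```

With `"ignore"` the unparseable input is returned unchanged, with `"coerce"` it becomes NaN, and with `"raise"` the first failure stops processing.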
def validate_xxx(
    df: Union[str, pd.Series, dd.Series, pd.DataFrame, dd.DataFrame],
    col: str,
    kb_path: Union[str, dict] = "default"
) -> Union[bool, pd.Series, pd.DataFrame]:
    """
    Validate xxx type data in a DataFrame column. For each cell, return True or False.

    Parameters
    ----------
    df
        A string, a pandas or Dask Series, or a pandas or Dask DataFrame
        containing the data to be validated.
    col
        The name of the column to be validated.
    kb_path
        The path of the user-specified knowledge base used by the functions
        for the enumerable parts. At the current stage, it must be in the
        user's local directory and follow the format we propose. By default,
        the knowledge base built into DataPrep is used.
        (default: "default")
    """
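The signature above accepts either a single string or a collection of values. A minimal sketch of that dispatch, using plain Python lists in place of Series, could look like the following. The names `validate_url_sketch` and `_is_valid_url` are hypothetical, and the hard-coded set of allowed schemes is an assumption that merely stands in for a knowledge base of enumerable values.

```python
from typing import Iterable, List, Union
from urllib.parse import urlsplit

# Assumption: a tiny stand-in for the knowledge base of enumerable values.
ALLOWED_SCHEMES = {"http", "https"}


def _is_valid_url(val: object) -> bool:
    """Return True if `val` is a string with a known scheme and a host."""
    if not isinstance(val, str):
        return False
    try:
        parts = urlsplit(val.strip())
    except ValueError:  # raised by urlsplit for some malformed inputs
        return False
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)


def validate_url_sketch(data: Union[str, Iterable[object]]) -> Union[bool, List[bool]]:
    """Return one bool for a single string, or one bool per element otherwise."""
    if isinstance(data, str):
        return _is_valid_url(data)
    return [_is_valid_url(val) for val in data]


print(validate_url_sketch("https://example.com"))
print(validate_url_sketch(["https://example.com", "not a url", None]))
```

A real implementation would return a Series or DataFrame for the corresponding inputs. The list is used here only to keep the sketch free of third-party dependencies.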
def _split(
    val: Optional[str]
) -> Optional[dict]:
    """
    Split an unenumerable text value into its enumerable and unenumerable
    parts according to rules such as regular expressions.

    Parameters
    ----------
    val
        The value of the unenumerable text string.
    """
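As an illustration of the regex-based splitting described for `_split`, the sketch below breaks a URL into named parts. The pattern is a deliberately simplified assumption (it is not a full URL grammar), and `split_url_sketch` is a hypothetical name. Returning `None` for input that does not match is also an assumption about how failures would be signalled.

```python
import re
from typing import Dict, Optional

# Simplified pattern for illustration only: scheme://host[path][?query][#fragment]
_URL_PATTERN = re.compile(
    r"^(?P<scheme>[A-Za-z][A-Za-z0-9+.-]*)://"
    r"(?P<host>[^/?#]+)"
    r"(?P<path>[^?#]*)"
    r"(?:\?(?P<query>[^#]*))?"
    r"(?:#(?P<fragment>.*))?$"
)


def split_url_sketch(val: Optional[str]) -> Optional[Dict[str, str]]:
    """Split a URL string into named parts, or return None if it does not match."""
    if val is None:
        return None
    match = _URL_PATTERN.match(val.strip())
    if match is None:
        return None
    # Optional groups that did not participate are reported as empty strings.
    return {name: part or "" for name, part in match.groupdict().items()}


print(split_url_sketch("https://www.example.com/docs?page=2#top"))
```

The resulting dictionary keeps the enumerable part (`scheme`) and the unenumerable parts side by side, so each can be passed to its own formatter.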
def _format(
    val: Optional[str],
    output_format: str,
) -> Optional[str]:
    """
    Reformat an enumerable part string into the requested output format.

    Parameters
    ----------
    val
        The value of the enumerable text string.
    output_format
        The output format of the standardized enumerable string.
        (default: )
    """
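One possible shape for `_format`, applied to the scheme of a URL, is sketched below. The format names `"lower"` and `"upper"`, the set of known schemes, and the function name `format_scheme_sketch` are all assumptions chosen for illustration, since the proposal leaves the supported output formats unspecified.

```python
from typing import Optional

# Assumption: a small stand-in for the enumerable values in a knowledge base.
_KNOWN_SCHEMES = {"http", "https", "ftp"}


def format_scheme_sketch(val: Optional[str], output_format: str = "lower") -> Optional[str]:
    """Standardize a URL scheme, or return None if it is missing or unknown."""
    if val is None:
        return None
    normalized = val.strip().lower()
    if normalized not in _KNOWN_SCHEMES:
        return None
    if output_format == "lower":
        return normalized
    if output_format == "upper":
        return normalized.upper()
    raise ValueError(f"unknown output_format: {output_format!r}")


print(format_scheme_sketch(" HTTPS "))
print(format_scheme_sketch("ftp", output_format="upper"))
```

Because the enumerable part is checked against a finite set before formatting, an unknown value can be reported instead of being silently reformatted.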
Implementation-level Actions
- [ ] Implement built-in cleaning functions.
- [ ] Implement cleaning pipeline.
- [ ] Add support for user-defined functions.
Additional Tasks
- [ ] This task is put into a correct pipeline (Development Backlog or In Progress).
- [ ] The label of this task is set correctly.
- [ ] The issue is assigned to the correct person.
- [ ] The issue is linked to the related Epic.
- [ ] The documentation is changed accordingly.
- [ ] Tests are added accordingly.