dataprep
Feature Proposal: A unified API design for semantic type inference
Summary
A unified API design for semantic type inference that can be used by both the clean module and the EDA module.
Design-level Explanation Actions
- [X] Investigate prior art solutions for cleaning text data.
- [X] Decide which data cleaning operations to support.
- [X] Decide how to specify operations (e.g. function parameters vs. piping methods)
Design-level Explanation
from typing import List, Optional, Union

import dask.dataframe as dd
import pandas as pd


def infer_semantic_type(
    df: Union[pd.DataFrame, dd.DataFrame]
) -> Optional[List[str]]:
    """
    Infer the semantic types of the columns.
    If a column is of the number + alphabet + separator type, clean it with clean_num.
    If a column is of the text type (enumerable or unenumerable), clean it with clean_text.

    Parameters
    ----------
    df
        A pandas or Dask DataFrame containing the data whose column types are inferred.
    """
def _infer_num_type(
    val: str
) -> Optional[str]:
    """
    Infer the semantic type of a value if it is definitely of the number + alphabet + separator type.

    Parameters
    ----------
    val
        The string value of a cell.
    """
def _infer_text_type(
    val: str
) -> Optional[str]:
    """
    Infer the semantic type of a value if it is definitely of the text type.

    Parameters
    ----------
    val
        The string value of a cell.
    """
Implementation-level Explanation
Rationale and Alternatives
Prior Art
Future Possibilities
Implementation-level Actions
- [ ] Implement built-in cleaning functions.
- [ ] Implement cleaning pipeline.
- [ ] Add support for user-defined functions.
Additional Tasks
- [ ] This task is put into the correct pipeline (Development Backlog or In Progress).
- [ ] The label of this task is set correctly.
- [ ] The issue is assigned to the correct person.
- [ ] The issue is linked to the related Epic.
- [ ] The documentation is changed accordingly.
- [ ] Tests are added accordingly.