
Feature Proposal: A unified API design for semantic type inferring

Open qidanrui opened this issue 3 years ago • 0 comments

Summary

A unified API design for semantic type inference, which can be used by both the clean module and EDA.

Design-level Explanation Actions

  • [X] Investigate prior art solutions for cleaning text data.
  • [X] Decide which data cleaning operations to support.
  • [X] Decide how to specify operations (e.g. function parameters vs. piping methods)

Design-level Explanation


from typing import List, Optional, Union

import dask.dataframe as dd
import pandas as pd


def infer_semantic_type(
    df: Union[pd.DataFrame, dd.DataFrame]
) -> Optional[List[str]]:
    """
    Infer the semantic type of each column.

    If a column is of number + alphabet + separator type, clean it with clean_num.
    If a column is of text type (enumerable or non-enumerable), clean it with clean_text.

    Parameters
    ----------
    df
        A pandas or Dask DataFrame containing the data to be inferred.
    """


def _infer_num_type(
    val: str
) -> Optional[str]:
    """
    Infer the semantic type of the column if the value is definitely of
    number + alphabet + separator type.

    Parameters
    ----------
    val
        The string value of a cell.
    """


def _infer_text_type(
    val: str
) -> Optional[str]:
    """
    Infer the semantic type of the column if the value is definitely of text type.

    Parameters
    ----------
    val
        The string value of a cell.
    """

Implementation-level Explanation

Rationale and Alternatives

Prior Art

Future Possibilities

Implementation-level Actions

  • [ ] Implement built-in cleaning functions.
  • [ ] Implement cleaning pipeline.
  • [ ] Add support for user-defined functions.

Additional Tasks

  • [ ] This task is put into a correct pipeline (Development Backlog or In Progress).
  • [ ] The label of this task is set correctly.
  • [ ] The issue is assigned to the correct person.
  • [ ] The issue is linked to the related Epic.
  • [ ] The documentation is changed accordingly.
  • [ ] Tests are added accordingly.

qidanrui, Jun 15 '21 03:06