
Feature Proposal: A unified API design for enumerable text types in clean module

Open qidanrui opened this issue 3 years ago • 0 comments

Summary

A unified API design for enumerable text types. Note that enumerable text types here are data types whose valid values can be derived from a knowledge base, such as country.

Design-level Explanation Actions

  • [X] Investigate prior art solutions for cleaning text data.
  • [X] Decide which data cleaning operations to support.
  • [X] Decide how to specify operations (e.g. function parameters vs. piping methods)

Design-level Explanation

Proposed function signatures for any data type that belongs to the enumerable text types. Note that xxx stands for the name of the data type. For the implementation of each type, only the cleaning function and the validation function are exposed to users.

from typing import Any, Optional, Union

import dask.dataframe as dd
import pandas as pd


def clean_xxx(
    df: Union[pd.DataFrame, dd.DataFrame],
    col: str,
    input_format: str = "auto",
    output_format: str = ...,  # placeholder: the default is not specified in this proposal
    kb_path: str = "default",
    inplace: bool = False,
    errors: str = "coerce",
    progress: bool = True,
) -> pd.DataFrame:
    """
    Clean xxx type data in a DataFrame column.

    Parameters
    ----------
    df
        A pandas or Dask DataFrame containing the data to be cleaned.
    col
        The name of the column containing data of xxx type.
    input_format
        The input format of the enumerable text string.
        If input_format = 'auto', the cleaning function detects the input format automatically.
        Otherwise, input_format must be the format specified by the user. Note that the accepted input_format strings differ from type to type.

        (default: "auto")
    output_format
        The output format of the standardized enumerable text string.

        (default: )
    kb_path
        The path of the user-specified knowledge base. At the current stage, it must be in the user's local directory and follow the format we propose.

        (default: "default")
    inplace
        If True, delete the column containing the data that was cleaned.
        Otherwise, keep the original column.

        (default: False)
    errors
        How to handle parsing errors.
        - 'coerce': invalid parsing will be set to NaN.
        - 'ignore': invalid parsing will return the input.
        - 'raise': invalid parsing will raise an exception.

        (default: 'coerce')
    progress
        If True, display a progress bar.

        (default: True)

    Examples
    --------
    Clean a column of xxx data.

    >>> df = pd.DataFrame({content of DataFrame})
    >>> clean_xxx(df, 'col_name')
           col_name
    0  cleaned_val0
    1  cleaned_val1
    """
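The three `errors` modes described above can be illustrated on plain Python values. This is only a minimal sketch, not the proposed implementation: `clean_value`, `TOY_KB`, and the country entries are hypothetical stand-ins for a real knowledge-base lookup, and the per-column pandas/Dask logic is left out.

```python
import math
from typing import Dict, Optional, Union

# Hypothetical toy knowledge base: lower-cased surface form -> standardized value.
TOY_KB: Dict[str, str] = {
    "canada": "Canada",
    "ca": "Canada",
    "united states": "United States",
    "us": "United States",
}


def clean_value(val: Optional[str], errors: str = "coerce") -> Union[str, float, None]:
    """Standardize one value, handling failures according to `errors`."""
    if errors not in ("coerce", "ignore", "raise"):
        raise ValueError(f"errors must be 'coerce', 'ignore' or 'raise', got {errors!r}")
    key = val.strip().lower() if isinstance(val, str) else None
    if key in TOY_KB:
        return TOY_KB[key]
    if errors == "raise":
        raise ValueError(f"unable to parse value {val!r}")
    if errors == "ignore":
        return val  # hand the input back unchanged
    return math.nan  # 'coerce': mark the value as missing


values = ["CA", " united states ", "Atlantis"]
coerced = [clean_value(v, errors="coerce") for v in values]
ignored = [clean_value(v, errors="ignore") for v in values]
```

With 'coerce' the unknown value becomes NaN, with 'ignore' it is returned unchanged, and with 'raise' the call fails with a `ValueError`.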

def validate_xxx(
    df: Union[str, pd.Series, dd.Series, pd.DataFrame, dd.DataFrame],
    col: str,
    input_format: str = "auto",
    kb_path: str = "default",
) -> Union[bool, pd.Series, pd.DataFrame]:
    """
    Validate xxx type data in a DataFrame column. For each cell, return True or False.

    Parameters
    ----------
    df
        A string, a pandas or Dask Series, or a pandas or Dask DataFrame containing the data to be validated.
    col
        The name of the column to be validated.
    input_format
        The input format of the enumerable text string.
        If input_format = 'auto', the validation function detects the input format automatically.
        Otherwise, input_format must be the format specified by the user. Note that the accepted input_format strings differ from type to type.

        (default: "auto")
    kb_path
        The path of the user-specified knowledge base. At the current stage, it must be in the user's local directory and follow the format we propose. By default, the knowledge base that Dataprep provides is used.

        (default: "default")
    """
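To make the difference between `input_format = "auto"` and an explicit format concrete, here is a rough per-value sketch. Everything in it is an assumption made for illustration: the helper name `validate_value`, the format names "name" and "alpha-2", and the contents of `TOY_KB` do not come from the proposal.

```python
from typing import Dict, Set

# Hypothetical knowledge base: format name -> set of valid lower-cased values.
TOY_KB: Dict[str, Set[str]] = {
    "name": {"canada", "united states"},
    "alpha-2": {"ca", "us"},
}


def validate_value(val: object, input_format: str = "auto") -> bool:
    """Return True if `val` is a known value in the requested input format."""
    if not isinstance(val, str):
        return False
    key = val.strip().lower()
    if input_format == "auto":
        # 'auto' accepts a match in any known format.
        return any(key in known for known in TOY_KB.values())
    if input_format not in TOY_KB:
        raise ValueError(f"unknown input_format {input_format!r}")
    return key in TOY_KB[input_format]


results = [validate_value(v) for v in ["Canada", "us", "Atlantis", None]]
```

A column-level validator could plausibly map such a function over each cell, which would match the "for each cell, return True or False" behaviour described in the docstring.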

def _format(
    val: Optional[str],
    output_format: str,
) -> Optional[str]:
    """
    Reformat an enumerable text string into the requested output format.

    Parameters
    ----------
    val
        The value of the enumerable text string.
    output_format
        The output format of the standardized enumerable text string.

        (default: )
    """
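One way such a formatting helper might work is to index every known surface form to a record and then read the requested field from that record. The sketch below assumes a record-per-entity knowledge base; `TOY_RECORDS`, `format_value`, and the format names are hypothetical and are not defined by the proposal.

```python
from typing import Dict, Optional

# Hypothetical knowledge base: each record lists one entity in every supported format.
TOY_RECORDS = [
    {"name": "Canada", "alpha-2": "CA", "alpha-3": "CAN"},
    {"name": "United States", "alpha-2": "US", "alpha-3": "USA"},
]

# Index every surface form (lower-cased) to its record for constant-time lookup.
_INDEX: Dict[str, Dict[str, str]] = {
    form.lower(): record for record in TOY_RECORDS for form in record.values()
}


def format_value(val: Optional[str], output_format: str) -> Optional[str]:
    """Return `val` rewritten in `output_format`, or None if it is not recognized."""
    if val is None:
        return None
    record = _INDEX.get(val.strip().lower())
    if record is None:
        return None
    if output_format not in record:
        raise ValueError(f"unknown output_format {output_format!r}")
    return record[output_format]
```

Returning None for unrecognized input leaves the choice between NaN, the original value, and an exception to the caller, which is where the `errors` parameter of the cleaning function would apply.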

def _load_kb(
    kb_path: str = "default",
) -> Any:
    """
    Load the knowledge base from the specified path.

    Parameters
    ----------
    kb_path
        The path of the user-specified knowledge base. At the current stage, it must be in the user's local directory and follow the format we propose. By default, the knowledge base that Dataprep provides is used.

        (default: "default")
    """
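The proposal does not say what the knowledge base file looks like, so the following is only a sketch under a loud assumption: the file is taken to be JSON. The names `load_kb` and `_DEFAULT_KB`, and the built-in entries, are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Hypothetical built-in knowledge base used when kb_path == "default".
_DEFAULT_KB: Dict[str, str] = {"ca": "Canada", "us": "United States"}


def load_kb(kb_path: str = "default") -> Any:
    """Return the built-in knowledge base, or parse a user-supplied JSON file."""
    if kb_path == "default":
        return dict(_DEFAULT_KB)  # copy so callers cannot mutate the built-in data
    path = Path(kb_path).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"knowledge base not found: {path}")
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

Treating "default" as a sentinel value keeps the signature identical for built-in and user-supplied knowledge bases, at the cost of making a local file literally named "default" unreachable unless it is given with a directory prefix.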

Implementation-level Actions

  • [ ] Implement built-in cleaning functions.
  • [ ] Implement cleaning pipeline.
  • [ ] Add support for user-defined functions.

Additional Tasks

  • [ ] This task is put into the correct pipeline (Development Backlog or In Progress).
  • [ ] The label of this task is set correctly.
  • [ ] The issue is assigned to the correct person.
  • [ ] The issue is linked to related Epic.
  • [ ] The documentation is changed accordingly.
  • [ ] Tests are added accordingly.

qidanrui, Jun 15 '21 03:06