dataprep icon indicating copy to clipboard operation
dataprep copied to clipboard

Feature Proposal: A unified API design for unenumerable text types in clean module

Open qidanrui opened this issue 3 years ago • 0 comments

Summary

Unified API design for unenumerable text types. Note that unenumerable text types here indicate that the part of the data type is composed by enumerable types and the other part of it is composed by unenumerable strings. The example is URL.

Design-level Explanation Actions

  • [X] Investigate prior art solutions for cleaning text data.
  • [X] Decide which data cleaning operations to support.
  • [X] Decide how to specify operations (e.g. function parameters vs. piping methods)

Design-level Explanation

Proposed function signature for any data type belongs to unenumerable text type. Note that xxx stands for the name of data type. For implementation of each type, we only expose the cleaning function and validation function to users.

def clean_xxx(
    df: Union[pd.DataFrame, dd.DataFrame],
    col: str, 
    output_format: str,
    kb_pathes: Union[str, dict] = "default",
    split: bool = False, 
    inplace: bool = False,
    errors: str = 'coerce',
    progress: bool = True,
) -> pd.DataFrame:
"""
    Clean xxx type data in a DataFrame column.

    Parameters
    ----------
        df
            A pandas or Dask DataFrame containing the data to be cleaned.
        col
            The name of the column containing data of xxx type.
        output_format
            The output format of standardized enumerate string.

            (default: ) 
        kb_pathes
            The path of user specified knowledge base for included enumerable type functions. In current stage, it should be in the user's local directory following by the format we proposing. By default, we use the knowledge inner Dataprep.
            
            (default: "default")
        split       
            If True, each component of derived from its number string will be put into its own column.        
            For example, if split = True for URL, then there will be several columns:
            - scheme
            - host
            - domain
            - ......
            (default: False)   
        inplace        
           If True, delete the column containing the data that was cleaned. 
           Otherwise, keep the original column.        

           (default: False)    
        errors        
            How to handle parsing errors.            
            - ‘coerce’: invalid parsing will be set to NaN.           
            - ‘ignore’: invalid parsing will return the input.            
            - ‘raise’: invalid parsing will raise an exception.        
           
            (default: 'coerce')     
        progress        
            If True, display a progress bar.        
           
            (default: True)

    Examples
    --------
    Clean a column of xxx data.

    >>> df = pd.DataFrame({content of DataFrame})
    >>> clean_xxx(df, 'col_name')
           col_name
    0  cleaned_val0
    1   cleaned_val1
"""

def validate_xxx(
     df: Union[str, pd.Series, dd.Series, pd.DataFrame, dd.DataFrame],
     col: str, 
     kb_path: Union[str, dict] = "default"
) -> Union[bool, pd.Series, pd.DataFrame]:
"""    
    Validate xxx type data in a DataFrame column. For each cell, return True or False.

    Parameters
    ----------
    df
            A pandas or Dask DataFrame containing the data to be validated.
    col
            The name of the column to be validated.
    kb_pathes
            The path of user specified knowledge base for included enumerable type functions. In current stage, it should be in the user's local directory following by the format we proposing. By default, we use the knowledge inner Dataprep.
            
            (default: "default")
"""

def _split(
     val: Optional[str]
) -> Optional[dict]:
""" 
    Split unenumerate text into several parts according to rules like regex, including enumerable part and unenumerable part.
    Parameters
    ----------
    val
            The value of unenumerable text string. 
"""
 
def _format(
      val: Optional[str],
      output_format: str,
) -> Optional[str]:
""" 
   Reformat each enumerable part string with proper output format. 

   Parameters
   ----------
   val
            The value of enumerable text string.
   output_format
            The output format of standardized enumerate string.

            (default: ) 
"""

Implementation-level Actions

  • [ ] Implement built-in cleaning functions.
  • [ ] Implement cleaning pipeline.
  • [ ] Add support for user-defined functions.

Additional Tasks

  • [ ] This task is put into a correct pipeline (Development Backlog or In Progress).
  • [ ] The label of this task is setting correctly.
  • [ ] The issue is assigned to the correct person.
  • [ ] The issue is linked to related Epic.
  • [ ] The documentation is changed accordingly.
  • [ ] Tests are added accordingly.

qidanrui avatar Jun 15 '21 03:06 qidanrui