
Feature Proposal: A unified API design for number types in clean module

Open qidanrui opened this issue 3 years ago • 0 comments

Summary

A unified API design for number types. Here, a number type means a data type whose values are composed of digits (0-9), letters (a-z and A-Z), and separators (such as ".", "," and "-").

Design-level Explanation Actions

  • [X] Investigate prior art solutions for cleaning text data.
  • [X] Decide which data cleaning operations to support.
  • [X] Decide how to specify operations (e.g. function parameters vs. piping methods)

Design-level Explanation

Proposed function signatures for any data type that belongs to the number types. Here, xxx stands for the name of the data type. For each type, only the cleaning function and the validation function are exposed to users.

from typing import Optional, Union

import dask.dataframe as dd
import pandas as pd


def clean_xxx(
    df: Union[pd.DataFrame, dd.DataFrame],
    col: str,
    output_format: str = "standard",
    extract_info: bool = False,
    inplace: bool = False,
    errors: str = "coerce",
    progress: bool = True,
) -> pd.DataFrame:
    """
    Clean xxx type data in a DataFrame column.

    Parameters
    ----------
        df
            A pandas or Dask DataFrame containing the data to be cleaned.
        col
            The name of the column containing data of xxx type.
        output_format
            The output format of the standardized number string.
            If output_format = 'compact', return the string without any separators.
            If output_format = 'standard', return the string with proper separators.

            (default: "standard")
        extract_info
            If True, each component derived from the number string is put into its own column.
            For example, if extract_info = True for a credit card, there will be four columns:
            - issue_number
            - bank_number
            - account_number
            - check_digit

            (default: False)
        inplace
            If True, delete the column containing the data that was cleaned.
            Otherwise, keep the original column.

            (default: False)
        errors
            How to handle parsing errors.
            - 'coerce': invalid parsing will be set to NaN.
            - 'ignore': invalid parsing will return the input.
            - 'raise': invalid parsing will raise an exception.

            (default: 'coerce')
        progress
            If True, display a progress bar.

            (default: True)

    Examples
    --------
    Clean a column of xxx data.

    >>> df = pd.DataFrame({content of DataFrame})
    >>> clean_xxx(df, 'col_name')
           col_name
    0  cleaned_val0
    1  cleaned_val1
    """

def validate_xxx(
    df: Union[str, pd.Series, dd.Series, pd.DataFrame, dd.DataFrame],
    col: str,
) -> Union[bool, pd.Series, pd.DataFrame]:
    """
    Validate xxx type data. For each value (or each cell), return True or False.

    Parameters
    ----------
        df
            A string, a pandas or Dask Series, or a pandas or Dask DataFrame
            containing the data to be validated.
        col
            The name of the column to be validated (relevant when df is a DataFrame).
    """

def _compact(
    val: str,
) -> Optional[str]:
    """
    Compact a number string by removing all separators.

    Parameters
    ----------
        val
            The number string to compact.
    """

def _format(
    val: str,
    output_format: str = "standard",
) -> Optional[str]:
    """
    Reformat a number string with proper separators.

    Parameters
    ----------
        val
            The number string to reformat.
        output_format
            If output_format = 'compact', call the `_compact` function and return the string without any separators.
            If output_format = 'standard', return the string with proper separators.
    """

Implementation-level Explanation

The implementation builds on the basic functionality provided by python-stdnum.

A typical module in python-stdnum consists of the following parts:

  • validate(number): checks whether the format of the input number string is valid. If it is, returns the compacted number string (without separators). If not, raises a validation error such as InvalidComponent.
  • is_valid(number): a wrapper around validate(number). Returns False if validate(number) raises an error, and True otherwise.
  • compact(number): removes all separators from the number string and returns the result.
  • format(number): adds the proper separators to bring the number string into its official format.
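
The four-function layout above can be mirrored in a few lines. The sketch below is not python-stdnum code; it is a standalone imitation of the pattern that uses the Luhn checksum as the validity rule and a hypothetical `ValidationError` class. Grouping the digits in fours inside `format` is an arbitrary choice for illustration.

```python
class ValidationError(ValueError):
    """Hypothetical stand-in for the exception a validate() function raises."""


def compact(number: str) -> str:
    """Remove separators (spaces, dots, hyphens) from the number string."""
    return "".join(ch for ch in number if ch not in " .-")


def _luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def validate(number: str) -> str:
    """Return the compacted number if valid, otherwise raise ValidationError."""
    number = compact(number)
    if not (number.isascii() and number.isdigit()):
        raise ValidationError("number may only contain digits and separators")
    if not _luhn_ok(number):
        raise ValidationError("checksum does not match")
    return number


def is_valid(number: str) -> bool:
    """Wrap validate(): True if it succeeds, False if it raises."""
    try:
        validate(number)
    except ValidationError:
        return False
    return True


def format(number: str) -> str:  # shadows the builtin, mirroring the naming in the list above
    """Insert a space after every four characters of the compacted number."""
    number = compact(number)
    return " ".join(number[i:i + 4] for i in range(0, len(number), 4))
```

With these definitions, `is_valid("7992-7398-713")` returns True, and changing the final digit to 0 makes it return False.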

Our implementation for the data types in this library calls the validation functions of the supported types in sequence. If is_valid() returns True, we try to match the keyword that the user supplied. The final standardized results are then generated according to the number of output columns the user expects.
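
One possible reading of "in sequence" is a registry of per-type validators that is scanned until one accepts the value. The sketch below assumes that reading; the two toy types, their patterns, and the names `VALIDATORS` and `detect_type` are hypothetical and only serve to show the control flow.

```python
import re
from typing import Callable, Dict, Optional


# Hypothetical per-type validators; a real implementation would plug in
# the actual is_valid() check of each supported number type.
def _is_valid_type_a(val: str) -> bool:
    return re.fullmatch(r"\d{4}-\d{4}", val) is not None


def _is_valid_type_b(val: str) -> bool:
    return re.fullmatch(r"[A-Z]{2}\d{6}", val) is not None


# Dicts preserve insertion order, so validators are tried in this order.
VALIDATORS: Dict[str, Callable[[str], bool]] = {
    "type_a": _is_valid_type_a,
    "type_b": _is_valid_type_b,
}


def detect_type(val: str) -> Optional[str]:
    """Return the name of the first type whose validator accepts val, else None."""
    for name, is_valid in VALIDATORS.items():
        if is_valid(val):
            return name
    return None
```

The keyword matching and the generation of output columns described above would happen after this step and are not modeled here.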

Rationale and Alternatives

Compared to python-stdnum, our clean module covers the functionality it already implements and will support more types in the future.

Prior Art

python-stdnum: A Python library that aims to provide functions to handle, parse and validate standard numbers.

Future Possibilities

Inject a knowledge database from the web.

Implementation-level Actions

  • [ ] Implement built-in cleaning functions.
  • [ ] Implement cleaning pipeline.
  • [ ] Add support for user-defined functions.

Additional Tasks

  • [ ] This task is put into the correct pipeline (Development Backlog or In Progress).
  • [ ] The label of this task is set correctly.
  • [ ] The issue is assigned to the correct person.
  • [ ] The issue is linked to the related Epic.
  • [ ] The documentation is changed accordingly.
  • [ ] Tests are added accordingly.

Posted by qidanrui on Jun 08 '21 07:06