Feature Proposal: clean_language functionality in clean module

Open NoirTree opened this issue 3 years ago • 0 comments

Summary

Implement a clean_language() function to clean a DataFrame column containing language data.

Design-level Explanation Actions

  • [x] Investigate prior art solutions for cleaning and validating language.
  • [x] Follow the unified API design for enumerated text types.

Design-level Explanation

Proposed function signature for clean_language():

def clean_language(
        df: Union[pd.DataFrame, dd.DataFrame],
        col: str,
        input_format: Union[str, Tuple[str, ...]] = "auto",
        output_format: str = "name",
        kb_path: str = "default",
        inplace: bool = False,
        errors: str = "coerce",
        progress: bool = True,
) -> pd.DataFrame:
    """
        Clean language type data in a DataFrame column.

        Parameters
        ----------
            df
                A pandas or Dask DataFrame containing the data to be cleaned.
            col
                The name of the column containing data of language type.
            input_format
                The ISO 639 input format of the language.
                    - 'auto': infer the input format
                    - 'name': language name ('English')
                    - 'alpha-2': alpha-2 code ('en')
                    - 'alpha-3': alpha-3 code ('eng')

                Can also be a tuple containing any combination of input formats,
                for example to clean a column containing name and alpha-2
                codes set input_format to ('name', 'alpha-2').

                (default: 'auto')
            output_format
                The desired ISO 639 format of the language.
                    - 'name': language name ('English')
                    - 'alpha-2': alpha-2 code ('en')
                    - 'alpha-3': alpha-3 code ('eng')

                (default: 'name')
            kb_path
                The path to a user-specified knowledge base.
                At the current stage, it should be a file in the user's local
                directory that follows the format we propose.

                (default: 'default')
            inplace
               If True, delete the column containing the data that was cleaned.
               Otherwise, keep the original column.

               (default: False)
            errors
                How to handle parsing errors.
                - 'coerce': invalid parsing will be set to NaN.
                - 'ignore': invalid parsing will return the input.
                - 'raise': invalid parsing will raise an exception.

                (default: 'coerce')
            progress
                If True, display a progress bar.

                (default: True)

        Examples
        --------
        Clean a column of language data.

        >>> df = pd.DataFrame({'language': ['en', 'ara']})
        >>> clean_language(df, 'language')
            language    language_clean
        0   en          English
        1   ara         Arabic
    """

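The three values of `errors` described in the docstring can be illustrated per cell with a minimal sketch. The names `TOY_KB` and `convert_with_policy` are hypothetical and not part of the proposal, the toy lookup covers alpha-2 codes only, and `float("nan")` stands in for the NaN a pandas column would hold.

```python
from typing import Dict, Union

# Hypothetical toy knowledge base: lowercase alpha-2 code -> language name.
TOY_KB: Dict[str, str] = {"en": "English", "ar": "Arabic"}


def convert_with_policy(val: str, errors: str = "coerce") -> Union[str, float]:
    """Convert an alpha-2 code to a name, applying the `errors` policy on failure."""
    if errors not in ("coerce", "ignore", "raise"):
        raise ValueError(f"unknown errors policy: {errors!r}")
    key = str(val).strip().lower()
    if key in TOY_KB:
        return TOY_KB[key]
    if errors == "raise":
        # 'raise': invalid parsing raises an exception.
        raise ValueError(f"unable to parse value {val!r}")
    if errors == "ignore":
        # 'ignore': invalid parsing returns the input unchanged.
        return val
    # 'coerce': invalid parsing is set to NaN.
    return float("nan")
```

In a real implementation this per-cell logic would presumably be applied across the column; the sketch leaves that part out.
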
Proposed function signature for validate_language():

def validate_language(
        df: Union[str, pd.Series, dd.Series, pd.DataFrame, dd.DataFrame],
        col: str,
        input_format: Union[str, Tuple[str, ...]] = "auto",
        kb_path: str = "default"
) -> Union[bool, pd.Series, pd.DataFrame]:
    """
        Validate language type data in a DataFrame column. For each cell, return True or False.

        Parameters
        ----------
        df
            A string, a pandas or Dask Series, or a pandas or Dask DataFrame
            containing the data to be validated.
        col
            The name of the column to be validated.
        input_format
            The ISO 639 input format of the language.
                - 'auto': infer the input format
                - 'name': language name ('English')
                - 'alpha-2': alpha-2 code ('en')
                - 'alpha-3': alpha-3 code ('eng')

            Can also be a tuple containing any combination of input formats,
            for example to validate a column containing name and alpha-2
            codes set input_format to ('name', 'alpha-2').

            (default: 'auto')
        kb_path
            The path to a user-specified knowledge base.
            At the current stage, it should be a file in the user's local
            directory that follows the format we propose.

            (default: "default")
    """

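A minimal sketch of the per-cell check behind validate_language, under stated assumptions: `TOY_KB` and `is_valid_language` are hypothetical names, matching is case-insensitive, and "auto" is treated as "accept any known format", which is one possible reading of the proposal rather than a confirmed behaviour.

```python
from typing import Dict, Set, Tuple, Union

# Hypothetical toy knowledge base: the known values for each format.
TOY_KB: Dict[str, Set[str]] = {
    "name": {"english", "arabic"},
    "alpha-2": {"en", "ar"},
    "alpha-3": {"eng", "ara"},
}


def is_valid_language(
    val: object,
    input_format: Union[str, Tuple[str, ...]] = "auto",
) -> bool:
    """Return True if `val` is a known language in one of the allowed formats."""
    if not isinstance(val, str):
        return False
    if input_format == "auto":
        formats = tuple(TOY_KB)  # assumption: "auto" tries every format
    elif isinstance(input_format, str):
        formats = (input_format,)
    else:
        formats = tuple(input_format)
    unknown = [fmt for fmt in formats if fmt not in TOY_KB]
    if unknown:
        raise ValueError(f"unsupported input_format: {unknown}")
    key = val.strip().lower()
    return any(key in TOY_KB[fmt] for fmt in formats)
```

For a Series, the same check could be mapped over the values, for example with `Series.apply`.
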
The unified API design also specifies the following two helper functions, which are used to implement the clean and validate functions:

def _format(
        val: Optional[str],
        output_format: str,
) -> Optional[str]:
    """
       Reformat a language string with proper output format.

       Parameters
       ----------
       val
            The value of language string.
       output_format
            The output format of standardized language string.
    """


def _load_kb(
        kb_path: str = "default",
) -> Any:
    """
        Load knowledge base from specified path.

        Parameters
        ----------
        kb_path
                The path to a user-specified knowledge base.
                At the current stage, it should be a file in the user's local
                directory that follows the format we propose.

                (default: "default")
    """

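One way the two helpers could fit together is sketched below. Everything here is an assumption for illustration: the names `load_kb_sketch` and `format_sketch`, the record layout (a JSON list of objects with `name`, `alpha_2` and `alpha_3` keys), and the extra `kb` argument, which the proposed `_format` signature does not have.

```python
import json
from typing import Dict, List, Optional

# Hypothetical built-in records standing in for the default knowledge base.
_DEFAULT_RECORDS: List[Dict[str, str]] = [
    {"name": "English", "alpha_2": "en", "alpha_3": "eng"},
    {"name": "Arabic", "alpha_2": "ar", "alpha_3": "ara"},
]


def load_kb_sketch(kb_path: str = "default") -> List[Dict[str, str]]:
    """Return the built-in records, or read a JSON list of records from `kb_path`."""
    if kb_path == "default":
        return list(_DEFAULT_RECORDS)
    with open(kb_path, encoding="utf-8") as fh:
        return json.load(fh)


def format_sketch(
    val: Optional[str],
    output_format: str,
    kb: List[Dict[str, str]],
) -> Optional[str]:
    """Find the record matching `val` in any field and return it in `output_format`."""
    if val is None:
        return None
    key = val.strip().lower()
    field = output_format.replace("-", "_")  # "alpha-2" -> "alpha_2"
    for record in kb:
        if key in (str(v).lower() for v in record.values()):
            return record.get(field)
    return None
```

The linear scan keeps the sketch short; a real implementation would more likely build per-format lookup tables once after loading.
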
Implementation-level Explanation

The implementation of clean_language will follow that of clean_country, but will omit fuzzy matching, since no regexes are available for languages.

Input and output format

clean_language follows the formats of the ISO 639 language codes. A language can be converted to and from these formats:

  1. "name": the language name,
  2. "alpha-2": two letter abbreviation,
  3. "alpha-3": three letter abbreviation.

The input_format can also be "auto", which means the input format will be inferred based on the input.
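
Format inference and conversion could be sketched as follows. The names (`TOY_RECORDS`, `build_index`, `infer_format`, `convert`) are hypothetical, and the inference rule, a case-insensitive lookup that tries the code formats before the name, is an assumption rather than the behaviour fixed by this proposal.

```python
from typing import Dict, List, Optional

FORMATS = ("name", "alpha-2", "alpha-3")

# Hypothetical toy records; a real knowledge base would hold all ISO 639 entries.
TOY_RECORDS: List[Dict[str, str]] = [
    {"name": "English", "alpha-2": "en", "alpha-3": "eng"},
    {"name": "Arabic", "alpha-2": "ar", "alpha-3": "ara"},
]


def build_index(records: List[Dict[str, str]]) -> Dict[str, Dict[str, Dict[str, str]]]:
    """Build one lowercase lookup table per format."""
    index: Dict[str, Dict[str, Dict[str, str]]] = {fmt: {} for fmt in FORMATS}
    for record in records:
        for fmt in FORMATS:
            index[fmt][record[fmt].lower()] = record
    return index


INDEX = build_index(TOY_RECORDS)


def infer_format(val: str) -> Optional[str]:
    """Return the first format whose table contains `val`, or None if none does."""
    key = val.strip().lower()
    for fmt in ("alpha-2", "alpha-3", "name"):
        if key in INDEX[fmt]:
            return fmt
    return None


def convert(val: str, output_format: str = "name") -> Optional[str]:
    """Infer the input format of `val` and convert it to `output_format`."""
    fmt = infer_format(val)
    if fmt is None:
        return None
    return INDEX[fmt][val.strip().lower()][output_format]
```

With the toy records, `convert("ara")` gives `"Arabic"` and `convert("English", "alpha-3")` gives `"eng"`.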

Knowledge base

The default knowledge base is the database in pycountry (please refer to iso639-x.jso in its databases folder). It can be a JSON file and can be held in Python as a dict. Rather than calling the pycountry API directly, we will load its database and implement similar functionality ourselves, so that knowledge bases supplied by users are also supported. Users who want to replace the default knowledge base must follow the same format as the default one.
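
Because a user-supplied knowledge base has to match the default format, a small structural check before use may help. The sketch below assumes a list of records with required `name` and `alpha_3` keys and an optional two-letter `alpha_2` key; this layout and the name `check_kb_records` are assumptions for illustration, not the format fixed by this proposal.

```python
from typing import Any, List

# Assumption: every record needs these keys; "alpha_2" is optional.
REQUIRED_KEYS = ("name", "alpha_3")


def check_kb_records(records: Any) -> List[str]:
    """Return a list of problems found; an empty list means the records look usable."""
    if not isinstance(records, list):
        return ["knowledge base must be a list of records"]
    problems: List[str] = []
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"record {i}: not a mapping")
            continue
        for key in REQUIRED_KEYS:
            value = record.get(key)
            if not isinstance(value, str) or not value:
                problems.append(f"record {i}: missing or empty {key!r}")
        if "alpha_2" in record:
            code = record["alpha_2"]
            if not isinstance(code, str) or len(code) != 2:
                problems.append(f"record {i}: 'alpha_2' must be a two-letter string")
    return problems
```

Returning a list of problems instead of raising on the first one lets a user fix a whole file in one pass.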

Rationale and Alternatives

Compared to pycountry, clean_language:

  • Can automatically detect the format of input (with input_format="auto"), while pycountry requires users to specify it.
  • Allows users to replace the default knowledge base and is therefore more flexible.

Prior Art

  1. pycountry
  2. Implementation of clean_country function in DataPrep

Future Possibilities

Add fuzzy matching functionality by calculating edit distance.
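
A minimal sketch of what edit-distance matching could look like, using the standard Levenshtein distance. The function names and the `max_distance` threshold are hypothetical choices made for illustration.

```python
from typing import Iterable, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between `a` and `b`, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def closest_name(val: str, names: Iterable[str], max_distance: int = 2) -> Optional[str]:
    """Return the name closest to `val` within `max_distance` edits, else None."""
    key = val.strip().lower()
    best: Optional[str] = None
    best_dist = max_distance + 1
    for name in names:
        dist = edit_distance(key, name.lower())
        if dist < best_dist:
            best, best_dist = name, dist
    return best
```

The threshold keeps short inputs from matching unrelated names; a suitable value would need tuning on real data.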

Implementation-level Actions

  • [x] Implement the function.
  • [x] Test on real world datasets.
  • [x] Test the function by changing knowledge bases.

Additional Tasks

  • [x] This task is put into a correct pipeline (Development Backlog or In Progress).
  • [x] The label of this task is set correctly.
  • [x] The issue is assigned to the correct person.
  • [ ] The issue is linked to the related Epic.
  • [x] The documentation is changed accordingly.
  • [x] Tests are added accordingly.

NoirTree avatar Jun 20 '21 11:06 NoirTree