Feature Proposal: clean_language functionality in clean module
Summary
Implement a clean_language() function to clean a DataFrame column containing language data.
Design-level Explanation Actions
- [x] Investigate prior art solutions for cleaning and validating language.
- [x] Follow the unified API design for enumerated text types.
Design-level Explanation
Proposed function signature for clean_language():
def clean_language(
    df: Union[pd.DataFrame, dd.DataFrame],
    col: str,
    input_format: Union[str, Tuple[str, ...]] = "auto",
    output_format: str = "name",
    kb_path: str = "default",
    inplace: bool = False,
    errors: str = "coerce",
    progress: bool = True,
) -> pd.DataFrame:
    """
    Clean language type data in a DataFrame column.

    Parameters
    ----------
    df
        A pandas or Dask DataFrame containing the data to be cleaned.
    col
        The name of the column containing data of language type.
    input_format
        The ISO 639 input format of the language.
            - 'auto': infer the input format
            - 'name': language name ('English')
            - 'alpha-2': alpha-2 code ('en')
            - 'alpha-3': alpha-3 code ('eng')

        Can also be a tuple containing any combination of input formats.
        For example, to clean a column containing names and alpha-2
        codes, set input_format to ('name', 'alpha-2').

        (default: 'auto')
    output_format
        The desired ISO 639 format of the language.
            - 'name': language name ('English')
            - 'alpha-2': alpha-2 code ('en')
            - 'alpha-3': alpha-3 code ('eng')

        (default: 'name')
    kb_path
        The path of the user-specified knowledge base.
        At the current stage, it should be in the user's local directory
        and follow the format we propose.

        (default: 'default')
    inplace
        If True, delete the column containing the data that was cleaned.
        Otherwise, keep the original column.

        (default: False)
    errors
        How to handle parsing errors.
            - 'coerce': invalid parsing will be set to NaN.
            - 'ignore': invalid parsing will return the input.
            - 'raise': invalid parsing will raise an exception.

        (default: 'coerce')
    progress
        If True, display a progress bar.

        (default: True)

    Examples
    --------
    Clean a column of language data.

    >>> df = pd.DataFrame({'language': ['en', 'ara']})
    >>> clean_language(df, 'language')
      language language_clean
    0       en        English
    1      ara         Arabic
    """
Proposed function signature for validate_language():
def validate_language(
    df: Union[str, pd.Series, dd.Series, pd.DataFrame, dd.DataFrame],
    col: str,
    input_format: Union[str, Tuple[str, ...]] = "auto",
    kb_path: str = "default",
) -> Union[bool, pd.Series, pd.DataFrame]:
    """
    Validate language type data in a DataFrame column. For each cell, return True or False.

    Parameters
    ----------
    df
        A string, pandas or Dask Series, or pandas or Dask DataFrame
        containing the data to be validated.
    col
        The name of the column to be validated.
    input_format
        The ISO 639 input format of the language.
            - 'auto': infer the input format
            - 'name': language name ('English')
            - 'alpha-2': alpha-2 code ('en')
            - 'alpha-3': alpha-3 code ('eng')

        Can also be a tuple containing any combination of input formats.
        For example, to validate a column containing names and alpha-2
        codes, set input_format to ('name', 'alpha-2').

        (default: 'auto')
    kb_path
        The path of the user-specified knowledge base.
        At the current stage, it should be in the user's local directory
        and follow the format we propose.

        (default: 'default')
    """
The following are two function signatures mentioned in the unified API design to implement clean and validate functions:
def _format(
    val: Optional[str],
    output_format: str,
) -> Optional[str]:
    """
    Reformat a language string into the requested output format.

    Parameters
    ----------
    val
        The language string to reformat.
    output_format
        The output format of the standardized language string.
    """

def _load_kb(
    kb_path: str = "default",
) -> Any:
    """
    Load the knowledge base from the specified path.

    Parameters
    ----------
    kb_path
        The path of the user-specified knowledge base.
        At the current stage, it should be in the user's local directory
        and follow the format we propose.

        (default: 'default')
    """
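To illustrate how a per-value helper in the spirit of _format could interact with the errors parameter of clean_language, here is a minimal sketch. The name format_value, the RECORDS table, and the idea of passing errors to the helper are assumptions made for illustration only; they are not part of the proposed API.

```python
import math
from typing import Any, Dict, List

# Tiny illustrative knowledge base (hypothetical); a real one would be
# produced by _load_kb().
RECORDS: List[Dict[str, str]] = [
    {"name": "English", "alpha-2": "en", "alpha-3": "eng"},
    {"name": "Arabic", "alpha-2": "ar", "alpha-3": "ara"},
]


def format_value(val: Any, output_format: str = "name", errors: str = "coerce") -> Any:
    """Convert one value to output_format, applying the errors policy."""
    key = str(val).strip().lower()
    for record in RECORDS:
        # Match the value against every known representation of the language.
        if key in (v.lower() for v in record.values()):
            return record[output_format]
    if errors == "raise":
        raise ValueError(f"unable to parse value {val!r}")
    # 'ignore' returns the input unchanged, 'coerce' returns NaN.
    return val if errors == "ignore" else math.nan
```

With this sketch, format_value("ara", "alpha-2") gives "ar", while an unknown value such as "xx" becomes NaN, is returned unchanged, or raises, depending on errors.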
Implementation-level Explanation
The implementation of clean_language will follow that of clean_country, but will omit fuzzy matching, since no regexes are available for languages.
Input and output format
clean_language follows the formats of the ISO 639 language codes. A language can be converted to and from these formats:
- "name": the language name,
- "alpha-2": two letter abbreviation,
- "alpha-3": three letter abbreviation.
The input_format can also be "auto", which means the input format will be inferred from the input.
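One possible way to infer the format, sketched below, is to try each candidate format in turn and return the first one whose lookup table contains the value. The LOOKUP sets and the infer_format helper are hypothetical names used only for this illustration; the proposal does not prescribe an inference strategy.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical lookup sets, one per format, holding lower-cased entries.
LOOKUP: Dict[str, Set[str]] = {
    "name": {"english", "arabic", "ewe"},
    "alpha-2": {"en", "ar", "ee"},
    "alpha-3": {"eng", "ara", "ewe"},
}


def infer_format(
    val: str,
    candidates: Tuple[str, ...] = ("alpha-2", "alpha-3", "name"),
) -> Optional[str]:
    """Return the first candidate format whose table contains val, else None."""
    key = val.strip().lower()
    for fmt in candidates:
        if key in LOOKUP[fmt]:
            return fmt
    return None
```

Some values are ambiguous: "ewe" is both an alpha-3 code and a language name, so the result depends on the order of candidates. Passing a tuple such as ("name",) restricts the search in the same way a tuple-valued input_format would.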
Knowledge base
The default knowledge base is the database in pycountry (please refer to iso639-x.jso in its databases folder). A knowledge base can be a JSON file, which can be held in Python as a dict.
For the implementation, instead of calling the API provided by pycountry directly, we will load its database and implement similar functionality ourselves, so that knowledge bases supplied by users can be supported. Users who want to replace the default knowledge base need to follow the same format as the default one.
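As a rough sketch of loading such a knowledge base, the snippet below parses a JSON document and indexes every record by each of its identifiers. The layout shown (one top-level key mapping to a list of records with "alpha_2", "alpha_3" and "name" fields, where "alpha_2" may be missing) is an assumption about the file format, and build_lookup is a hypothetical helper, not the proposed _load_kb.

```python
import json
from typing import Dict

# Assumed layout of the knowledge base file, inlined here as a string so
# that the example is self-contained.
SAMPLE_KB = """
{"639-3": [
  {"alpha_2": "en", "alpha_3": "eng", "name": "English"},
  {"alpha_2": "ar", "alpha_3": "ara", "name": "Arabic"},
  {"alpha_3": "arz", "name": "Egyptian Arabic"}
]}
"""


def build_lookup(raw: str) -> Dict[str, Dict[str, str]]:
    """Index every record by each of its lower-cased identifiers."""
    data = json.loads(raw)
    # Take the list of records stored under the single top-level key.
    records = next(iter(data.values()))
    lookup: Dict[str, Dict[str, str]] = {}
    for record in records:
        for field in ("alpha_2", "alpha_3", "name"):
            if field in record:  # not every language has an alpha-2 code
                lookup[record[field].lower()] = record
    return lookup


kb = build_lookup(SAMPLE_KB)
```

A user-supplied file in the same layout could be read with open(kb_path) and passed through the same function, which is what makes the knowledge base replaceable.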
Rationale and Alternatives
Compared to pycountry, clean_language:
- Can automatically detect the format of the input (with input_format="auto"), while pycountry requires users to specify it.
- Allows users to replace the default knowledge base, and is therefore more flexible.
Prior Art
Future Possibilities
Add fuzzy matching functionality by calculating edit distance.
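A minimal sketch of what edit-distance matching could look like is given below. It uses the standard Levenshtein dynamic-programming recurrence; the helper names edit_distance and closest_name, and the max_distance threshold of 2, are assumptions chosen for illustration rather than a planned design.

```python
from typing import Optional, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row of the DP table at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def closest_name(val: str, names: Sequence[str], max_distance: int = 2) -> Optional[str]:
    """Return the name nearest to val, or None if nothing is close enough."""
    key = val.strip().lower()
    best = min(names, key=lambda name: edit_distance(key, name.lower()))
    return best if edit_distance(key, best.lower()) <= max_distance else None
```

For example, closest_name("Englsh", ["English", "Arabic"]) returns "English", while a value with no near match returns None and could then be handled by the errors policy.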
Implementation-level Actions
- [x] Implement the function.
- [x] Test on real world datasets.
- [x] Test the function by changing knowledge bases.
Additional Tasks
- [x] This task is put into the correct pipeline (Development Backlog or In Progress).
- [x] The label of this task is set correctly.
- [x] The issue is assigned to the correct person.
- [ ] The issue is linked to the related Epic.
- [x] The documentation is changed accordingly.
- [x] Tests are added accordingly.