tnkeeh
Arabic cleaning, normalization and segmentation library.
tnkeeh (تنقيح) is an Arabic preprocessing library for Python. It relies on Python's re module to build quick replacement expressions for the cleaning operations shown in the examples below.
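For instance, diacritic removal can be expressed as a single regular-expression substitution. The snippet below only illustrates this regex-replacement approach; it is not the library's actual code:

import re

# illustrative sketch (not tnkeeh's implementation): strip Arabic
# diacritics (harakat) in the Unicode range U+064B-U+0652
DIACRITICS = re.compile('[\u064B-\u0652]')
print(DIACRITICS.sub('', 'مُحَمَّدٌ'))  # -> محمد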
Installation
pip install tnkeeh
Features
- Quick cleaning
- Segmentation
- Normalization
- Data splitting
Examples
Data Cleaning
import tnkeeh as tn
tn.clean_data(file_path = 'data.txt', save_path = 'cleaned_data.txt')
Arguments
- segment: uses farasa for segmentation.
- remove_diacritics: removes all diacritics.
- remove_special_chars: removes all special characters.
- remove_english: removes English letters and digits.
- normalize: unifies digits that have the same written form but different encodings.
- remove_tatweel: removes the tatweel character (ـ), which is used a lot in Arabic writing.
- remove_repeated_chars: removes characters that appear three times in sequence.
- remove_html_elements: removes HTML elements together with their attributes.
- remove_links: removes links.
- remove_twitter_meta: removes Twitter mentions, links and hashtags.
- remove_long_words: removes words longer than 15 characters.
- by_chunk: reads files in chunks of size chunk_size.
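As a sketch, several of these flags can be combined in one call; this assumes they are passed as boolean keyword arguments to clean_data, as the examples above suggest:

import tnkeeh as tn

# combine several cleaning options in a single call (the exact set of
# supported keyword arguments is assumed from the argument list above)
tn.clean_data(file_path = 'data.txt', save_path = 'cleaned_data.txt',
              remove_diacritics = True, remove_english = True,
              remove_links = True, normalize = True)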
HuggingFace datasets
import tnkeeh as tn 
from datasets import load_dataset
dataset = load_dataset('metrec')
cleaner = tn.Tnkeeh(remove_diacritics = True)
cleaned_dataset = cleaner.clean_hf_dataset(dataset, 'text')
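To verify the result, one can inspect a cleaned record; this assumes the dataset exposes a 'train' split and that clean_hf_dataset returns a datasets.DatasetDict:

# print one cleaned record (assumes a 'train' split exists and that the
# cleaned object is still a datasets.DatasetDict)
print(cleaned_dataset['train'][0]['text'])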
Data Splitting
Splits raw data into training and testing sets using split_ratio.
import tnkeeh as tn
tn.split_raw_data(data_path, split_ratio = 0.8)
Splits data and labels into training and testing sets using split_ratio.
import tnkeeh as tn
tn.split_classification_data(data_path, lbls_path, split_ratio = 0.8)
Splits input and target data with ratio split_ratio. Commonly used for translation data.
import tnkeeh as tn
tn.split_parallel_data('ar_data.txt', 'en_data.txt')
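The following sketch only illustrates what a split_ratio of 0.8 means (80% training, 20% testing); it is not how tnkeeh implements splitting:

# conceptual illustration of an 80/20 split, not tnkeeh internals
with open('data.txt', encoding = 'utf-8') as f:
    lines = f.read().splitlines()
cut = int(len(lines) * 0.8)
train_lines, test_lines = lines[:cut], lines[cut:]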
Data Reading
Reads the split data, depending on whether it was raw, labeled or parallel data.
import tnkeeh as tn
train_data, test_data = tn.read_data(mode = 0)
Arguments
- mode = 0: reads raw data.
- mode = 1: reads labeled data.
- mode = 2: reads parallel data.
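The return value is only documented above for raw data; for the other modes the sketch below keeps it as a single object rather than unpacking it, since its exact structure is an assumption:

import tnkeeh as tn

# mode = 0 returns (train_data, test_data) as shown above; for labeled
# and parallel data the structure of the return value is assumed, so it
# is captured generically here
labeled_splits = tn.read_data(mode = 1)
parallel_splits = tn.read_data(mode = 2)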
Contribution
This is an open-source project and we encourage contributions from the community.
License
MIT license.
Citation
@misc{tnkeeh2020,
  author = {Zaid Alyafeai and Maged Saeed},
  title = {tnkeeh: A Preprocessing Library for Arabic.},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ARBML/tnkeeh}}
}