Askar Bozcan comments

Results 14 comments of Askar Bozcan

Add fastText Turkish vectorization

There are several ways to implement vectorization, each with its own pros and cons. Below I am going to list some of the paths we can take, so that we can discuss them....

Spelling Correction Test Results

Note to self: Either remove spelling correction entirely or revamp it completely with techniques that work better.

Getting BERT embeddings does not handle sequences longer than 512 tokens (BERT's maximum sequence length)

Added an example to reproduce the bug.
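
One possible workaround for the 512-token limit is to split a long sequence of token ids into overlapping windows and embed each window separately. The sketch below is only an illustration and is not necessarily the fix adopted for this issue: `chunk_token_ids`, `max_len`, and `stride` are hypothetical names, and in practice the window size would need to leave room for special tokens such as [CLS] and [SEP].

```python
from typing import List


def chunk_token_ids(
    token_ids: List[int], max_len: int = 512, stride: int = 256
) -> List[List[int]]:
    """Split token ids into windows of at most max_len ids.

    Consecutive windows start `stride` ids apart, so they overlap
    by max_len - stride ids. Hypothetical helper, not project code.
    """
    if max_len <= 0 or not 0 < stride <= max_len:
        raise ValueError("need max_len > 0 and 0 < stride <= max_len")

    # Short inputs fit in a single window.
    if len(token_ids) <= max_len:
        return [list(token_ids)]

    chunks: List[List[int]] = []
    start = 0
    while True:
        chunks.append(list(token_ids[start:start + max_len]))
        # Stop once the current window reaches the end of the input.
        if start + max_len >= len(token_ids):
            break
        start += stride
    return chunks


if __name__ == "__main__":
    windows = chunk_token_ids(list(range(1000)))
    print([len(w) for w in windows])
```

How the per-window embeddings are combined afterwards (for example by averaging) is a separate design choice that this sketch leaves open.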

Character Repetition Correction

See #190

Add spelling correction module [resolves #190]

Note to self: Modify the symspellpy distance calculation in such a way that changing Turkish diacritic characters to their English counterparts (ü -> u, ç -> c) and vice versa (u -> ü, c...
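
The note is cut off, so the intended effect on the distance is not stated. One plausible reading is that swapping a Turkish letter with its ASCII look-alike should cost less than an ordinary edit. Below is a minimal standalone sketch under that assumption. It does not use or modify symspellpy; `weighted_distance`, `DIACRITIC_PAIRS`, and the cost of 0.5 are hypothetical, and only the ü/u and ç/c pairs come from the note itself.

```python
from typing import FrozenSet, Set

# Assumption: ü/u and ç/c are from the note; the remaining pairs are the
# other lowercase Turkish letters that have an ASCII look-alike.
DIACRITIC_PAIRS: Set[FrozenSet[str]] = {
    frozenset(pair)
    for pair in [("ü", "u"), ("ç", "c"), ("ö", "o"),
                 ("ş", "s"), ("ğ", "g"), ("ı", "i")]
}


def weighted_distance(a: str, b: str, diacritic_cost: float = 0.5) -> float:
    """Levenshtein distance with cheaper diacritic substitutions.

    Insertions, deletions, and ordinary substitutions cost 1.0;
    substituting within a pair from DIACRITIC_PAIRS costs diacritic_cost.
    Only lowercase input is handled in this sketch.
    """
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                sub = 0.0
            elif frozenset((ca, cb)) in DIACRITIC_PAIRS:
                sub = diacritic_cost
            else:
                sub = 1.0
            curr.append(min(prev[j] + 1.0,        # delete ca
                            curr[j - 1] + 1.0,    # insert cb
                            prev[j - 1] + sub))   # substitute
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(weighted_distance("cok", "çok"))
    print(weighted_distance("kitap", "kitab"))
```

With this weighting, a word typed on an English keyboard stays closer to its correctly spelled Turkish form than to an unrelated word at the same raw edit distance.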

Add spelling correction module [resolves #190]

As an extra note, see this: https://towardsdatascience.com/spelling-correction-how-to-make-an-accurate-and-fast-corrector-dc6d0bcbba5f

Increase Test Coverage

ToDo: After strict typing is enforced.

Users can use their own tokenizer.

Note to self: Tokenizer interface should be easily extendable so that users can add their custom tokenizers if they so desire.
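
As a rough illustration of what an extendable tokenizer interface could look like, the sketch below uses an abstract base class plus a small name-based registry. All names here (`Tokenizer`, `register_tokenizer`, `get_tokenizer`, `WhitespaceTokenizer`) are hypothetical and are not taken from the project's code.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type


class Tokenizer(ABC):
    """Hypothetical base class that custom tokenizers would subclass."""

    @abstractmethod
    def tokenize(self, text: str) -> List[str]:
        """Return the tokens of `text` in order."""


_REGISTRY: Dict[str, Type[Tokenizer]] = {}


def register_tokenizer(name: str) -> Callable[[Type[Tokenizer]], Type[Tokenizer]]:
    """Class decorator that makes a tokenizer available under `name`."""
    def decorator(cls: Type[Tokenizer]) -> Type[Tokenizer]:
        _REGISTRY[name] = cls
        return cls
    return decorator


def get_tokenizer(name: str) -> Tokenizer:
    """Instantiate a registered tokenizer; raises KeyError if unknown."""
    return _REGISTRY[name]()


# A user-defined tokenizer only needs to subclass and register itself.
@register_tokenizer("whitespace")
class WhitespaceTokenizer(Tokenizer):
    def tokenize(self, text: str) -> List[str]:
        return text.split()


if __name__ == "__main__":
    print(get_tokenizer("whitespace").tokenize("Merhaba dünya"))
```

The registry keeps library code independent of any concrete tokenizer: callers ask for a tokenizer by name, and user code can add new ones without touching the library.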

Implement IR based Supervised Sentence Ranker

ToDo: More data for supervised ranker summarizers.

Adding Optional Text Preprocessing Steps

Note to self: Should be done in a general way, allowing users to add their own custom preprocessing steps if necessary.
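
One general way to support optional and user-defined preprocessing is to treat every step as a plain function from string to string and run the steps in order. The sketch below illustrates that idea only; `Preprocessor`, `add_step`, and `collapse_whitespace` are hypothetical names, not the project's API.

```python
import re
from typing import Callable, Iterable, List

# A preprocessing step is any callable that maps text to text.
Step = Callable[[str], str]


def collapse_whitespace(text: str) -> str:
    """Replace runs of whitespace with one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


class Preprocessor:
    """Applies a configurable sequence of steps in order (hypothetical)."""

    def __init__(self, steps: Iterable[Step] = ()) -> None:
        self.steps: List[Step] = list(steps)

    def add_step(self, step: Step) -> "Preprocessor":
        """Append a custom step; returns self so calls can be chained."""
        self.steps.append(step)
        return self

    def __call__(self, text: str) -> str:
        for step in self.steps:
            text = step(text)
        return text


if __name__ == "__main__":
    # Built-in step plus a user-supplied one that drops exclamation marks.
    pipeline = Preprocessor([collapse_whitespace])
    pipeline.add_step(lambda t: t.replace("!", ""))
    print(pipeline("  Merhaba   dünya!  "))
```

Because a step is just a callable, users can add their own without subclassing anything, and an empty step list leaves the text unchanged.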