Askar Bozcan issues

Results: 13 issues of Askar Bozcan

Remove datasets requiring access key

Most of the datasets are hosted on S3 and require an access key and secret, which makes them inaccessible to pretty much all users. ALL data, if provided as part of Sadedegel, should...

dataset

cleanup

Make SBD independent of Sklearn version by utilizing ONNX runtime

Currently, SBD relies on a pickled Sklearn model, which forces Sadedegel to pin Sklearn to version 0.23.1. This can cause problems when installing Sadedegel. TODO: Convert the pipeline to...

highpriority

cleanup-stay

Remove/Hide config files

Although configs are a nice way to expose customization to the user, I believe they make Sadedegel more opaque to use. We can keep them private to set sensible defaults (and...

cleanup

Host Sphinx generated documentation on Sadedegel.ai

As the title says. Related to #133: the generated documentation should be hosted on sadedegel.ai. As a related note, the documentation should be mirrored in Turkish.

documentation

cleanup-stay

Add a spelling correction module that uses a dictionary of Turkish words, either by:
1. Adding a SymSpell dependency and using SymSpell (https://pypi.org/project/symspellpy/)
2. Implementing Norvig's algorithm from scratch (https://norvig.com/spell-correct.html)...

enhancement

cleanup-stay
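
The second option in the spelling correction issue above (Norvig's algorithm) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Sadedegel's implementation: the toy `WORD_COUNTS` vocabulary and the names `edits1`, `known`, and `correct` are hypothetical, and the search is limited to edit distance 1 for brevity (Norvig's original also tries distance 2).

```python
from collections import Counter

# Hypothetical toy vocabulary with term frequencies; a real module would
# load a much larger Turkish word-frequency list.
WORD_COUNTS = Counter({"kitap": 50, "kalem": 30, "kapı": 20, "okul": 40})

# Lowercase Turkish alphabet, used to generate candidate edits.
TURKISH_LETTERS = "abcçdefgğhıijklmnoöprsştuüvyz"


def edits1(word):
    """Return all strings one delete, transpose, replace, or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right
                for c in TURKISH_LETTERS]
    inserts = [left + c + right
               for left, right in splits
               for c in TURKISH_LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the words that appear in the vocabulary."""
    return {w for w in words if w in WORD_COUNTS}


def correct(word):
    """Return the most frequent known candidate, or the word unchanged."""
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=lambda w: WORD_COUNTS[w])
```

Calling `correct("kitab")` looks up the misspelling, finds no exact match, and then picks the most frequent vocabulary word among its one-edit neighbours. Words with no known neighbour are returned as they are.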

Add fastText Turkish vectorization

Implement a ~~Word2Vec~~ fastText vectorizer, as BERT might be too heavy for most use cases and TF-IDF vectors might not be sufficient. It might be a good idea to split this issue...

enhancement

fixed

cleanup-stay

Docstrings in Sphinx format

Add docstrings to (public) methods and properties in Sphinx docstring format. Each pull request for this issue can address a single module. See: https://sphinx-rtd-tutorial.readthedocs.io/en/latest/docstrings.html

documentation

lowprio

cleanup-stay
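
For reference, a docstring in the Sphinx field-list style requested in the docstring issue above might look like the following. The function `count_tokens` is a hypothetical example written for illustration and is not part of Sadedegel.

```python
def count_tokens(text, lowercase=True):
    """Count whitespace-separated tokens in a text.

    :param text: Input text to tokenize.
    :type text: str
    :param lowercase: Whether to lowercase tokens before counting.
    :type lowercase: bool
    :return: Mapping from token to its number of occurrences.
    :rtype: dict
    :raises TypeError: If ``text`` is not a string.
    """
    if not isinstance(text, str):
        raise TypeError("text must be a str")
    tokens = text.lower().split() if lowercase else text.split()
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts
```

The `:param:`, `:type:`, `:return:`, `:rtype:`, and `:raises:` fields are what Sphinx's autodoc renders into the parameter and return sections of the generated page.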

Getting BERT embeddings does not handle sequences longer than 512 (BERT's maximum sequence length)

As the title says: if a sentence contains more than 512 tokens, the maximum sequence length of BERT, an `IndexError: index out of range in self` is raised from Embedding...

bug

fixed
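
One possible way to avoid the error described in the BERT sequence-length issue above is to split a long token id sequence into windows that each fit within the model limit (simple truncation is another option). The sketch below assumes a hypothetical helper `chunk_token_ids`. The `cls_id` and `sep_id` arguments stand for the tokenizer's special token ids and must come from the actual vocabulary. How the per-window embeddings are combined afterwards is left open, and this is not necessarily how the issue was fixed in Sadedegel.

```python
MAX_LEN = 512  # BERT's maximum sequence length, as stated in the issue


def chunk_token_ids(token_ids, cls_id, sep_id, max_len=MAX_LEN):
    """Split token ids (without special tokens) into windows that fit max_len.

    Each window is wrapped as [CLS] ... [SEP], so it holds at most
    max_len - 2 content tokens.
    """
    if max_len < 3:
        raise ValueError("max_len must leave room for at least one content token")
    body = max_len - 2
    chunks = []
    for start in range(0, len(token_ids), body):
        chunks.append([cls_id] + token_ids[start:start + body] + [sep_id])
    # An empty input still yields one valid (special-tokens-only) window.
    return chunks or [[cls_id, sep_id]]
```

Each returned window can be passed to the model separately, so no single input exceeds the position-embedding table that triggers the `IndexError`.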

Add spelling correction module [resolves #190]

As the title says, this adds a spelling correction module that uses SymSpellPy as the backend. As the vocabulary for spelling correction, a term-frequency vocabulary of approximately 450k terms has been created by...

CUDA support and config for BERT in Doc

Set a config that uses the GPU by default for computing bert_embeddings if CUDA is available. UPD: It will also auto-convert all inputs to BERT to CUDA tensors.

enhancement

lowprio
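
The behaviour described in the CUDA issue above could be sketched roughly as follows. The helper names `pick_device` and `to_device` and the `prefer_gpu` flag are hypothetical stand-ins for whatever config option Sadedegel exposes. The only PyTorch calls used are `torch.cuda.is_available()` and `Tensor.to(device)`.

```python
def pick_device(prefer_gpu=True):
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    if not prefer_gpu:
        return "cpu"
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to move to a GPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def to_device(batch, device):
    """Move every tensor in a dict of model inputs to the given device."""
    return {name: tensor.to(device) for name, tensor in batch.items()}
```

A caller would compute `device = pick_device()` once, move the model with `model.to(device)`, and pass each tokenized batch through `to_device` before the forward pass.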