Askar Bozcan
Most of the datasets are hosted on S3 and require an access key and a secret key, which makes them inaccessible to nearly all users. ALL data, if provided as part of Sadedegel, should...
Currently, SBD's reliance on a pickled Sklearn model forces Sadedegel to pin Sklearn to version 0.23.1. This might cause issues when installing Sadedegel. TODO: Convert the pipeline to...
Although configs are a nice way to expose customization to the user, I believe they make Sadedegel more opaque to use. We can keep them private to set sensible defaults (and...
As the title says (related to #133), the generated documentation should be hosted on sadedegel.ai. As a related note, the documentation should be mirrored in Turkish.
Add a spelling correction module that uses a dictionary of Turkish words, either by: 1- Adding a SymSpell dependency and using SymSpell (https://pypi.org/project/symspellpy/) 2- Implementing Norvig's algorithm from scratch (https://norvig.com/spell-correct.html)...
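To make the second option concrete, here is a minimal sketch of a Norvig-style corrector adapted to the lowercase Turkish alphabet. The tiny `WORD_COUNTS` table and all function names are illustrative assumptions, not part of Sadedegel; a real module would load a large term-frequency vocabulary instead.

```python
from collections import Counter

# Lowercase Turkish alphabet (29 letters).
TURKISH_LETTERS = "abcçdefgğhıijklmnoöprsştuüvyz"

# Toy term-frequency table (assumption); a real module would load a
# large Turkish vocabulary with observed counts.
WORD_COUNTS = Counter({"kitap": 50, "kalem": 30, "okul": 40, "öğrenci": 20})


def edits1(word):
    """Return every string that is one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [
        left + c + right[1:] for left, right in splits if right for c in TURKISH_LETTERS
    ]
    inserts = [left + c + right for left, right in splits for c in TURKISH_LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the candidates that appear in the vocabulary."""
    return {w for w in words if w in WORD_COUNTS}


def correct(word):
    """Return the most frequent known word within two edits, else `word` itself."""
    candidates = (
        known([word])
        or known(edits1(word))
        or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
        or {word}
    )
    return max(candidates, key=lambda w: WORD_COUNTS[w])
```

With this toy vocabulary, `correct("kitab")` resolves to `"kitap"` through a single replacement. The sketch does not handle Turkish-specific casing (for example `İ`/`ı`), which a real module would need to address before lookup.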
Implement ~~Word2Vec~~ fastText vectorizer as BERT might be too heavy for most use cases and TF-IDF vectors might not be sufficient. Might be a good idea to split this issue...
Add docstrings to (public) methods and properties in Sphinx docstring format. Each pull request for this issue can address a single module. See: https://sphinx-rtd-tutorial.readthedocs.io/en/latest/docstrings.html
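As an illustration of the target format, the sketch below documents a small function with Sphinx-style field lists (`:param:`, `:type:`, `:raises:`, `:return:`, `:rtype:`). The function `split_sentences` is a hypothetical stand-in written for this example, not an existing Sadedegel API.

```python
def split_sentences(text, delimiter="."):
    """Split raw text into sentences on a delimiter.

    :param text: Raw input text.
    :type text: str
    :param delimiter: Character that ends a sentence, defaults to ``"."``.
    :type delimiter: str, optional
    :raises ValueError: If ``delimiter`` is empty.
    :return: Non-empty sentences with surrounding whitespace stripped.
    :rtype: list[str]
    """
    if not delimiter:
        raise ValueError("delimiter must be a non-empty string")
    # Hypothetical helper for illustration only; it does no real sentence
    # boundary detection beyond splitting on the delimiter.
    return [part.strip() for part in text.split(delimiter) if part.strip()]
```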
Getting BERT embeddings does not handle sequences longer than 512 (BERT's maximum sequence length)
As the title says. If a sentence contains more than 512 tokens (BERT's maximum sequence length), `IndexError: index out of range in self` is received from Embedding...
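One possible workaround, sketched here under assumptions, is to truncate the token ids before they reach the model so that the sequence, including the special tokens, never exceeds 512 positions. The function name and the placeholder `cls_id` and `sep_id` values are hypothetical; this is not necessarily the fix adopted in Sadedegel, and truncation discards the tail of long inputs.

```python
MAX_SEQUENCE_LENGTH = 512  # standard BERT position-embedding limit


def truncate_token_ids(token_ids, cls_id, sep_id, max_length=MAX_SEQUENCE_LENGTH):
    """Wrap `token_ids` as [CLS] ... [SEP], keeping at most `max_length` ids in total."""
    if max_length < 2:
        raise ValueError("max_length must leave room for the two special tokens")
    # Reserve two positions for the special tokens, then cut the rest.
    body = list(token_ids)[: max_length - 2]
    return [cls_id] + body + [sep_id]
```

An alternative that keeps all content would be to split the input into several windows of at most 512 ids and combine the resulting embeddings, at the cost of extra forward passes.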
As the title says, added a spelling correction module that uses SymSpellPy as the backend. As the vocabulary for spelling correction, a term-frequency vocabulary of approximately 450k terms has been created by...
Set a config that uses the GPU by default for computing bert_embeddings if CUDA is available. UPD: This will also auto-convert all inputs to BERT to CUDA tensors.
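The behaviour described above could look roughly like the following sketch. The helper names `select_device` and `move_to_device` are assumptions made for this example, not Sadedegel's actual config keys or functions; the only PyTorch calls relied on are `torch.cuda.is_available()` and the `.to(device)` method on tensors.

```python
def select_device(prefer_gpu=True):
    """Return "cuda" when a GPU is wanted and available, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if prefer_gpu and torch.cuda.is_available() else "cpu"


def move_to_device(inputs, device):
    """Move every tensor-like value in a dict of model inputs to `device`."""
    # Values without a `.to` method (plain Python objects) are passed through.
    return {
        name: value.to(device) if hasattr(value, "to") else value
        for name, value in inputs.items()
    }
```

A caller would pick the device once, for example `device = select_device()`, move the model with `model.to(device)`, and pass each batch of inputs through `move_to_device` before the forward pass.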