Roman Yurchak
While implementing query strategies for iterative text categorization (active learning) is probably beyond the scope of FreeDiscovery, this issue aims to ensure that the output of the...
This issue mirrors the ElasticSearch integration issue #117 and considers the integration of FreeDiscovery with the Solr API. Multiple aspects of this question are possible, >...
Currently, FreeDiscovery has two different and incompatible feature extraction modes, determined by the [`use_hashing` option](https://freediscovery.github.io/doc/dev/API_reference.html#a-load-a-dataset-and-initialize-feature-extraction): ## Based on HashingVectorizer + TfidfTransformer - hashes the computed features, cannot recover the vocabulary...
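To illustrate why a hashing-based mode cannot recover the vocabulary while a vocabulary-based mode can, here is a minimal pure-Python sketch. It is not FreeDiscovery's or scikit-learn's implementation: the helper names, the MD5 hash, and the tiny `N_FEATURES` value are assumptions chosen for illustration, and scikit-learn's `HashingVectorizer` uses a different hash function.

```python
import hashlib
from collections import Counter

N_FEATURES = 16  # hypothetical size, kept tiny for illustration


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification)."""
    return text.lower().split()


def hashed_features(text, n_features=N_FEATURES):
    """Map each token to a column index with a stable hash.

    No token-to-index mapping is stored, so the original tokens
    cannot be recovered from the resulting column indices.
    """
    counts = Counter()
    for token in tokenize(text):
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        counts[int(digest, 16) % n_features] += 1
    return dict(counts)


def vocabulary_features(texts):
    """Build an explicit vocabulary, then count tokens per document."""
    vocab = {}
    rows = []
    for text in texts:
        row = Counter()
        for token in tokenize(text):
            index = vocab.setdefault(token, len(vocab))
            row[index] += 1
        rows.append(dict(row))
    return vocab, rows


docs = ["the quick brown fox", "the lazy dog"]
vocab, rows = vocabulary_features(docs)
hashed = [hashed_features(doc) for doc in docs]
```

In this sketch, `vocab` can be inverted to map a column back to its token, whereas `hashed` holds only column indices; the two representations also assign different columns to the same token, which is why they cannot be mixed.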
Currently, when an exception occurs in the REST API (e.g. wrong input arguments were provided), the following happens: - for all exception codes other than `HTTP 500`: a custom error...
Quora recently released a dataset of [400,000 potential near-duplicate sentence pairs](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) with labels indicating whether they are indeed near-duplicates. There is also some work on this dataset done [here](https://explosion.ai/blog/quora-deep-text-pair-classification)...
We need to document how FreeDiscovery stores the extracted features and indexes documents, etc.
It could be useful to add language identification to FreeDiscovery. A possible approach could be to use the Python port of Google's [language-detection](https://github.com/Mimino666/langdetect) library, which, as far as...
For future reference: currently, FreeDiscovery uses TruncatedSVD (LSI) as a preprocessing step that transforms the raw document-term matrix `[n_documents, n_features]` into the semantic space `[n_documents, n_components]`, which is then...
Currently, Nearest Neighbor search is used in the following places: * categorization * DBSCAN * Semantic search. Previous optimization attempts in #15 concluded that for NN search in the...
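For reference, the kind of nearest neighbor search discussed in this item can be sketched as a brute-force scan. This is a minimal pure-Python illustration under stated assumptions, not FreeDiscovery's code: cosine similarity is assumed as the metric, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors in this sketch
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, vectors, k=1):
    """Return the indices of the k vectors most similar to the query.

    Brute force: every vector is compared with the query, so the
    cost grows linearly with the number of documents.
    """
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:k]]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = nearest_neighbors([0.9, 0.1], vectors, k=2)
```

The linear cost per query is what makes this step a natural target for optimization on large collections.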
Currently, FreeDiscovery is benchmarked on the 700k ERDM dataset. It might be worth considering the scalability to larger text datasets, for instance with 1 to 10 M documents. This might...