Roman Yurchak issues

Results: 155 issues of Roman Yurchak

Active learning query strategies

While implementing query strategies for text categorization iterations (active learning) is probably beyond the scope of FreeDiscovery, this issue aims to ensure that the output of the...

Integration with Solr

This issue mirrors the ElasticSearch integration issue #117 and aims to consider the integration of FreeDiscovery with the Solr API. Multiple aspects of this question are possible, >...

Consistent API for feature extraction

Currently FreeDiscovery has 2 different and incompatible feature extraction modes, determined by the [`use_hashing` option](https://freediscovery.github.io/doc/dev/API_reference.html#a-load-a-dataset-and-initialize-feature-extraction):

## Based on HashingVectorizer + TfidfTransformer

- hashes the computed features, cannot recover the vocabulary...
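To illustrate why the hashing mode cannot recover the vocabulary, here is a minimal standard-library sketch contrasting a vocabulary-based count vectorizer with a hashed one. It is not FreeDiscovery's or scikit-learn's implementation: the function names are hypothetical, tokenization is a plain whitespace split, and MD5 stands in for the hash function purely for illustration.

```python
import hashlib
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


def vocabulary_vectorize(docs):
    """Count vectors with an explicit vocabulary (column -> term is known)."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    index = {tok: i for i, tok in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok, count in Counter(tokenize(doc)).items():
            row[index[tok]] = count
        rows.append(row)
    return rows, vocab


def hashing_vectorize(docs, n_features=16):
    """Count vectors where the column is derived from a hash of the term.

    No vocabulary is stored, so a column index cannot be mapped back to
    the term (or terms, in case of collisions) that produced it.
    """
    rows = []
    for doc in docs:
        row = [0] * n_features
        for tok in tokenize(doc):
            digest = hashlib.md5(tok.encode("utf-8")).digest()
            column = int.from_bytes(digest[:4], "little") % n_features
            row[column] += 1
        rows.append(row)
    return rows


docs = ["free discovery of documents", "discovery discovery"]
counts, vocab = vocabulary_vectorize(docs)
hashed = hashing_vectorize(docs, n_features=8)
```

The hashed output has a fixed width regardless of the corpus, which is what makes it convenient for streaming, while the vocabulary-based output keeps the term-to-column mapping at the cost of a pass over the data.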

Handling exceptions in the REST API

Currently, when an exception happens in the REST API (e.g. wrong input arguments were provided), the following happens:

- for all exception codes other than `HTTP 500`: a custom error...

help wanted

REST API

Benchmark near duplicate detection on the Quora dataset

Quora recently released a dataset of [400000 potential near-duplicate sentence pairs](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) with labels indicating whether they are indeed near-duplicates. There is also some work on this dataset done [here](https://explosion.ai/blog/quora-deep-text-pair-classification)...

Document internal file structure

We need to document how FreeDiscovery stores the extracted features and indexes documents, etc.

docs

Add language identification feature

It could be useful to add language identification to FreeDiscovery. A possible approach could be to use the Python port of Google's [language-detection](https://github.com/Mimino666/langdetect) library, which, as far as...

enhancement

Add NNMF

For future reference: currently, FreeDiscovery uses TruncatedSVD (LSI) as a preprocessing step that transforms the raw document-term matrix `[n_documents, n_features]` into the semantic space `[n_documents, n_components]`, which is then...
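As a rough illustration of the shape change described above, the following sketch factorizes a toy document-term matrix with non-negative matrix factorization, assuming that is what "NNMF" refers to here. It uses the classic multiplicative update rules in pure Python; the function names and the toy data are hypothetical, and a real implementation would more likely rely on a library such as scikit-learn's `NMF`.

```python
import random


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(x, w, h):
    """Frobenius norm of the residual x - w @ h."""
    wh = matmul(w, h)
    return sum((xv - av) ** 2 for xr, ar in zip(x, wh) for xv, av in zip(xr, ar)) ** 0.5


def nmf(x, n_components, n_iter=200, seed=0, eps=1e-9):
    """Factorize x [n_documents, n_features] into non-negative
    w [n_documents, n_components] and h [n_components, n_features]."""
    rng = random.Random(seed)
    n_docs, n_feat = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(n_components)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_feat)] for _ in range(n_components)]
    for _ in range(n_iter):
        # h <- h * (w^T x) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[hv * nv / (dv + eps) for hv, nv, dv in zip(hr, nr, dr)]
             for hr, nr, dr in zip(h, num, den)]
        # w <- w * (x h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[wv * nv / (dv + eps) for wv, nv, dv in zip(wr, nr, dr)]
             for wr, nr, dr in zip(w, num, den)]
    return w, h


# Toy document-term counts: 4 documents, 5 terms (made-up data).
X = [[2, 1, 0, 0, 0],
     [3, 1, 0, 0, 1],
     [0, 0, 2, 3, 0],
     [0, 0, 1, 2, 1]]
W, H = nmf(X, n_components=2)
```

Here `W` plays the role of the `[n_documents, n_components]` representation; unlike the LSI projection, all of its entries are non-negative.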

new feature

Efficient indexing for NN queries

Currently the Nearest Neighbor search is used in the following places:

* categorization
* DBSCAN
* Semantic search

Previous optimization attempts in #15 concluded that for NN search in the...
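For context, the baseline that any indexing scheme would need to beat is an exhaustive scan over all document vectors. The sketch below shows such a brute-force search, assuming cosine similarity as the metric (the excerpt does not state which metric is used); the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, vectors, k=1):
    """Return the k most similar vectors as (index, similarity) pairs.

    Brute force: every query costs one similarity computation per
    indexed vector, which is the cost an index tries to avoid.
    """
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    # Highest similarity first; ties broken by lowest index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, s) for s, i in scored[:k]]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = nearest_neighbors([0.9, 0.1], vectors, k=2)
```

The scan is linear in the number of documents per query, so the total cost grows quickly when many queries are issued, as in categorization or DBSCAN.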

large scale

Scaling benchmarks

Currently FreeDiscovery is benchmarked on the 700k ERDM dataset. It might be worth considering the scalability to larger text datasets, for instance with 1 to 10 M documents. This might...

large scale