Roman Yurchak issues

Results: 155 issues of Roman Yurchak

Implement IDF transforms

It would be necessary to implement IDF transforms, and possibly expose a `TfidfVectorizer` estimator. This requires selecting a sparse array library. For now, we use custom `CSRArray` structs to represent...

new feature
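As a rough illustration of what an IDF transform over a CSR layout involves, here is a minimal pure-Python sketch. It assumes the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` (the variant scikit-learn uses when `smooth_idf=True`); the function names `idf_weights` and `apply_idf` are hypothetical and not part of vtext.

```python
import math


def idf_weights(indptr, indices, n_features):
    """Smoothed IDF weights from a CSR structure with one row per document."""
    n_docs = len(indptr) - 1
    df = [0] * n_features
    for row in range(n_docs):
        # Count each feature at most once per document.
        for col in set(indices[indptr[row]:indptr[row + 1]]):
            df[col] += 1
    # Assumed formula: ln((1 + n) / (1 + df)) + 1
    return [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]


def apply_idf(indices, data, idf):
    """Scale every stored term frequency by the IDF weight of its column."""
    return [value * idf[col] for value, col in zip(data, indices)]


# Three documents over three features, stored as CSR arrays.
indptr = [0, 2, 3, 5]
indices = [0, 1, 0, 0, 2]
data = [1.0, 2.0, 3.0, 1.0, 1.0]

idf = idf_weights(indptr, indices, n_features=3)
tfidf_data = apply_idf(indices, data, idf)
```

Only the `data` array changes; `indptr` and `indices` can be reused as-is, which is one reason the choice of sparse array representation matters here.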

Build release wheels with LTO

Link-time optimization (LTO) would improve vtext performance by around 10% and also reduce binary sizes. However, it also significantly increases compilation/link time, so we may want to...

build / CI

performance

Make estimators picklable

Currently, Python classes / functions generated with Pyo3 are not picklable (https://github.com/PyO3/pyo3/issues/100) which makes their use problematic in typical data science workflows (e.g. with joblib parallel or in scikit-learn pipelines)....

python
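One possible workaround, sketched below, is a thin pure-Python wrapper that pickles only its constructor arguments and rebuilds the native object on load. The class `PicklableTokenizer` and its `lang` parameter are hypothetical; a lambda stands in for the non-picklable extension object, and this is not how vtext itself handles the problem.

```python
class PicklableTokenizer:
    """Pure-Python wrapper around a hypothetical non-picklable native object."""

    def __init__(self, lang="en"):
        self.lang = lang
        # Stand-in for constructing the native (e.g. PyO3-generated) object.
        self._native = self._build_native(lang)

    @staticmethod
    def _build_native(lang):
        # Hypothetical placeholder: a real wrapper would instantiate the
        # extension class here. A lambda is used because it cannot be pickled.
        return lambda text: text.split()

    def tokenize(self, text):
        return self._native(text)

    def __reduce__(self):
        # Serialize only the constructor arguments; the native object is
        # rebuilt by calling the constructor again when unpickling.
        return (type(self), (self.lang,))
```

With `__reduce__` defined this way, `pickle.loads(pickle.dumps(tokenizer))` goes through the constructor instead of trying to serialize the native state, which is what joblib and scikit-learn pipelines rely on.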

Better unicode support in tokenization rules

Currently, the `VTextTokenizer` first computes Unicode segmentation (which should handle Unicode well by definition), then applies a few simple rules on top to produce tokenization that is more standard in...

tokenization

Handle detailed postal codes for Great Britain

The GB dataset for Great Britain only includes [outwards codes](https://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom#Outward_code) (i.e. only first 3 letters). The full dataset is included in `GB_full`, but currently this fails to load, ```py >>>...

enhancement

good first issue
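To make the outward/inward distinction concrete, the sketch below splits a full UK postcode into its two parts. It relies on the inward code always being the last three characters; the helper name `split_postcode` is hypothetical, and the length check is a loose sanity check, not full validation.

```python
def split_postcode(postcode):
    """Split a full UK postcode into (outward, inward) codes."""
    compact = "".join(postcode.split()).upper()
    # Full postcodes have 5 to 7 characters once whitespace is removed.
    if not 5 <= len(compact) <= 7:
        raise ValueError(f"not a full postcode: {postcode!r}")
    # The inward code is always the final three characters.
    return compact[:-3], compact[-3:]


outward, inward = split_postcode("SW1A 1AA")
```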

Faster dataset loading

Currently we load datasets from gzipped CSV files with `pd.read_csv`. Loading should be much faster if we convert the data to Parquet format and use `pd.read_parquet` (this might also reduce the...

Inverse postal code geocoding

It should be fairly straightforward to implement inverse geocoding, i.e. finding the closest city/region/country for given coordinates. This would require downloading all the datasets, however. Ideally, the nearest neighbor lookup...

new feature
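A brute-force version of such a lookup can be sketched with the haversine distance. The `(name, latitude, longitude)` tuple layout, the function names, and the approximate coordinates below are illustrative assumptions; a real implementation would more likely use a spatial index than a linear scan.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to guard against rounding pushing the value slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def nearest_place(lat, lon, places):
    """Return the (name, lat, lon) tuple closest to the query point."""
    return min(places, key=lambda p: haversine_km(lat, lon, p[1], p[2]))


# Illustrative entries with approximate coordinates.
PLACES = [
    ("Paris", 48.8566, 2.3522),
    ("London", 51.5074, -0.1278),
    ("Berlin", 52.5200, 13.4050),
]

closest = nearest_place(48.9, 2.3, PLACES)
```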

Scalability of the uncertainty propagation in numpy arrays

## Statement of the problem Currently uncertainties supports numpy arrays by stacking `uncertainties.ufloat` objects inside a `numpy.array( , dtype="object")` array. This is certainly nice as it allows one to automatically use...

Internals

NumPy+uncertainties

Python wrapper

I am considering experimenting with a Python wrapper for sprs, probably in some separate repo. The situation with sparse array libraries is not so great in the Python...

Creating CSR array with duplicates fails

Currently, `CsMat::new` appears to fail on input that has duplicate entries (i.e. when multiple column indices within a row of a CSR matrix are the same). They are summed by default,...
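For reference, summing duplicate entries in a CSR structure can be sketched in a few lines of pure Python. The function name `sum_duplicates` and the list-based representation are illustrative assumptions; this is not the sprs API.

```python
def sum_duplicates(indptr, indices, data):
    """Return CSR arrays where repeated column indices in a row are summed."""
    new_indptr, new_indices, new_data = [0], [], []
    for row in range(len(indptr) - 1):
        merged = {}
        for k in range(indptr[row], indptr[row + 1]):
            merged[indices[k]] = merged.get(indices[k], 0) + data[k]
        # Emit columns in sorted order so each row ends up canonical.
        for col in sorted(merged):
            new_indices.append(col)
            new_data.append(merged[col])
        new_indptr.append(len(new_indices))
    return new_indptr, new_indices, new_data


# Row 0 stores column 1 twice; row 1 has a single entry.
result = sum_duplicates([0, 3, 4], [1, 1, 0, 2], [1.0, 2.0, 5.0, 7.0])
```

The output also has sorted column indices within each row, so it could serve as a preprocessing step before handing the arrays to a constructor that expects canonical input.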