Matti Lyra
Neither of these metrics has been added to the mini-batch k-means at this point.
Allow a stream to be fed in and deduplicated in parallel. Obviously the deduplication itself cannot happen in parallel, but shingling and minhashing the documents can. Given a fast...
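The split between the parallel and the sequential part could be sketched as follows. This is a minimal illustration, not the library's implementation: the names `shingles`, `minhash`, `estimated_jaccard`, and `deduplicate`, the parameters `k`, `num_perm`, `threshold`, and `workers`, and the use of seeded BLAKE2b hashes are all assumptions made for the example. A thread pool is used only to show the structure; for CPU-bound hashing in CPython a process pool would likely be needed to see a real speed-up.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def shingles(text, k=5):
    """Return the set of character k-shingles of whitespace-normalised text."""
    text = " ".join(text.split())
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _seeded_hash(shingle, seed):
    """Hash a shingle to a 64-bit integer; the seed stands in for a permutation."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(text, num_perm=64, k=5):
    """Compute a MinHash signature: the minimum hash value per seed."""
    doc_shingles = shingles(text, k)
    return tuple(
        min(_seeded_hash(s, seed) for s in doc_shingles)
        for seed in range(num_perm)
    )


def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def deduplicate(docs, threshold=0.9, workers=4):
    """Shingle and minhash in parallel, then deduplicate sequentially."""
    docs = list(docs)  # assumption: the stream is materialised for simplicity
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Parallel part: each signature depends on a single document only.
        signatures = list(pool.map(minhash, docs))
    kept = []
    # Sequential part: each decision depends on everything kept so far.
    for doc, sig in zip(docs, signatures):
        if all(estimated_jaccard(sig, other) < threshold for _, other in kept):
            kept.append((doc, sig))
    return [doc for doc, _ in kept]
```

The sequential loop compares each signature against every kept one, which is quadratic; an LSH index over signature bands would replace that loop in practice.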
At the moment all the documents are stored in an in-memory database. It should be possible to define this to be anything that supports getting and setting items.
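One way to read this request is that the document store only needs the item protocol (`__getitem__` and `__setitem__`). The sketch below illustrates that idea; `DocumentStore` and `UppercaseKeyBackend` are hypothetical names invented for the example and are not part of the library.

```python
class DocumentStore:
    """Hypothetical store that delegates to any backend with item access."""

    def __init__(self, backend=None):
        # Default to a plain dict, i.e. the in-memory behaviour.
        self._backend = {} if backend is None else backend

    def add(self, doc_id, text):
        self._backend[doc_id] = text

    def get(self, doc_id):
        return self._backend[doc_id]


class UppercaseKeyBackend:
    """Toy backend that implements only getting and setting items."""

    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        return self._data[key.upper()]

    def __setitem__(self, key, value):
        self._data[key.upper()] = value
```

Under this design a `dict`, a `shelve` shelf, or a thin wrapper around a key-value database could be passed as `backend` without changes to the store itself.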
lsh should be `pip` installable, use `cookiecutter`
SimHash is another LSH technique for near-duplicate detection; it relies on cosine similarity instead of Jaccard similarity. https://en.wikipedia.org/wiki/SimHash https://doi.org/10.1145/509907.509965
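For orientation, a common token-based form of SimHash can be sketched as below. This is an illustrative variant under stated assumptions, not a proposed implementation for the library: the function names, the whitespace tokenisation, the unweighted tokens, and the choice of BLAKE2b as the token hash are all made up for the example.

```python
import hashlib


def simhash(text, bits=64):
    """Compute a SimHash fingerprint; bits must be a multiple of 8, at most 512."""
    counts = [0] * bits
    for token in text.lower().split():
        digest = hashlib.blake2b(
            token.encode("utf-8"), digest_size=bits // 8
        ).digest()
        token_hash = int.from_bytes(digest, "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        # The sign of the accumulated vote decides the fingerprint bit.
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Documents whose fingerprints differ in only a few bits are treated as near-duplicate candidates; the cut-off is a tuning choice.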
When adding a new `dask_ec2.Instance` to an existing `dask_ec2.Cluster`, the username and keypair parameters are not copied to the instance, which consequently causes the `ssh_client` of the `Instance` to...
I really like creating slide presentations using Jupyter Notebooks, but the workflow is currently fairly cumbersome. I often find myself wanting to control HTML `` tag parameters like - slide...
I've been piecing together an auto-scaling `dask` cluster on AWS, using `adaptive` and bits from `dask_ec2`. It would be really useful to know what the semantics of the `scale_up` and...
There is an issue with how the scheduler assigns tasks from the `unrunnable` queue to workers that join the scheduler and meet the resource requirements. The use case is some long...
The subprocesses running under ShellBolt and ShellSpout should have information about the topology context they run in, specifically their component ID and the sources and targets. I think there is already...