David Jurgens
Just to chime in, we've seen this same issue crop up with the `irds:nfcorpus/dev` dataset too. @seanmacavaney is there any update on getting the encoding fix branch merged? I only...
Due to the terms of service for several datasets, we can't officially release the training data via GitHub, so there's no n-gram data in it. On Sat, Nov 23, 2019...
Hi Jiang, You'll need to compute the words you want to use first and then use the --token-filter option to restrict which words are retained. Also, please use the mailing...
Overall, the changes look good. I have a few concerns where the graph package and some of the matrix-as-graph classes overlap in functionality. It would be good to present a...
Hi Johann, This looks much cleaner than what we had and fixes the issue. I'm happy to integrate this into the trunk if you want! Thanks, David On Sat, Apr...
Hi Luboš, You bring up a good point. Our implementation of TF-IDF is using the term's _probability_ in the document, rather than its frequency. Using the probability discounts the impact...
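To make the distinction concrete, here is a minimal sketch of a probability-based TF-IDF weight. It assumes that "probability" means the term's count divided by the total number of tokens in the document, and that IDF is the usual log of total documents over documents containing the term. The class and method names are hypothetical and are not taken from the library.

```java
class TfIdfSketch {
    // Assumed definition: term "probability" is the term's count in the
    // document divided by the document's total token count.
    static double termProbability(int termCount, int docLength) {
        return (double) termCount / docLength;
    }

    // Assumed standard IDF: natural log of (total docs / docs containing the term).
    static double idf(int numDocs, int docsWithTerm) {
        return Math.log((double) numDocs / docsWithTerm);
    }

    // Weight = probability * IDF (hypothetical helper for illustration).
    static double tfIdf(int termCount, int docLength, int numDocs, int docsWithTerm) {
        return termProbability(termCount, docLength) * idf(numDocs, docsWithTerm);
    }

    public static void main(String[] args) {
        // Same raw count (3) in a short and a long document: the
        // probability-based weight is smaller for the longer document.
        System.out.println(tfIdf(3, 12, 100, 10));
        System.out.println(tfIdf(3, 120, 100, 10));
    }
}
```

With raw frequency, both documents in `main` would receive the same term weight. With the probability, the weight depends on document length.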
The first two are bugs. I'm going to add @Ignore to the other tests until we decide to fix them, as they are fairly old now. I'll push the changes...
Hi Guilherme, Thanks for spotting this! Yes, the documentation is a bit out of date. We hadn't distributed the jar packages since people need to compile for different versions of...
These errors look like they're due to the localization differences in how decimal numbers are formatted, e.g., "1.500" vs. "1,500". I thought we had addressed this at one point, but...
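A small sketch of the kind of locale dependence described above, using only the standard `String.format(Locale, ...)` call. The class name `LocaleFormatDemo` is hypothetical, and this is not the project's test code. It only shows how the same value can format differently depending on the locale.

```java
import java.util.Locale;

class LocaleFormatDemo {
    // Formats a value with three decimals using the given locale's
    // decimal separator.
    static String format(double value, Locale locale) {
        return String.format(locale, "%.3f", value);
    }

    public static void main(String[] args) {
        System.out.println(format(1.5, Locale.US));      // period as decimal separator
        System.out.println(format(1.5, Locale.GERMANY)); // comma as decimal separator
    }
}
```

Tests that compare formatted strings tend to be stable across machines only when the locale is passed explicitly (for example `Locale.US` or `Locale.ROOT`) rather than taken from the platform default.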
Hi Cheryl, I would like to use the public Assignments cluster(Matrix m, Properties props). You can convert it to an ArrayMatrix instance which will wrap the array, i.e., new...