
iter_confusion_matrices a bottleneck

Open yasirs opened this issue 14 years ago • 3 comments

I am experimenting with very large datasets (~1e6 to 1e7 points). It seems that storing the data as (threshold, label) tuples and then computing the measures and confusion matrices in pure Python is much, much slower than keeping the data in NumPy arrays (where available) and doing vectorized operations on the arrays. I don't know if there is interest in something like this.
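A minimal sketch of the vectorized approach, assuming scores and labels live in parallel NumPy arrays (the function name and signature are illustrative, not part of yard's actual API):

```python
import numpy as np

def confusion_matrices(scores, labels):
    """Confusion-matrix counts at every cut-off, fully vectorized.

    `scores` and `labels` are parallel 1-D arrays; labels are 0/1.
    Returns (tp, fp, tn, fn) arrays, one entry per cut-off.
    """
    order = np.argsort(-scores)      # sort points by descending score
    labels = labels[order]
    tp = np.cumsum(labels)           # positives ranked above each cut-off
    fp = np.cumsum(1 - labels)       # negatives ranked above each cut-off
    fn = tp[-1] - tp                 # positives below the cut-off
    tn = fp[-1] - fp                 # negatives below the cut-off
    return tp, fp, tn, fn
```

All four count arrays come out of a single sort plus a couple of cumulative sums over the whole dataset, which is where the speed-up over per-tuple Python loops comes from.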

I might attempt to implement something like that to be able to handle the large datasets.

yasirs avatar Oct 03 '11 18:10 yasirs

I would definitely be interested in a NumPy-based solution. I'd suggest you start working on it in a separate branch in your fork and then file pull requests when there's something to be merged.

Also, I'd be glad if you could keep the NumPy-based version API-compatible with the original one as much as possible. In the end, I would like to have a version which works with NumPy if that is installed, but which can also live without NumPy.
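One common pattern for that kind of optional dependency, sketched here with a hypothetical helper rather than yard's actual internals:

```python
# Hypothetical module-level guard; the real module layout may differ.
try:
    import numpy as np
except ImportError:
    np = None

def cumulative_sums(values):
    """Prefix sums, vectorized when NumPy is available."""
    if np is not None:
        return np.cumsum(values)
    result, total = [], 0
    for value in values:
        total += value
        result.append(total)
    return result
```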

ntamas avatar Oct 03 '11 19:10 ntamas

Yeah, that's what I am trying to do. I have a branch 'fastnumpy' where I am writing this, and it will definitely run without NumPy. The user-facing API, like the data and curve initializers, will stay compatible, but I am trying to move away from the tuple-based storage to separate per-field arrays, which will change some public method signatures, though mostly ones that most users never call.
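To illustrate the storage change being described (the field names are illustrative): the same data held as row tuples versus as separate per-field arrays:

```python
import numpy as np

# Old layout: one Python tuple per data point.
points = [(0.9, 1), (0.4, 0), (0.7, 1)]

# New layout: one array per field, indexable and vectorizable as a whole.
scores = np.array([0.9, 0.4, 0.7])
labels = np.array([1, 0, 1])

# e.g. ranking all points at once instead of looping over tuples:
order = np.argsort(-scores)
scores, labels = scores[order], labels[order]
```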

yasirs avatar Oct 03 '11 19:10 yasirs

Great! Keep me posted.

ntamas avatar Oct 03 '11 19:10 ntamas