
Clusterization capabilities

Open care1e55 opened this issue 3 years ago • 2 comments

Hi. Manual labeling probably doesn't have to be the only option: a clustering algorithm could produce initial groups that are then fixed manually with this tool. If you agree, I would love to implement optional use of clustering techniques from sklearn, such as k-means and DBSCAN, in this tool. I would also love to participate in other activities.

care1e55 · Sep 16 '22 10:09

The goal for bulk is to remain a lightweight tool that just takes care of the "bulk annotation" bit. If you want to cluster the data yourself upfront and use those clusters as colors, then you can totally do that already, as long as the resulting .csv file has an x and a y column. One challenge with this approach is that bulk will only ever allow two dimensions to be drawn, so you'll also need to think about dimensionality reduction.
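As a rough illustration of the "cluster upfront" route, here is a minimal sketch that clusters some embeddings, reduces them to two dimensions, and writes a .csv with x and y columns. The synthetic embeddings, the choice of KMeans and PCA, the file name `ready.csv`, and the `text` and `color` column names are all assumptions made for the example, not something this thread confirms about bulk.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Stand-in for real embeddings: 300 points in 16 dimensions, drawn around 3 centers.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 16))
embeddings = np.vstack([c + rng.normal(size=(100, 16)) for c in centers])
texts = [f"example {i}" for i in range(len(embeddings))]

# Cluster in the original space. KMeans is one option; DBSCAN would slot in the same way.
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)

# Only two dimensions can be drawn, so reduce to 2D for the x and y columns.
xy = PCA(n_components=2, random_state=0).fit_transform(embeddings)

df = pd.DataFrame(
    {
        "text": texts,                      # assumed column name
        "x": xy[:, 0],
        "y": xy[:, 1],
        "color": cluster_ids.astype(str),   # assumed column name for the cluster id
    }
)
df.to_csv("ready.csv", index=False)  # hypothetical output file
```

Any other dimensionality reduction method could replace PCA here; the only hard requirement stated above is that the file ends up with x and y columns.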

In my experience clustering is a very hard problem to get right because it's very hard to "know" if you've done clustering well. There's not a metric like "accuracy" that you can use in hindsight to help you compare approaches.

Instead, I prefer to just lower the dimensionality of an embedded dataset to eye-ball if clusters appear. If they do, and I can confirm by inspecting, then I attach a label. From here it's a classification problem, which allows me to circumvent the need for clustering.
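To make the last step concrete, here is a minimal sketch of turning a few hand-labeled points into a classification problem. The synthetic embeddings, the label names `a` and `b`, and the choice of `LogisticRegression` are assumptions for illustration only; in practice the labeled indices would come from selections made during annotation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in embeddings: two well-separated groups in 16 dimensions.
group_a = rng.normal(loc=0.0, size=(100, 16))
group_b = rng.normal(loc=3.0, size=(100, 16))
embeddings = np.vstack([group_a, group_b])

# Pretend 20 points from each visible cluster were selected and labeled by hand.
labeled_idx = np.concatenate([np.arange(0, 20), np.arange(100, 120)])
labels = np.array(["a"] * 20 + ["b"] * 20)

# Train on the labeled subset, then predict labels for everything.
clf = LogisticRegression(max_iter=1000).fit(embeddings[labeled_idx], labels)
predicted = clf.predict(embeddings)
```

Unlike a clustering run, a classifier like this can be scored against a held-out set of labeled examples, which gives a metric for comparing approaches.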

koaning · Sep 16 '22 15:09

Feel free to tell me if I misinterpreted the request, but as-is I don't think this library needs to concern itself with clustering. That's something you're free to do upfront in a notebook as you prepare a csv file for this tool.

koaning · Sep 16 '22 15:09