luceneutil

Create tool to analyze the vector distribution from a set of vectors or .vec file

Open shubhamvishu opened this issue 7 months ago • 1 comment

We could add a tool that takes a .vec file of vectors and generates a nice report (maybe with a visualization?) about the vector distribution: min and max values, etc. This would help us understand and study the nature of the vectors.

shubhamvishu avatar Apr 30 '25 10:04 shubhamvishu

We already have a baby amoeba step here (mikes_tiny_vector_tool.py) -- let's rename it to something better (inspect_vectors.py?), and add some more stats:

  • The user probably has to specify the dimensionality, since .vec is just a giant array of floats
  • Print yes/no for whether the vectors are unit-sphere normalized
  • Print yes/no for whether the vectors are malformed (e.g. annoying float values like +/- Inf or NaN, in all its variants, appear)
  • Print rough per-dimension stats: min, max, mean, median, stddev? Maybe even little baby sparkle histogram...
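
A minimal NumPy sketch of the checks listed above. It assumes the .vec file is a headerless stream of little-endian float32 values (an assumption; the thread only says it is a giant array of floats). The names `load_vec`, `inspect_vectors`, and `sparkline` are hypothetical and not taken from the existing script:

```python
import numpy as np


def load_vec(path, dim, dtype="<f4"):
    """Read a headerless file of floats and reshape it to (num_vectors, dim).

    The little-endian float32 default is an assumption about the file layout.
    """
    flat = np.fromfile(path, dtype=dtype)
    if flat.size % dim != 0:
        raise ValueError(f"{flat.size} floats is not a multiple of dim={dim}")
    return flat.reshape(-1, dim)


def inspect_vectors(vectors, norm_tolerance=1e-4):
    """Return summary stats for a 2D array of shape (num_vectors, dim)."""
    vectors = np.asarray(vectors, dtype=np.float64)
    # A row is malformed if any component is +/-Inf or NaN.
    finite_rows = np.isfinite(vectors).all(axis=1)
    clean = vectors[finite_rows]  # stats below use well-formed rows only
    report = {
        "num_vectors": int(vectors.shape[0]),
        "dim": int(vectors.shape[1]),
        "malformed": bool((~finite_rows).any()),
        "num_malformed": int((~finite_rows).sum()),
        "unit_normalized": False,
    }
    if clean.shape[0] == 0:
        return report  # nothing left to summarize
    norms = np.linalg.norm(clean, axis=1)
    report["unit_normalized"] = bool(
        np.allclose(norms, 1.0, rtol=0.0, atol=norm_tolerance)
    )
    # Per-dimension stats: each entry is an array of length dim.
    report["min"] = clean.min(axis=0)
    report["max"] = clean.max(axis=0)
    report["mean"] = clean.mean(axis=0)
    report["median"] = np.median(clean, axis=0)
    report["stddev"] = clean.std(axis=0)
    return report


def sparkline(values, bins=16):
    """Render a one-line text histogram of values using block characters."""
    counts, _ = np.histogram(values, bins=bins)
    blocks = "▁▂▃▄▅▆▇█"
    top = counts.max()
    if top == 0:
        return blocks[0] * bins
    return "".join(blocks[int(c * (len(blocks) - 1) / top)] for c in counts)
```

Usage would look something like `report = inspect_vectors(load_vec("my.vec", dim=768))`, then `print(sparkline(vectors[:, 0]))` for a rough per-dimension histogram. The file name and dimension here are placeholders.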

Later it'd be awesome to add other interesting summary stats too, e.g. whether the vectors are all confined to a limited set of angles (e.g. if all values are non-negative, they only cover one orthant: 1/8th of the possible space in 3D, and 1/2^d in d dimensions).
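
A rough sketch of that angle-coverage check, again in NumPy. The function name `orthant_summary` and the returned keys are made up for illustration, and the code assumes the vectors contain only finite values:

```python
import numpy as np


def orthant_summary(vectors):
    """Summarize which sign patterns (orthants) a set of vectors occupies.

    Assumes a finite 2D array of shape (num_vectors, dim).
    """
    vectors = np.asarray(vectors)
    non_negative = vectors >= 0
    # Dimensions whose values never change sign across the whole set.
    fixed_sign = non_negative.all(axis=0) | (vectors <= 0).all(axis=0)
    fixed_sign_dims = int(fixed_sign.sum())
    # Each distinct row of sign bits corresponds to one occupied orthant.
    distinct = np.unique(non_negative.astype(np.int8), axis=0).shape[0]
    return {
        "all_non_negative": bool(non_negative.all()),
        "fixed_sign_dims": fixed_sign_dims,
        "distinct_orthants": int(distinct),
        # Every fixed-sign dimension halves the region the vectors can occupy.
        "max_fraction_of_space": 2.0 ** -fixed_sign_dims,
    }
```

For all-non-negative 3D vectors this reports `max_fraction_of_space` of 0.125, matching the 1/8th figure. For typical embedding sizes the fraction underflows toward zero, so `fixed_sign_dims` is probably the more useful number to print.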

Maybe this is a trivial few lines of NumPy code? Or maybe such an open source tool (with a permissive license, e.g. ASL2, MIT, BSD) already exists and we should poach / be inspired?

mikemccand avatar May 27 '25 19:05 mikemccand