WIP: Add benchmark for tfidf
Add a vanilla Rust program that calculates TF-IDF scores and returns the top 10 most similar documents.
We use cosine similarity to score the relevancy of documents for a given query.
@ashvardanian, thanks for your time! Wanted to check with you on the approach to see if it makes sense to you as a valid benchmark.
- Use the leipzig1m and the XL-Sum datasets as corpora. I assume that each 10,000 lines in the `leipzig1m` dataset form a single document, and then calculate the TF-IDF scores for each (term, document) pair in the corpus.
- Given a query, calculate the TF-IDF scores of the query based on the same corpus (treating the query as a separate document).
- The next step is to calculate the cosine score. I couldn't find a good source for how to compute the score as a vector for a query or a document (Claude gave me a fn that I could possibly use).
- Once we have a cosine similarity score, sort and fetch the top 10.
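To make the steps above concrete, here is a minimal plain-Rust sketch of the whole pipeline (TF, smoothed IDF, dense vectors over a fixed vocabulary, and cosine). All function names are mine, not from the benchmark; this is just one reasonable reading of the approach, not the actual implementation:

```rust
use std::collections::{HashMap, HashSet};

// Term frequency of each token in one document (raw count / doc length).
fn term_frequencies(doc: &[&str]) -> HashMap<String, f64> {
    let mut counts = HashMap::new();
    for &t in doc {
        *counts.entry(t.to_string()).or_insert(0.0) += 1.0;
    }
    let total = doc.len() as f64;
    counts.values_mut().for_each(|c| *c /= total);
    counts
}

// Smoothed inverse document frequency, scikit-learn style:
// idf(t) = ln((1 + n) / (1 + df(t))) + 1
fn inverse_document_frequency(corpus: &[Vec<&str>]) -> HashMap<String, f64> {
    let n = corpus.len() as f64;
    let mut doc_counts: HashMap<String, f64> = HashMap::new();
    for doc in corpus {
        let unique: HashSet<&&str> = doc.iter().collect();
        for &t in unique {
            *doc_counts.entry(t.to_string()).or_insert(0.0) += 1.0;
        }
    }
    doc_counts
        .into_iter()
        .map(|(t, df)| (t, ((1.0 + n) / (1.0 + df)).ln() + 1.0))
        .collect()
}

// Dense TF-IDF vector over a fixed vocabulary ordering.
fn tfidf_vector(doc: &[&str], vocab: &[String], idf: &HashMap<String, f64>) -> Vec<f64> {
    let tf = term_frequencies(doc);
    vocab
        .iter()
        .map(|t| tf.get(t).copied().unwrap_or(0.0) * idf.get(t).copied().unwrap_or(0.0))
        .collect()
}

// Cosine similarity between two equal-length dense vectors.
fn cosine_similarity(a: &[f64], b: &[f64]) -> Option<f64> {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if na == 0.0 || nb == 0.0 { None } else { Some(dot / (na * nb)) }
}
```

Ranking the top 10 is then just computing `cosine_similarity(query, doc)` for every document and sorting in descending order.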
For benchmarking SimSIMD, I assume the cosine part is where I can use methods from SimSIMD and benchmark them against the vanilla implementation? As for the query: in memchr vs StringZilla, you basically pick a random set of tokens and then benchmark searching from left and right using memchr and StringZilla. I was thinking of doing something similar: picking terms at random from the corpus and constructing random queries to benchmark.
Hi @GoWind! The biggest SimSIMD improvement should come from the Sparse kernels. Have you managed to use them in your TFIDF implementation?
@ashvardanian, not yet. I am trying to figure out what the ideal benchmark could be (and also learning how to use TF-IDF).
In scikit-learn, the `TfidfVectorizer` creates a 2D array for the documents, where each row is a document and there is a column for each "token" in the corpus, so that `array[row][column]` = frequency of the token in the document.
Similarly, we can tokenize the query and run an intersection between the query (as a row vector) and each document in our corpus to get the intersection size.
The intersection would then tell us, for every unique token in the query, whether its frequency matches the frequency of the term in the document (something like using `simsimd_intersect`). Is that what you had in mind when using the sparse kernels?
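For intuition, the scalar operation that a sparse `intersect` kernel accelerates is just a merge-style walk over two sorted token-id arrays. A minimal sketch (my own naming, not the SimSIMD API):

```rust
/// Count the common elements of two sorted, deduplicated token-id arrays.
/// This is the scalar equivalent of what a sparse set-intersection kernel
/// like SimSIMD's `intersect` speeds up with SIMD.
fn intersection_size(a: &[u32], b: &[u32]) -> usize {
    let (mut i, mut j, mut count) = (0, 0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                count += 1;
                i += 1;
                j += 1;
            }
        }
    }
    count
}
```

The token ids would come from a shared vocabulary mapping (token → u32), with each document and the query stored as a sorted array of the ids it contains.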
@GoWind, the intersect function may not be the only one you need. Also look into: spdot_counts and spdot_weights 🤗
Will do, thanks for the pointers! Also reading through the scikit-learn implementation to see how I can possibly do this :)
Hi @ashvardanian, making progress on the TF-IDF-based similarity calculator. I noticed some discrepancy when calculating the cosine similarity for a Vec of f64s via the Rust bindings. The plain Rust cosine calculations match the values I get both from numpy and from the simsimd bindings via Python. For the implementation, I tried to compute a vector of TF-IDF values per query and then compute a cosine similarity between the query and each document, based on the answers on the SO question.
Here is how I prepared the script (added a few hacks to test quickly; will update the script to be a `[[bench]]` in subsequent commits):

```shell
head -n 10000 leipzig1m.txt > leipzig10000.txt
cargo run --bin tfidf leipzig10000.txt
```
I took the first 10k lines and batched them into 10 documents of 1k lines each. The query (hardcoded in the script) is transformed into a vector representation, and I compute the cosine similarities:
```
Similarity for document via simsimd 0: Some(0.5928754241421308)
Similarity for document via plain cosine similarity 0: Some(0.40712457585786876)
Similarity for document via simsimd 1: Some(0.5993249839897541)
Similarity for document via plain cosine similarity 1: Some(0.4006750160102468)
Similarity for document via simsimd 2: Some(0.5914559242162761)
Similarity for document via plain cosine similarity 2: Some(0.408544075783724)
Similarity for document via simsimd 3: Some(0.5998267820476098)
Similarity for document via plain cosine similarity 3: Some(0.40017321795239075)
Similarity for document via simsimd 4: Some(0.5906444006555799)
Similarity for document via plain cosine similarity 4: Some(0.40935559934442023)
Similarity for document via simsimd 5: Some(0.5902192553116478)
Similarity for document via plain cosine similarity 5: Some(0.4097807446883521)
Similarity for document via simsimd 6: Some(0.5943923707602529)
Similarity for document via plain cosine similarity 6: Some(0.4056076292397477)
Similarity for document via simsimd 7: Some(0.6028015678055032)
Similarity for document via plain cosine similarity 7: Some(0.3971984321944968)
Similarity for document via simsimd 8: Some(0.5957380843868555)
Similarity for document via plain cosine similarity 8: Some(0.4042619156131435)
Similarity for document via simsimd 9: Some(0.5913356879962984)
Similarity for document via plain cosine similarity 9: Some(0.4086643120037022)
```
Not sure if I am doing something wrong, but could there be some sort of discrepancy here?
Seems like one is x and the other is 1-x. One is a similarity score and the other is a distance.
Ah, I see, now it makes sense:
```c
SIMSIMD_INTERNAL simsimd_distance_t _simsimd_cos_normalize_f64_neon(simsimd_f64_t ab, simsimd_f64_t a2,
                                                                    simsimd_f64_t b2) {
    if (a2 == 0 && b2 == 0) return 0;
    if (ab == 0) return 1;
    simsimd_f64_t squares_arr[2] = {a2, b2};
    float64x2_t squares = vld1q_f64(squares_arr);
    // ...
    rsqrts = vmulq_f64(rsqrts, vrsqrtsq_f64(vmulq_f64(squares, rsqrts), rsqrts));
    rsqrts = vmulq_f64(rsqrts, vrsqrtsq_f64(vmulq_f64(squares, rsqrts), rsqrts));
    vst1q_f64(squares_arr, rsqrts);
    simsimd_distance_t result = 1 - ab * squares_arr[0] * squares_arr[1];
    return result > 0 ? result : 0;
}
```
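In other words, SimSIMD's cosine kernels return a *distance* (1 − cos θ), while the plain implementation returns a similarity, which is why each pair of numbers above sums to 1. A scalar Rust sketch of the relationship (assuming non-zero vectors; not the SimSIMD code itself):

```rust
// Cosine *distance*, the quantity SimSIMD's cosine kernels return:
// d = 1 - (a . b) / (|a| * |b|). Assumes neither vector is all zeros.
fn cosine_distance(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    1.0 - dot / (na * nb)
}

// Converting a SimSIMD-style distance back to a similarity score.
fn similarity_from_distance(d: f64) -> f64 {
    1.0 - d
}
```

So when ranking the "top 10", either sort the SimSIMD distances ascending or convert them back to similarities and sort descending.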
I was able to get a benchmark of running the cosine search across a database against a query, and the SimSIMD version ran up to 5x faster than the plain Rust version.
I am still not sure how to use spdot here instead of a cosine similarity. Can you give me some pointers on how I can employ spdot and benchmark it? (Now that I have most of the foundation for a TF-IDF implementation, I reckon it should be easier.)
Benchmarks done on an M2 Pro with 32GB of RAM.
For the dataset I used the first 10k lines from the Leipzig dataset: `head -n 10000 leipzig1m.txt > leipzig10000.txt`
```
warning: `simsimd` (bench "tfidf") generated 1 warning (run `cargo fix --bench "tfidf"` to apply 1 suggestion)
    Finished bench [optimized] target(s) in 0.08s
     Running scripts/bench_tfidf.rs (target/release/deps/tfidf-601c525378540ae8)
Gnuplot not found, using plotters backend
Benchmarking TF-IDF Similarity/SimSIMD Cosine Similarity: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 6.7s or enable flat sampling.
TF-IDF Similarity/SimSIMD Cosine Similarity
                        time:   [120.97 ms 121.69 ms 122.57 ms]
                        change: [-10.060% -6.1234% -2.7896%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
Benchmarking TF-IDF Similarity/Rust plain Cosine similarity: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 5.4s.
TF-IDF Similarity/Rust plain Cosine similarity
                        time:   [534.30 ms 535.29 ms 536.42 ms]
                        change: [-8.7229% -5.3702% -2.4358%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
```
Thanks for sharing these benchmarks: great to see SimSIMD achieving a 5x speedup over the plain Rust implementation! To use spdot, you can directly compute the weighted dot product of TF-IDF vectors without the normalization that cosine similarity requires. This is both faster and aligns naturally with the sparse representation of TF-IDF.
For benchmarking, precompute the TF-IDF vectors as sparse arrays (indices for term IDs, weights for TF-IDF values) and replace the cosine similarity calculation with spdot. This avoids unnecessary operations like normalization and takes full advantage of sparsity. Let me know how the updated results compare; excited to see the numbers!
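The scalar operation behind a `spdot_weights`-style kernel is the same sorted-merge walk as set intersection, except that matching indices accumulate a product of weights instead of a count. A hedged plain-Rust sketch of the idea (my own function, not the SimSIMD API):

```rust
/// Scalar sketch of a sparse weighted dot product: each vector is a pair of
/// parallel arrays (sorted term-id indices, TF-IDF weights). Wherever the
/// indices match, accumulate the product of the corresponding weights.
/// This is the operation a `spdot_weights`-style kernel vectorizes.
fn sparse_dot(ai: &[u32], aw: &[f64], bi: &[u32], bw: &[f64]) -> f64 {
    assert_eq!(ai.len(), aw.len());
    assert_eq!(bi.len(), bw.len());
    let (mut i, mut j, mut acc) = (0, 0, 0.0);
    while i < ai.len() && j < bi.len() {
        match ai[i].cmp(&bi[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                acc += aw[i] * bw[j];
                i += 1;
                j += 1;
            }
        }
    }
    acc
}
```

Compared with the dense cosine path, this touches only the non-zero terms of each document, which is where the suggested speedup should come from.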
> you can directly compute the weighted dot product of TF-IDF vectors without normalizing them, as required for cosine similarity. This is both faster and aligns naturally with the sparse representation of TF-IDF. precompute the TF-IDF vectors as sparse arrays (indices for term IDs, weights for TF-IDF values)
Got it, thanks for the pointers. I will set up the benchmarks and share numbers again. This does look promising!
Hey @GoWind! I've added a placeholder file for this benchmark on the main-dev branch of this repo. Let's move it there 🤗
Thanks!
Sure, will move the benchmark code to the linked repo!
Hi @ashvardanian, I noticed that there isn't a NEON implementation for `simsimd_spdot_weights_u16` (or `simsimd_spdot_counts_u16` either). I think it might make sense to add one as part of the benchmark, as I am running this on a Mac (and other Arm devices might benefit anyway).
Also, I wrote the TF-IDF benchmark using f64 and realized that there isn't a spdot version for u16 indices with f32 or f64 weights either. Does it make sense to add those as well?
Also, there seem to be both a native simsimd bf16 type and a `half::bf16` type, and the two are not compatible, at least at the type level. Do you think it might be safe to use the `half::bf16` type and then cast to simsimd bf16s?
Hi @ashvardanian, a Happy New Year!
I didn't have a lot of time to work on this, but could squeeze out some time during the holidays to finish up some stuff. I wrote an f32 NEON implementation for spdot. I noticed that there seems to be only an SVE2 implementation for bf16. My algorithm doesn't seem to be any faster than the serial one, and I was wondering if you could provide any insights on speeding it up?
The vectors I am comparing are about 20-50 elements big. I will try one more round with a larger number of elements to see if the algorithm gets faster beyond a certain size.
Would you also be interested in an f16 implementation on NEON? If there are any improvements I can make to the f32 version, I think I might be able to port them to the f16 / bf16 versions as well.
All measurements done on an M2 Pro with 32GB of RAM.