Example duplication
Hi, Perhaps you've started another example...
https://github.com/eddelbuettel/rcppannoy/blob/54d8d34b5bfe16f6353c5f4d474811e2a99913da/R/annoy.R#L166-L175
Good catch! git blame points at @AdamSpannbauer in 2019.
Adam, do you recall if you meant to fill another example in?
Hi, Small typo: there should be a "." in fileext. https://github.com/eddelbuettel/rcppannoy/blob/54d8d34b5bfe16f6353c5f4d474811e2a99913da/R/annoy.R#L180
Maybe you could append this example, which recalls the zero-based indexing and shows some sanity checks. Up to you.
library(RcppAnnoy)
# IRIS EXAMPLE -----------------------------------------------------------------
data(iris)
# Converts to numeric, ignoring the species
X <- as.matrix(iris[,-5])
# Build the index
a <- new(AnnoyEuclidean, ncol(X))
a$setSeed(42)
# Load dataset into index; Annoy uses zero indexing
for (i in 1:nrow(X))
  a$addItem(i - 1, X[i,])
# Build forest with 50 trees
a$build(50)
# Reports about the forest
a$getNItems()
#> [1] 150
a$getNTrees()
#> [1] 50
# Performing search
k <- 5 # number of nearest neighbors
nn.index <- matrix(nrow = nrow(X), ncol = k)
for (i in 1:nrow(X))
  nn.index[i,] <- a$getNNsByVector(X[i,], k)
# Annoy uses zero indexing, so index must be incremented
nn.index <- nn.index + 1
# The first match is the query itself most of the time
plot(1:nrow(X), nn.index[,1])

# Explore the second nearest neighbor
opar <- par(mfrow = c(2, 2))
for (i in 1:ncol(X))
  plot(X[, i], X[nn.index[,2], i], xlab = colnames(X)[i], ylab = "nearest")

par(opar)
# Perform search with distance
k <- 5
nn.index <- matrix(nrow = nrow(X), ncol = k)
nn.distance <- matrix(nrow = nrow(X), ncol = k)
for (i in 1:nrow(X)) {
  res <- a$getNNsByVectorList(X[i,], k, -1, TRUE)
  nn.index[i,] <- res$item
  nn.distance[i,] <- res$distance
}
# Annoy uses zero indexing, so index must be incremented
nn.index <- nn.index + 1
# Explore distance to the second nearest neighbor
hist(nn.distance[,2], xlab = "Distance to the 2nd NN",
     main = "Histogram of distance")

# Unload index from memory
a$unload()
rm(a)
Created on 2024-12-04 with reprex v2.1.1
Thanks for catching the missing dot in the fileext argument. I just added it, and re-created the help page where the example appears.
As for the examples: maybe in demo/ ? I will take another look later.
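For readers wondering why the missing dot in fileext matters, here is a minimal base R sketch. The ".tree" extension and the "annoy" pattern are assumptions chosen for illustration, not necessarily what the package example uses. tempfile() appends fileext verbatim, so without the leading dot the file name ends up with no extension at all.

```r
# Hypothetical pattern and extension, for illustration only
with_dot    <- tempfile(pattern = "annoy", fileext = ".tree")
without_dot <- tempfile(pattern = "annoy", fileext = "tree")

# The random middle part of the name differs on every call
basename(with_dot)
basename(without_dot)

# Only the first name carries a real file extension
ext_with    <- tools::file_ext(basename(with_dot))     # "tree"
ext_without <- tools::file_ext(basename(without_dot))  # ""
```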
Thanks for your feedback. If you don't add the demo, no problem, but please put a line of code in the example about adding 1 to the returned index; it might help.
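A minimal sketch of the index shift requested here, reusing only calls that already appear in the example above (the variable names ids0 and ids1 are made up for illustration). One reason the shift is easy to forget is that R silently drops a zero subscript, so indexing with the raw Annoy ids does not raise an error; it just returns the wrong rows.

```r
library(RcppAnnoy)

X <- as.matrix(iris[, -5])

a <- new(AnnoyEuclidean, ncol(X))
a$setSeed(42)
for (i in seq_len(nrow(X))) {
  a$addItem(i - 1, X[i, ])   # Annoy item ids start at 0
}
a$build(50)

# Query with the first row of X; the returned ids are zero-based
ids0 <- a$getNNsByVector(X[1, ], 5)

# Add 1 before using them as R row indices
ids1 <- ids0 + 1
X[ids1, ]

a$unload()
```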
Finally, it would be worth mentioning the BiocNeighbors package. It simplifies the indexing and the interface, and it operates on whole matrices. But in my hands, using it to build an index of 4M points and to query sets of 100k points only led to a speed gain of 10%. Best.
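For comparison, a hedged sketch of what the whole-matrix interface looks like with BiocNeighbors. It assumes the package is installed from Bioconductor and that findKNN() with BNPARAM = AnnoyParam() behaves as in recent releases; check the package documentation for the version you have. The returned indices are expected to be ordinary 1-based R row numbers, so no manual shift is needed.

```r
library(BiocNeighbors)

X <- as.matrix(iris[, -5])

# Approximate k-nearest-neighbor search for every row at once, backed by Annoy
res <- findKNN(X, k = 5, BNPARAM = AnnoyParam())

dim(res$index)      # one row per point, one column per neighbor
dim(res$distance)   # matching distances
head(res$index)
```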
We could also write it as a vignette -- there is already a contributed one. Its sources are in inst/rmd/ if you want to take a peek. (I like pre-made PDF vignettes as they avoid all possible surprises in processing at CRAN or elsewhere.) We can discuss different approaches -- putting it into demo/ was just one suggestion.
I had actually never committed the fix for your original issue, the example duplication. That is now done, which closed the issue.
The example is nice. I am committing it now as a new demo(). Let me know if you want credits in the ChangeLog.
Thanks, it's great the way you did it.
I am usually a stickler for an entry in the ChangeLog ... but this should do.
Nice to see you at INSERM in Marseille. I spent a few grad school years in the 2nd arrondissement, as I was at the Vieille Charite.