rcppannoy Example duplication

Hi, this example appears twice. Perhaps you had started writing another one...

https://github.com/eddelbuettel/rcppannoy/blob/54d8d34b5bfe16f6353c5f4d474811e2a99913da/R/annoy.R#L166-L175

Oct 25 '24 16:10 SamGG

Good catch! git blame points at @AdamSpannbauer in 2019.

Adam, do you recall if you meant to fill another example in?

Oct 25 '24 16:10 eddelbuettel

Hi, small typo: the fileext value should start with a ".". https://github.com/eddelbuettel/rcppannoy/blob/54d8d34b5bfe16f6353c5f4d474811e2a99913da/R/annoy.R#L180
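To illustrate why the dot matters, here is a minimal sketch using base R only. The `pattern` value and the `"tree"` extension are assumptions chosen for illustration, not necessarily what the package example uses. `tempfile()` pastes `fileext` onto the random name verbatim, so without the dot no real extension is produced.

```r
# fileext is appended as-is: the leading dot is not added automatically
f_bad  <- tempfile(pattern = "annoy", fileext = "tree")   # hypothetical values
f_good <- tempfile(pattern = "annoy", fileext = ".tree")

# The random part of the name differs between calls
basename(f_bad)    # something like "annoy1a2b3c4dtree"
basename(f_good)   # something like "annoy1a2b3c4d.tree"

# Only the second name carries an extension that tools can detect
tools::file_ext(basename(f_bad))
#> [1] ""
tools::file_ext(basename(f_good))
#> [1] "tree"
```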

Dec 04 '24 21:12 SamGG

Maybe you could append this example, which recalls the zero-based indexing and shows some sanity checks. Up to you.

library(RcppAnnoy)

# IRIS EXAMPLE -----------------------------------------------------------------

data(iris)

# Converts to numeric, ignoring the species
X <- as.matrix(iris[,-5])

# Building index
a <- new(AnnoyEuclidean, ncol(X))
a$setSeed(42)
# Load dataset into index; Annoy uses zero indexing
for (i in 1:nrow(X))
  a$addItem(i - 1, X[i,])
# Build forest with 50 trees
a$build(50)
# Reports about the forest
a$getNItems()
#> [1] 150
a$getNTrees()
#> [1] 50

# Performing search
k <- 5 # number of nearest neighbors
nn.index <- matrix(nrow = nrow(X), ncol = k)
for (i in 1:nrow(X))
  nn.index[i,] <- a$getNNsByVector(X[i,], k)
# Annoy uses zero indexing, so index must be incremented
nn.index <- nn.index + 1
# The first match is the query itself most of the time
plot(1:nrow(X), nn.index[,1])

# Explore the second nearest neighbor
opar <- par(mfrow = c(2, 2))
for (i in 1:ncol(X))
  plot(X[, i], X[nn.index[,2], i], xlab = colnames(X)[i], ylab = "nearest")

par(opar)

# Perform search with distance
k <- 5
nn.index <- matrix(nrow = nrow(X), ncol = k)
nn.distance <- matrix(nrow = nrow(X), ncol = k)
for (i in 1:nrow(X)) {
  res <- a$getNNsByVectorList(X[i,], k, -1, TRUE)
  nn.index[i,] <- res$item
  nn.distance[i,] <- res$distance
}
# Annoy uses zero indexing, so index must be incremented
nn.index <- nn.index + 1
# Explore distance to the second nearest neighbor
hist(nn.distance[,2], xlab = "Distance to the 2nd NN", 
     main = "Histogram of distance")


# Unload index from memory
a$unload()
rm(a)

Created on 2024-12-04 with reprex v2.1.1

Dec 04 '24 21:12 SamGG

Thanks for catching the missing dot in the fileext argument. I just added it, and re-created the help page where the example appears.

As for the examples: maybe in demo/ ? I will take another look later.

Dec 04 '24 21:12 eddelbuettel

Thanks for your feedback. If you don't add the demo, no problem, but please put a line of code in the example about adding 1 to the returned index; it might help.
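Something along these lines, as a minimal sketch rather than a proposal for the exact wording of the help page. It assumes the iris data and the same seed and tree count as in my example above; `nn0` and `nn1` are just illustrative names.

```r
library(RcppAnnoy)

X <- as.matrix(iris[, -5])
a <- new(AnnoyEuclidean, ncol(X))
a$setSeed(42)
# Items are stored under zero-based ids: R row i becomes Annoy item i - 1
for (i in seq_len(nrow(X)))
  a$addItem(i - 1L, X[i, ])
a$build(50)

# Query the neighbors of R row 1, i.e. Annoy item 0
nn0 <- a$getNNsByItem(0, 5)
# Annoy returns zero-based ids; add 1 before using them as R row indices
nn1 <- nn0 + 1L
X[nn1, ]

a$unload()
```

Without the `+ 1L`, `X[nn0, ]` would silently drop any id equal to 0 and shift every other neighbor by one row.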

Finally, it would be worth mentioning the BiocNeighbors package. It simplifies the indexing and the interface, and it operates on whole matrices. But in my hands, using it to build an index of 4M points and to query sets of 100k points only led to a speed gain of about 10%. Best.
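For comparison, a minimal sketch of the same iris search through BiocNeighbors, assuming the package is installed from Bioconductor and that `findKNN()` with `AnnoyParam()` behaves as documented there. The tree count of 50 simply mirrors my example and is not a recommendation.

```r
library(BiocNeighbors)

X <- as.matrix(iris[, -5])
set.seed(42)

# One call on the whole matrix: the index is built and queried internally
res <- findKNN(X, k = 5, BNPARAM = AnnoyParam(ntrees = 50))

# Indices are already one-based R row numbers, and each point is
# excluded from its own neighbor list
dim(res$index)
dim(res$distance)
head(res$index)
```

Note that the first column here is the nearest other point, so it corresponds to the second column of the RcppAnnoy result, where the query itself usually comes first.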

Dec 04 '24 22:12 SamGG

We could also write it as a vignette -- there is already a contributed one. Its sources are in inst/rmd/ if you want to take a peek. (I like pre-made PDF vignettes as they avoid all possible surprises in processing at CRAN or elsewhere.) We can discuss different approaches -- putting it into demo/ was just one suggestion.

Dec 05 '24 01:12 eddelbuettel

I had actually never committed the fix for your original issue, the example duplication. That is now done, which closed the issue.

Dec 07 '24 15:12 eddelbuettel

The example is nice. I am committing it now as a new demo(). Let me know if you want credits in the ChangeLog.

Dec 07 '24 15:12 eddelbuettel

Thanks, it's great as you did it.

Dec 07 '24 16:12 SamGG

I am usually a stickler for an entry in the ChangeLog ... but this should do.

Nice to see you at INSERM in Marseille. I spent a few grad school years in the 2nd arrondissement, as I was at the Vieille Charité.

Dec 07 '24 16:12 eddelbuettel