Creating a vector index after nodes already exist doesn't make vectorNeighbors return anything
Hello,
ArcadeDB version: yesterday's release (HEAD, actually)
This follows on from https://github.com/ArcadeData/arcadedb/issues/2908 (the data was created as described in that issue), which forces me to create the vector index AFTER the nodes are created:
As you can see, there is some data; I created the index after the nodes were created.
Running SELECT vectorNeighbors('EmbeddingNode2[vector]', [0.0, 0.0, 0.0, 0.0], 10)
returns nothing
ArcadeDB - The Next Generation Multi-Model DBMS
vectorNeighbors('EmbeddingNode2[vector]', [0.0, 0.0, 0.0, 0.0], 10)
[]
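One thing that may be worth ruling out, sketched below under an assumption: if the index on EmbeddingNode2[vector] uses COSINE similarity (the metric is not stated in this report), then an all-zero query vector has zero norm and its cosine similarity to any stored vector is 0/0, which is undefined. The class and method names here are hypothetical, and the sketch says nothing about how ArcadeDB itself handles this case; it only shows the arithmetic.

```java
// Hypothetical helper, not ArcadeDB code: plain cosine similarity.
final class ZeroVectorCosine {

  static float cosine(final float[] a, final float[] b) {
    float dot = 0f, normA = 0f, normB = 0f;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    // With an all-zero vector this becomes 0 / 0.
    return (float) (dot / (Math.sqrt(normA) * Math.sqrt(normB)));
  }

  public static void main(final String[] args) {
    final float[] zeroQuery = { 0.0f, 0.0f, 0.0f, 0.0f };   // the query used in the report
    final float[] stored = { 0.1f, 0.2f, 0.3f, 0.4f };      // made-up stored vector
    System.out.println(cosine(zeroQuery, stored));          // prints NaN
  }
}
```

Retrying the same query with a non-zero vector would help separate this effect from the index-creation order.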
@ExtReMLapin please get the latest main branch; I've just pushed some fixes to the LSM vector index and an optimization (60% less space on disk).
I've also just added a test case similar to yours and it passes:
@Test
void testVectorNeighborsViaSQL() {
  database.transaction(() -> {
    // Create vertex type with vector property
    database.command("sql", "CREATE VERTEX TYPE Product IF NOT EXISTS");
    database.command("sql", "CREATE PROPERTY Product.name IF NOT EXISTS STRING");
    database.command("sql", "CREATE PROPERTY Product.embedding IF NOT EXISTS ARRAY_OF_FLOATS");

    // Create LSM vector index on embedding property
    database.command("sql", """
        CREATE INDEX IF NOT EXISTS ON Product (embedding) LSM_VECTOR
        METADATA {
          "dimensions": 128,
          "similarity": "COSINE",
          "maxConnections": 16,
          "beamWidth": 100
        }""");
  });

  // Insert test data with 128-dimensional vectors
  final int numDocs = 50;
  final List<String> productNames = new ArrayList<>();
  database.transaction(() -> {
    for (int i = 0; i < numDocs; i++) {
      final float[] embedding = new float[128];
      // Create vectors with patterns:
      // - First 10 vectors cluster around [1.0, 0.0, 0.0, ...]
      // - Next 10 vectors cluster around [0.0, 1.0, 0.0, ...]
      // - Next 10 vectors cluster around [0.0, 0.0, 1.0, ...]
      // - Remaining vectors are more random
      if (i < 10) {
        embedding[0] = 1.0f + (i * 0.1f);
        embedding[1] = 0.1f * i;
      } else if (i < 20) {
        embedding[0] = 0.1f * (i - 10);
        embedding[1] = 1.0f + ((i - 10) * 0.1f);
      } else if (i < 30) {
        embedding[0] = 0.1f * (i - 20);
        embedding[1] = 0.1f * (i - 20);
        embedding[2] = 1.0f + ((i - 20) * 0.1f);
      } else {
        // Random-ish vectors for the rest
        for (int j = 0; j < 128; j++) {
          embedding[j] = (float) Math.sin(i * j * 0.01);
        }
      }
      final String name = "Product_" + i;
      productNames.add(name);
      database.command("sql",
          "INSERT INTO Product SET name = ?, embedding = ?",
          name, embedding);
    }
  });
  System.out.println("Inserted " + numDocs + " products with 128-dimensional vectors");

  // Test 1: Find neighbors of first product (should find products 1-9 as nearest neighbors)
  database.transaction(() -> {
    final var result = database.query("sql",
        "SELECT name, vectorNeighbors('Product[embedding]', embedding, 5) as neighbors FROM Product WHERE name = 'Product_0'");
    assertThat(result.hasNext()).as("Query should return results").isTrue();
    final var doc = result.next();
    final String name = doc.getProperty("name");
    assertThat(name).as("Should get Product_0").isEqualTo("Product_0");
    // The neighbors should include other products from cluster 0-9
    System.out.println("Neighbors of Product_0: " + doc.toJSON());
  });

  // Test 2: Query using vectorNeighbors with arbitrary query vector
  database.transaction(() -> {
    // Create a query vector similar to cluster 1 (second cluster)
    final float[] queryVector = new float[128];
    queryVector[1] = 1.0f; // Similar to products 10-19
    // Use vectorNeighbors to find nearest neighbors
    final var result = database.query("sql",
        "SELECT name, vectorNeighbors('Product[embedding]', ?, 5) as neighbors FROM Product LIMIT 1",
        queryVector);
    assertThat(result.hasNext()).as("Query should return results").isTrue();
    System.out.println("VectorNeighbors result for cluster 1 query: " + result.next().toJSON());
  });

  // Test 3: Test vectorNeighbors function with different k value
  database.transaction(() -> {
    final float[] queryVector = new float[128];
    queryVector[2] = 1.0f; // Similar to products 20-29
    final var result = database.query("sql",
        "SELECT name, vectorNeighbors('Product[embedding]', ?, 10) as neighbors FROM Product LIMIT 1",
        queryVector);
    assertThat(result.hasNext()).as("Query should return results").isTrue();
    final var doc = result.next();
    System.out.println("VectorNeighbors result for cluster 2 query (k=10): " + doc.toJSON());
  });

  // Test 4: Query with specific product and find its nearest neighbors
  database.transaction(() -> {
    final var result = database.query("sql",
        "SELECT name, vectorNeighbors('Product[embedding]', embedding, 3) as neighbors " +
            "FROM Product WHERE name = 'Product_15'");
    assertThat(result.hasNext()).as("Query should return results").isTrue();
    final var doc = result.next();
    System.out.println("Neighbors of Product_15: " + doc.toJSON());
    // Product_15 should be similar to other products in the 10-19 range
    final String productName = doc.getProperty("name");
    assertThat(productName).isEqualTo("Product_15");
  });

  // Test 5: Verify multiple queries work correctly
  database.transaction(() -> {
    final float[] queryVector1 = new float[128];
    queryVector1[0] = 1.0f;
    final var result1 = database.query("sql",
        "SELECT name, vectorNeighbors('Product[embedding]', ?, 3) as neighbors FROM Product LIMIT 1",
        queryVector1);
    assertThat(result1.hasNext()).as("First query should return results").isTrue();
    System.out.println("Query 1 result: " + result1.next().toJSON());

    final float[] queryVector2 = new float[128];
    queryVector2[1] = 1.0f;
    final var result2 = database.query("sql",
        "SELECT name, vectorNeighbors('Product[embedding]', ?, 3) as neighbors FROM Product LIMIT 1",
        queryVector2);
    assertThat(result2.hasNext()).as("Second query should return results").isTrue();
    System.out.println("Query 2 result: " + result2.next().toJSON());
  });

  System.out.println("✓ All SQL vectorNeighbors tests passed!");
}
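The test above prints the neighbors but only asserts that each query returns a row, so it does not check which neighbors come back. As a rough reference, the following sketch computes the exact cosine top-k by brute force over the same 50 vectors (same generation pattern as the insert loop). It is a standalone, hypothetical helper that does not touch ArcadeDB; `BruteForceNeighbors`, `embeddingFor`, and `nearest` are names invented for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical reference helper: exact (brute-force) cosine k-NN, no index involved.
final class BruteForceNeighbors {

  // Rebuilds the embedding the test assigns to Product_i.
  static float[] embeddingFor(final int i) {
    final float[] e = new float[128];
    if (i < 10) {
      e[0] = 1.0f + (i * 0.1f);
      e[1] = 0.1f * i;
    } else if (i < 20) {
      e[0] = 0.1f * (i - 10);
      e[1] = 1.0f + ((i - 10) * 0.1f);
    } else if (i < 30) {
      e[0] = 0.1f * (i - 20);
      e[1] = 0.1f * (i - 20);
      e[2] = 1.0f + ((i - 20) * 0.1f);
    } else {
      for (int j = 0; j < 128; j++) {
        e[j] = (float) Math.sin(i * j * 0.01);
      }
    }
    return e;
  }

  static double cosine(final float[] a, final float[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  // Returns the indices of the k most similar vectors, most similar first.
  static List<Integer> nearest(final float[] query, final int numDocs, final int k) {
    final List<Integer> ids = new ArrayList<>();
    for (int i = 0; i < numDocs; i++) {
      ids.add(i);
    }
    ids.sort(Comparator.comparingDouble((Integer i) -> -cosine(query, embeddingFor(i))));
    return new ArrayList<>(ids.subList(0, k));
  }

  public static void main(final String[] args) {
    // Exact neighbors of Product_0, including Product_0 itself.
    System.out.println(nearest(embeddingFor(0), 50, 5)); // prints [0, 1, 2, 3, 4]
  }
}
```

An approximate index is not guaranteed to return exactly this list, and whether vectorNeighbors includes the query record itself is not something this sketch assumes; it is only a baseline to compare the printed JSON against.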
I left the office 10 minutes ago; I'll test tomorrow!
Maybe it's related to my build?
Every time I update ArcadeDB, I just wipe ./lib/ and replace it with the new one.
Well, shit. @lvca, I just tested on Windows with a fresh build, literally HEAD, run straight from the build folder, with no /lib/ copy-paste.
It still doesn't work. Maybe it's because the nodes are created using Cypher and/or before the index is created?
Is there a way you can upload this database as a zip? Or a test case to reproduce the same content?