Import of the fastText vector embeddings fails due to large transaction sizes
I am using version v25.6.1 and tried to import vector embeddings by following the documentation, but ran into an issue. I created a new database and tried to import the fastText embeddings from a local directory:
import database file:///arcadedb/cc.en.300.vec.gz with distanceFunction=cosine, m=8, ef=64, efConstruction=64, vertexType=Word, edgeType=Proximity, vectorProperty=Float, idProperty=name;
Here is the stack trace of the exception:
com.arcadedb.exception.CommandExecutionException: Error on importing database
    at com.arcadedb.query.sql.parser.ImportDatabaseStatement.executeSimple(ImportDatabaseStatement.java:73)
    at com.arcadedb.query.sql.executor.SingleOpExecutionPlan.executeInternal(SingleOpExecutionPlan.java:92)
    at com.arcadedb.query.sql.parser.SimpleExecStatement.execute(SimpleExecStatement.java:56)
    at com.arcadedb.query.sql.parser.Statement.execute(Statement.java:65)
    at com.arcadedb.query.sql.SQLQueryEngine.command(SQLQueryEngine.java:119)
    at com.arcadedb.database.LocalDatabase.command(LocalDatabase.java:1338)
    at com.arcadedb.console.Console.executeSQL(Console.java:632)
    at com.arcadedb.console.Console.execute(Console.java:294)
    at com.arcadedb.console.Console.parse(Console.java:787)
    at com.arcadedb.console.Console.interactiveMode(Console.java:150)
    at com.arcadedb.console.Console.execute(Console.java:207)
    at com.arcadedb.console.Console.main(Console.java:167)
Caused by: com.arcadedb.integration.importer.ImportException: Error on parsing source 'file:///arcadedb/cc.en.300.vec.gz (compressed=true size=1325960915)'
    at com.arcadedb.integration.importer.Importer.load(Importer.java:68)
    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
    at java.base/java.lang.reflect.Method.invoke(Method.java:580)
    at com.arcadedb.query.sql.parser.ImportDatabaseStatement.executeSimple(ImportDatabaseStatement.java:62)
    ... 11 more
Caused by: com.arcadedb.integration.importer.ImportException: Error on importing Word2Vec datasource
    at com.arcadedb.integration.importer.format.Word2VecImporterFormat.load(Word2VecImporterFormat.java:55)
    at com.arcadedb.integration.importer.Importer.loadFromSource(Importer.java:107)
    at com.arcadedb.integration.importer.Importer.load(Importer.java:53)
    ... 14 more
Caused by: com.arcadedb.exception.TransactionException: Transaction error on commit
    at com.arcadedb.database.TransactionContext.commit1stPhase(TransactionContext.java:652)
    at com.arcadedb.database.TransactionContext.commit(TransactionContext.java:127)
    at com.arcadedb.database.LocalDatabase.lambda$commit$2(LocalDatabase.java:397)
    at com.arcadedb.database.LocalDatabase.executeInReadLock(LocalDatabase.java:1432)
    at com.arcadedb.database.LocalDatabase.commit(LocalDatabase.java:392)
    at com.arcadedb.index.vector.HnswVectorIndex.build(HnswVectorIndex.java:1002)
    at com.arcadedb.schema.VectorIndexBuilder.create(VectorIndexBuilder.java:122)
    at com.arcadedb.schema.VectorIndexBuilder.create(VectorIndexBuilder.java:42)
    at com.arcadedb.integration.importer.vector.TextEmbeddingsImporter.run(TextEmbeddingsImporter.java:150)
    at com.arcadedb.integration.importer.format.Word2VecImporterFormat.load(Word2VecImporterFormat.java:52)
    ... 16 more
Caused by: com.arcadedb.exception.TransactionException: Transaction buffer bigger than 2.00GB. Split the big transaction in smaller transactions. This transaction will be roll backed
    at com.arcadedb.engine.WALFile.writeTransactionToBuffer(WALFile.java:231)
    at com.arcadedb.engine.TransactionManager.createTransactionBuffer(TransactionManager.java:151)
    at com.arcadedb.database.TransactionContext.commit1stPhase(TransactionContext.java:640)
    ... 25 more
After briefly looking through the code, I found that the transaction size is fixed and not configurable. In the create method of VectorIndexBuilder the batch size is set to a fixed value:
index.build(origin, LocalSchema.BUILD_TX_BATCH_SIZE, vertexCreationCallback, callback);
After decreasing that batch size to a smaller value, I was able to get the import to work. Of course, I may be overlooking something, so any help would be appreciated. I know that there is interest in adding support for other vector indexing approaches/implementations through jvector, but I haven't seen any updates on that.
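To illustrate why a smaller batch size helps, here is a minimal, generic sketch of splitting one large job into several smaller commits. It does not use the ArcadeDB API: the class `BatchedCommitSketch`, the method `processInBatches`, and the `commitBatch` callback are hypothetical names made up for this example, and the callback only stands in for whatever begin/commit logic wraps each batch in the real index build.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchedCommitSketch {

  /**
   * Hands the items to commitBatch in chunks of at most batchSize and
   * returns the number of chunks (i.e. commits) that were issued.
   * commitBatch is a hypothetical stand-in for "begin, write, commit".
   */
  public static <T> int processInBatches(Iterable<T> items, int batchSize, Consumer<List<T>> commitBatch) {
    if (batchSize <= 0)
      throw new IllegalArgumentException("batchSize must be positive");

    int commits = 0;
    List<T> pending = new ArrayList<>();
    for (T item : items) {
      pending.add(item);
      if (pending.size() == batchSize) {
        // The batch is full: commit it so the pending work never grows past batchSize items.
        commitBatch.accept(pending);
        commits++;
        pending = new ArrayList<>();
      }
    }
    // Commit whatever is left over after the last full batch.
    if (!pending.isEmpty()) {
      commitBatch.accept(pending);
      commits++;
    }
    return commits;
  }

  public static void main(String[] args) {
    List<Integer> records = new ArrayList<>();
    for (int i = 0; i < 10; i++)
      records.add(i);

    List<Integer> batchSizes = new ArrayList<>();
    int commits = processInBatches(records, 4, batch -> batchSizes.add(batch.size()));

    // prints "3 commits, sizes [4, 4, 2]"
    System.out.println(commits + " commits, sizes " + batchSizes);
  }
}
```

With the batch size exposed as a parameter like this, it could be lowered for data sets whose records are large (such as 300-dimensional vectors), so that each commit stays below the transaction buffer limit.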
@odysseaspenta please use the new LSM Vector instead; it's available in the main branch.