langchainjs

Astra DB - collection bug

Open jinchi2013 opened this issue 1 year ago • 11 comments

Checked other resources

  • [X] I added a very descriptive title to this issue.
  • [X] I searched the LangChain.js documentation with the integrated search.
  • [X] I used the GitHub search to find a similar question and didn't find it.
  • [X] I am sure that this is a bug in LangChain.js rather than my code.
  • [X] The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

import {
  AstraDBVectorStore,
  AstraLibArgs,
} from '@langchain/community/vectorstores/astradb'
import { formatDocumentsAsString } from 'langchain/util/document'
import { GoogleVertexAIEmbeddings } from '@langchain/community/embeddings/googlevertexai'

const VertexAIEmbeddings = new GoogleVertexAIEmbeddings()

const getAstraDBRetriever = async () => {
  const astraConfig: AstraLibArgs = {
    token: '{{token}}',
    endpoint: '{{endpoint}}',
    collection: '{{collection}}',
  }

  const store = await AstraDBVectorStore.fromExistingIndex(VertexAIEmbeddings, astraConfig)

  return store.asRetriever({
    k: topK,
    searchType: 'similarity',
    metadata,
  }).pipe(formatDocumentsAsString)
}

Error Message and Stack Trace (if applicable)

"Collection 'default_keyspace.my_testing_collection' already exists CollectionAlreadyExistsError: Collection 'default_keyspace.my_testing_collection' already exists\n at Db.createCollection (/Users/dev/github/web/common/temp/node_modules/.pnpm/@[email protected]/node_modules/@datastax/astra-db-ts/dist/data-api/db.js:343:23)\n at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n at async AstraDBVectorStore.initialize (/Users/dev/github/web/common/temp/node_modules/.pnpm/@[email protected]_@[email protected][email protected]/node_modules/@langchain/community/dist/vectorstores/astradb.cjs:105:9)\n at async AstraDBVectorStore.fromExistingIndex (/Users/dev/github/web/common/temp/node_modules/.pnpm/@[email protected]_@[email protected][email protected]/node_modules/@langchain/community/dist/vectorstores/astradb.cjs:271:9)"

Description

I'm trying to use Astra as the vector store, but it cannot be initialized. The bug comes from this line: https://github.com/langchain-ai/langchainjs/blob/main/libs/langchain-community/src/vectorstores/astradb.ts#L94 . When the collection already exists, the code throws an error instead of moving on to the next line to connect to the collection, which stops the connection process.

System Info

"@langchain/community": "~0.0.49",
"@datastax/astra-db-ts": "~1.0.1",

jinchi2013, Apr 17 '24 23:04

To address the CollectionAlreadyExistsError when initializing Astra as the vector store in LangChain.js, you'll need to modify the initialize method in the AstraDBVectorStore class. The modification involves adding a check to see if the collection already exists before attempting to create it. If the collection exists, the method should connect to it instead of trying to create a new one. This can be done by catching the error that indicates the collection already exists and then connecting to the existing collection.

Here's a suggested modification for the initialize method:

async initialize(): Promise<void> {
  try {
    await this.astraDBClient.createCollection(
      this.collectionName,
      this.collectionOptions
    );
  } catch (error) {
    if (error instanceof CollectionAlreadyExistsError) {
      console.debug("Collection already exists, connecting to it.");
    } else {
      throw error; // Rethrow if it's not the specific error we're looking for
    }
  }
  this.collection = await this.astraDBClient.collection(this.collectionName);
  console.debug("Connected to Astra DB collection");
}

Ensure that CollectionAlreadyExistsError is correctly identified based on the actual error handling provided by @datastax/astra-db-ts. Adjust the catch block accordingly if the error class or mechanism differs.

This approach should help you bypass the error and connect to an existing collection without attempting to recreate it.
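
The control flow of that suggestion can be tried in isolation with the minimal sketch below. Everything in it is a hypothetical stand-in written for illustration: FakeDb, connectOrCreate, and the local CollectionAlreadyExistsError class only imitate the behaviour described in this issue and are not the @datastax/astra-db-ts API. The calls are synchronous for brevity, whereas the real client calls are asynchronous.

```typescript
// Hypothetical stand-in for the error named in the stack trace above.
class CollectionAlreadyExistsError extends Error {
  constructor(collectionName: string) {
    super(`Collection '${collectionName}' already exists`);
    // Keep `instanceof` working even when compiled to an ES5 target.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "CollectionAlreadyExistsError";
  }
}

// Hypothetical in-memory database that refuses to create a collection twice.
class FakeDb {
  private readonly names: Set<string>;

  constructor(existing: string[] = []) {
    this.names = new Set(existing);
  }

  // Throws when the collection is already present, as in the reported error.
  createCollection(name: string): void {
    if (this.names.has(name)) {
      throw new CollectionAlreadyExistsError(name);
    }
    this.names.add(name);
  }

  // Returns a handle to the named collection.
  collection(name: string): { name: string } {
    return { name };
  }
}

// Try to create the collection; if it already exists, fall through and connect.
function connectOrCreate(
  db: FakeDb,
  name: string
): { name: string; created: boolean } {
  let created = true;
  try {
    db.createCollection(name);
  } catch (error) {
    if (!(error instanceof CollectionAlreadyExistsError)) {
      throw error; // any other failure is still fatal
    }
    created = false; // the collection was already there, so reuse it
  }
  const handle = db.collection(name);
  return { name: handle.name, created };
}
```

Calling connectOrCreate on a FakeDb that already holds the collection returns a handle with created set to false instead of throwing, which is the behaviour the issue asks for.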

dosubot[bot], Apr 17 '24 23:04

Hey @jinchi2013, you can bypass the collection check by setting checkExists: false:

const astraConfig: AstraLibArgs = {
  token: '{{token}}',
  endpoint: '{{endpoint}}',
  collection: '{{collection}}',
  collectionOptions: {
    checkExists: false,
  },
}

Also note that you're creating the collection without specifying the vector options, so it won't be able to store embeddings. You can see the collectionOptions and a full example in the LangChainJS documentation for AstraDB.

nicoloboschi, Apr 18 '24 06:04

@nicoloboschi Thank you. I don't want to create a new collection; I want to use the existing one. I tested checkExists: false, and now I'm getting a new error:

Invalid collection name: provided collection ({{ collection }}) already exists with different collection options DataAPIResponseError: Invalid collection name: provided collection ({{ collection }}) already exists with different collection options

jinchi2013, Apr 18 '24 14:04

@jinchi2013 It's likely that the collection options have changed since the collection was first created. Since you're using VertexAIEmbeddings, you need to set the dimension in collectionOptions and enable the vector column. To do that, change the code like this:

const astraConfig: AstraLibArgs = {
  token: '{{token}}',
  endpoint: '{{endpoint}}',
  collection: '{{collection}}',
  collectionOptions: {
    checkExists: false,
    vector: {
      dimension: 768, // number of dimensions produced by textembedding-gecko
      metric: "cosine",
    },
  },
}

I'd suggest you start over (delete the table from the UI) and run the code again.
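
One way to read the "already exists with different collection options" error is that a repeated create call is only accepted when the requested options equal the ones the collection was created with. The sketch below models that reading. It is a simplified assumption based on the error text in this thread, not the real Data API behaviour, and FakeServer, sameOptions, and the two interfaces are hypothetical names.

```typescript
// Hypothetical shapes mirroring the collectionOptions shown in this thread.
interface VectorOptions {
  dimension: number;
  metric: string;
}

interface CollectionOptions {
  vector?: VectorOptions;
}

// Two option sets match when both lack vector settings or both agree on them.
function sameOptions(a: CollectionOptions, b: CollectionOptions): boolean {
  if (a.vector === undefined || b.vector === undefined) {
    return a.vector === b.vector;
  }
  return (
    a.vector.dimension === b.vector.dimension &&
    a.vector.metric === b.vector.metric
  );
}

// Hypothetical server: creating an existing collection is a no-op only when
// the requested options equal the stored ones.
class FakeServer {
  private readonly stored = new Map<string, CollectionOptions>();

  createCollection(name: string, options: CollectionOptions): void {
    const existing = this.stored.get(name);
    if (existing === undefined) {
      this.stored.set(name, options);
      return;
    }
    if (!sameOptions(existing, options)) {
      throw new Error(
        `Invalid collection name: provided collection (${name}) ` +
          "already exists with different collection options"
      );
    }
    // Same options as before: nothing to do.
  }
}
```

Under this model, a collection created with a 768-dimension cosine vector accepts a second create call with the same settings, while a call that omits the vector block is rejected.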

nicoloboschi, Apr 18 '24 14:04

@nicoloboschi I didn't create this collection. It already exists and is also used for other purposes, so I don't think deleting and recreating it is an option here. Is there no way for me to access an existing collection?

jinchi2013, Apr 18 '24 14:04

@jinchi2013 it looks like you need to match up the embeddings model that was used to create the vector store and then the collection options will match. Try setting the embeddings model specifically like this:

VertexAIEmbeddings(model_name="textembedding-gecko")

CharnaParkey, Apr 18 '24 22:04

@CharnaParkey VertexAI uses textembedding-gecko as its default model. See here: https://github.com/langchain-ai/langchainjs/blob/main/libs/langchain-community/src/embeddings/googlevertexai.ts#L76

jinchi2013, Apr 19 '24 03:04

@jinchi2013 The collection was created with the wrong configuration and cannot be used with VertexAIEmbeddings. My suggestion is to ask the colleague who created the table to use the collectionOptions above.

They can use whatever method is supported, but the collectionOptions have to be the same as the ones I posted in my earlier comment.

After that, you can safely run the suggested code.

nicoloboschi, Apr 19 '24 19:04

I can use the Python version of LangChain to access the same collection. What makes the JS version so special?

jinchi2013, Apr 19 '24 19:04

There is no issue with the Python code below:

embeddings = VertexAIEmbeddings(model_name="textembedding-gecko")

vector_store = AstraDBVectorStore(
  token=os.getenv("TOKEN"),
  api_endpoint=os.getenv("ENDPOINT"),
  collection_name="{{ collection_name }}",
  embedding=embeddings
)

vector_store.similarity_search("search something", k=3)

jinchi2013, Apr 19 '24 19:04

This is now resolved by the following pull requests:

  • https://github.com/langchain-ai/langchainjs/pull/5142
  • https://github.com/langchain-ai/langchainjs/pull/5170
  • https://github.com/langchain-ai/langchainjs/pull/5185

nicoloboschi, May 14 '24 08:05