langchainjs

Weaviate Vector Store: Error "no graphql provider present, this is most likely because no schema is present. Import a schema first!"

Open heresandyboy opened this issue 1 year ago • 2 comments

I am trying to get started with Weaviate via LangChainJS for the first time, and I am migrating my code from Pinecone.

In the code below I create some documents from a PDF and use OpenAI embeddings to import them into Weaviate. I am running Weaviate locally in Docker, using a basic default docker-compose file from the generator on their docs site.

I get no errors at all from WeaviateStore.fromDocuments, but I cannot retrieve any data; it seems no schema was generated when the data was imported.

The error I get from store.similaritySearch is: no graphql provider present, this is most likely because no schema is present. Import a schema first

'Error in similaritySearch' Error: GraphQL Error (Code: 422): 
{"response":
{"error":[{"message":"no graphql provider present, this is most likely because no schema is present. 
Import a schema first!"}],"status":422,"headers":{}},
"request":{"query":"{Get{pdf-test(nearVector:{vector:
[-0.025541332,-0.013953761,0.01518418,-0.012939679,-0.014792069,0.007930117,-0.023
...

My assumption is that schemas are auto-generated from the data I put into Weaviate, see: https://weaviate.io/developers/weaviate/configuration/schema-configuration#auto-schema
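One way to test that assumption is to inspect the schema after ingestion and see whether the expected class exists. The following is a minimal sketch, not part of the original report: `SchemaLike` and `hasClass` are hypothetical names, and the schema objects are hand-written stand-ins. With weaviate-ts-client the real schema can, as far as I know, be fetched with `client.schema.getter().do()`; treat that call as an assumption and check it against the client version in use.

```typescript
// Minimal shape of a Weaviate schema response; only the fields used here.
// (Hypothetical type for illustration, not taken from the client library.)
interface SchemaLike {
  classes?: { class?: string }[];
}

// Returns true when the schema contains a class with exactly this name.
function hasClass(schema: SchemaLike, className: string): boolean {
  return (schema.classes ?? []).some((c) => c.class === className);
}

// Hand-written example schemas (not real server output).
const emptySchema: SchemaLike = { classes: [] };
const populatedSchema: SchemaLike = { classes: [{ class: "PdfTest" }] };

console.log(hasClass(emptySchema, "PdfTest")); // false
console.log(hasClass(populatedSchema, "PdfTest")); // true
```

If the class is missing after `fromDocuments` returns, the import itself most likely failed, which matches the "no schema is present" error.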

Here is my code, if anyone can help me figure out what I am doing wrong:

import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { WeaviateStore } from 'langchain/vectorstores/weaviate';
import { client } from '@/utils/weaviate-client';
import { CustomPDFLoader } from '@/utils/customPDFLoader';
import { WEAVIATE_INDEX_NAME } from '@/config/weaviate';
import { DirectoryLoader } from 'langchain/document_loaders/fs/directory';

/* Name of directory to retrieve your files from */
const filePath = 'public/docs';

export const run = async () => {
  try {
    /*load raw docs from the all files in the directory */
    const directoryLoader = new DirectoryLoader(filePath, {
      '.pdf': (path) => new CustomPDFLoader(path),
    });

    // const loader = new PDFLoader(filePath);
    const rawDocs = await directoryLoader.load();

    /* Split text into chunks */
    const textSplitter = new RecursiveCharacterTextSplitter({
      chunkSize: 1000,
      chunkOverlap: 200,
    });

    const docs = await textSplitter.splitDocuments(rawDocs);
    console.log('split docs', docs);

    console.log('creating vector store...');
    /*create and store the embeddings in the vectorStore*/
    const embeddings = new OpenAIEmbeddings();

    //embed the PDF documents
    const store = await WeaviateStore.fromDocuments(docs, embeddings, {
      client,
      indexName: WEAVIATE_INDEX_NAME,
      textKey: 'text',
      metadataKeys: ['source', 'pdf_numpages', 'loc'],
    });

    const results = await store.similaritySearch('broadband', 1);
    console.log(results);
  } catch (error) {
    console.log('error', error);
    throw new Error('Failed to ingest your data');
  }
};

(async () => {
  await run();
  console.log('ingestion complete');
})();

Here is my docker compose if it helps:

---
version: '3.4'
services:
  weaviate:
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    image: semitechnologies/weaviate:1.18.3
    ports:
    - 8080:8080
    restart: on-failure:0
    environment:
      OPENAI_APIKEY: **REDACTED**
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'text2vec-openai'
      ENABLE_MODULES: 'text2vec-openai'
      CLUSTER_HOSTNAME: 'node1'
      AUTOSCHEMA_ENABLED: 'true'
    volumes:
      - ./weaviate:/var/lib/weaviate
...

Here are the relevant env vars:

WEAVIATE_INDEX_NAME=pdf-test
WEAVIATE_SCHEME=http
WEAVIATE_HOST=127.0.0.1:8080

heresandyboy avatar Apr 21 '23 15:04 heresandyboy

With some help from the Weaviate community support Slack channel, I have managed to get this to work. However, I think there are a few issues that need to be resolved.

In short, data was not actually being loaded because several requirements were not met: class naming restrictions, metadata naming restrictions, and objects not being allowed as metadata values. There are no errors to suggest any of this is happening.

  1. We receive no errors in langchainjs if data fails to load. For example, WeaviateStore.fromDocuments may fail for any of these reasons, but there is no feedback whatsoever when it does, neither from langchain nor in the logs of the Weaviate container.

  2. Unlike every other vector store I have tried recently, Weaviate has a restrictive naming policy, with one rule for index/class names and a different rule for metadata. Metadata field names allow characters that are not allowed in index/class names, and metadata values cannot be objects.

    a) Class/index name regex: /^[A-Z][_0-9A-Za-z]*$/ - in my case all my indexes had a - char in them and were lower case. I had to change pdf-test to PdfTest to get it to work.

    b) Metadata property name regex: /[_A-Za-z][_0-9A-Za-z]*/ - my property names were fine, e.g. pdf_numpages, but I found I had to include all my metadata on the first ingestion; unlike Pinecone, I could not just add new metadata later on. I had to add chat_history on first ingestion, whereas with Pinecone I was adding it later on in my app via WeaviateStore.fromExistingIndex.

    c) Objects are not allowed in metadata, so Record<string, any> is not supported. This means the default output of a DocumentLoader, which can include metadata.loc.lines.from and metadata.loc.lines.to in a nested object, needs to be flattened before ingestion.

    d) Cannot have a metadata prop name of id
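The naming rules in a), b) and d) can be checked before ingestion so that a bad name fails loudly instead of silently. This is a minimal sketch under a few assumptions: the helper names are hypothetical, the property regex is anchored with `^` and `$` (the regex quoted above is not anchored), and `toClassName` is just one possible way to turn a name like pdf-test into PdfTest.

```typescript
// Class/index name rule as quoted in the thread.
const CLASS_NAME_RE = /^[A-Z][_0-9A-Za-z]*$/;
// Property name rule from the thread, anchored here (an assumption).
const PROPERTY_NAME_RE = /^[_A-Za-z][_0-9A-Za-z]*$/;

function isValidClassName(name: string): boolean {
  return CLASS_NAME_RE.test(name);
}

// "id" is rejected explicitly, per point d).
function isValidPropertyName(name: string): boolean {
  return name !== "id" && PROPERTY_NAME_RE.test(name);
}

// Converts e.g. "pdf-test" to "PdfTest" by capitalising each alphanumeric part.
// Throws if the result still does not satisfy the class name rule.
function toClassName(raw: string): string {
  const name = raw
    .split(/[^0-9A-Za-z]+/)
    .filter((part) => part.length > 0)
    .map((part) => part.charAt(0).toUpperCase() + part.slice(1))
    .join("");
  if (!isValidClassName(name)) {
    throw new Error(`Cannot derive a valid class name from "${raw}"`);
  }
  return name;
}

console.log(isValidClassName("pdf-test")); // false
console.log(toClassName("pdf-test")); // PdfTest
console.log(isValidPropertyName("pdf_numpages")); // true
console.log(isValidPropertyName("id")); // false
```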

Some of this information can be found in the Weaviate docs here: https://weaviate.io/developers/weaviate/configuration/schema-configuration
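The flattening described in c) could look roughly like the sketch below. `flattenMetadata` is a hypothetical helper, not part of langchainjs, and joining nested keys with `_` is an assumption chosen because underscores are allowed in property names. Arrays are serialised to JSON strings here, which is only one possible choice.

```typescript
type Primitive = string | number | boolean | null;

// Recursively flattens nested objects into a single-level record,
// e.g. { loc: { lines: { from: 1 } } } becomes { loc_lines_from: 1 }.
function flattenMetadata(
  metadata: Record<string, unknown>,
  prefix = ""
): Record<string, Primitive> {
  const flat: Record<string, Primitive> = {};
  for (const [key, value] of Object.entries(metadata)) {
    const name = prefix ? `${prefix}_${key}` : key;
    if (value === undefined) {
      continue; // nothing to store
    }
    if (
      value === null ||
      typeof value === "string" ||
      typeof value === "number" ||
      typeof value === "boolean"
    ) {
      flat[name] = value;
    } else if (Array.isArray(value)) {
      flat[name] = JSON.stringify(value); // arrays kept as a JSON string
    } else if (typeof value === "object") {
      Object.assign(flat, flattenMetadata(value as Record<string, unknown>, name));
    } else {
      flat[name] = String(value); // any other value type is stringified
    }
  }
  return flat;
}

const flattened = flattenMetadata({
  source: "a.pdf",
  loc: { lines: { from: 1, to: 10 } },
});
console.log(flattened);
// { source: 'a.pdf', loc_lines_from: 1, loc_lines_to: 10 }
```

With this approach the metadataKeys passed to the store would need to list the flattened names (for example loc_lines_from) rather than loc.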

@JHeidinga @nfcampos I would appreciate your input on this

heresandyboy avatar Apr 25 '23 08:04 heresandyboy

Hi Andy,

I think you are thorough and completely right. When the objects endpoint of Weaviate is used, one should manually check for errors. In addition, when the OpenAI vectorisation module is enabled, rate limiting should be applied as well:

private async persistObjects(batch: WeaviateObject[], rateLimitDelay = 1000) {
  const chunkSize = 50;
  const data: WeaviateObject[] = [];
  for (let i = 0; i < batch.length; i += chunkSize) {
    const chunk = batch.slice(i, i + chunkSize);
    // do whatever
    const result = await this.client.batch
      .objectsBatcher()
      .withConsistencyLevel("ALL")
      .withObjects(...chunk)
      .do();
    // the batch call does not throw on per-object failures, so check each entry
    for (const entry of result) {
      if (entry?.result?.errors) {
        throw new Error(JSON.stringify(entry?.result?.errors?.error, null, 2));
      }
    }
    data.push(...result);
    if (rateLimitDelay > 0) {
      console.log("Waiting for rate limit delay...");
      // wait for the configured delay rather than a hard-coded 1000 ms
      await this.rateLimitDelay(rateLimitDelay);
    }
  }
  return data;
}
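The snippet above calls this.rateLimitDelay(...) without showing it. A minimal sketch of what such a helper might be, assuming it is just a promise-based sleep, together with a standalone version of the chunking step; `sleep` and `chunk` are hypothetical names, not taken from langchainjs or the Weaviate client.

```typescript
// Resolves after roughly `ms` milliseconds; a stand-in for rateLimitDelay.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Splits an array into consecutive slices of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size <= 0) {
    throw new Error("size must be positive");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

console.log(chunk([1, 2, 3, 4, 5], 2)); // [ [ 1, 2 ], [ 3, 4 ], [ 5 ] ]
```

Pausing between chunks, as in persistObjects, keeps the request rate to the vectorisation API down at the cost of slower ingestion.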

JHeidinga avatar Apr 25 '23 13:04 JHeidinga