
Text normalizer should remove line breaks

Open cdbartholomew opened this issue 1 year ago • 2 comments

When doing an HTML crawl, the text ends up with a lot of line breaks in it. These have no semantic value and waste tokens in both the text chunks and the LLM calls. Here is an example of what we get back when doing a similarity search:

[{\"role\":\"system\",\"content\":\"You are a helpful assistant for the LangStream project. \\nDo not answer questions not related to the LangStream project.\\n\\nA user is going to ask a questions. Refer to these documents \\nwhen answering to their questions. Use them as much as possible\\nwhen answering the question. If you do not know the answer, say so.\\n\\ntext-normaliser\\n \\n\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n text-extractor\\n \\n\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n text-splitter\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n input & output\\n \\n\\n \\n \\n \\n \\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n \\n \\n configuration resources\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n \\n data storage\\n \\n\\n \\n \\n \\n \\n \\n\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n large language models (llms)\\n \\n\\n \\n \\n \\n \\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n \\n \\n instance clusters\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n \\n streaming\\n \\n\\n \\n \\n \\n \\n \\n\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n compute\\n \\n\\n \\n \\n \\n \\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n \\n\\n \\n \\n \\n \\n \\n \\n \\n powered by gitbook\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n \\n\\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n\\n \\n text-normaliser\\n\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n\\n \\n\\n \\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n this is an agent that applies specific transformations on text.\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n\\n example\\n \\n \\n \\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n an example that lowercases the provided contents:\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n\\n \\n\\n \\n - name: "normalise text"\\n \\n\\n 
\\n\\n \\n \\n \\n\\n \\n\\n \\n type: "text-normaliser"\\n \\n\\n \\n\\n \\n \\n \\n\\n \\n\\n \\n configuration:\\n \\n\\n \\n\\n \\n \\n \\n\\n \\n\\n \\n make-lowercase: true\\n \\n\\n \\n\\n \\n \\n \\n\\n \\n\\n \\n trim-spaces: true\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n with the configuration above and an input of "hi there with a trailing space ", the output is hi there with a trailing space.\\ntrim empty spaces from each line of text.\\n \\n\\n \\n\\n \\n \\n ?\\n \\n\\n \\n\\n \\n \\n defaults to a value of ?true?\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n\\n \\n \\n \\n previous\\n \\n\\n \\n\\n \\n query-vector-db\\n \\n\\n \\n\\n \\n \\n \\n next\\n \\n\\n \\n\\n \\n text-extractor\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n\\n \\n\\n \\n \\n \\n \\n last modified 8d ago\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n \\n\\n \\n \\n \\n \\n on this page\\n \\n\\n \\n\\n \\n \\n example\\n \\n\\n \\n\\n \\n \\n topics\\n \\n\\n \\n\\n \\n \\n configuration\\nthe text to use.\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n\\n \\n \\n \\n previous\\n \\n\\n \\n\\n \\n ai-chat-completions\\n \\n\\n \\n\\n \\n \\n \\n next\\n \\n\\n \\n\\n \\n language-detector\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n\\n \\n\\n \\n \\n \\n \\n last modified 14d ago\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n \\n\\n \\n \\n \\n \\n on this page\\n \\n\\n \\n\\n \\n \\n 
example\\n \\n\\n \\n\\n \\n \\n topics\\n \\n\\n \\n\\n \\n \\n configuration\\nconfiguration\\n \\n \\n \\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n\\n \\n\\n \\n\\n \\n \\n \\n label\\n \\n\\n \\n \\n \\n \\n type\\n \\n\\n \\n \\n \\n \\n description\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n \\n chunk-size\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n \\n \\n integer (optional)\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n \\n \\n the number of characters to break a document?s contents into.\\n \\n\\n \\n\\n \\n \\n ?\\n \\n\\n \\n\\n \\n \\n default to a value of 1000 characters.\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n\\n \\n \\n \\n previous\\n \\n\\n \\n\\n \\n text-extractor\\n \\n\\n \\n\\n \\n \\n \\n next - pipeline agents\\n \\n\\n \\n\\n \\n input & output\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n\\n \\n\\n \\n \\n \\n \\n last modified 8d ago\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n \\n\\n \\n \\n \\n \\n on this page\\n \\n\\n \\n\\n \\n \\n example\\n \\n\\n \\n\\n \\n \\n topics\\n \\n\\n \\n\\n \\n \\n configuration\\ncustom agents\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n\\n \\n\\n \\n \\n \\n \\n last modified 18d ago\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n\\n \\n \\n\\n \\n \\n \\n \\n on this page\\n \\n\\n \\n\\n \\n \\n example\\n \\n\\n \\n\\n \\n \\n topics\\n \\n\\n \\n\\n \\n \\n configuration\\n\"},```

cdbartholomew commented on Aug 30 '23 21:08

This is the configuration of my web crawler and normalizer:

```yaml
  - name: "Crawl the WebSite"
    type: "webcrawler-source"
    configuration:
      seed-urls: "{{{globals.seedUrls}}}"
      allowed-domains: "{{globals.allowedUrls}}"
      min-time-between-requests: 500
      max-unflushed-pages: 100
      user-agent: "langstream.ai-webcrawler/1.0"
      bucketName: "{{{secrets.s3-credentials.bucket-name}}}"
      endpoint: "{{{secrets.s3-credentials.endpoint}}}"
      access-key: "{{{secrets.s3-credentials.access-key}}}"
      secret-key: "{{{secrets.s3-credentials.secret}}}"
      region: "{{{secrets.s3-credentials.region}}}"
      idle-time: 5
  - name: "Extract text"
    type: "text-extractor"
  - name: "Normalise text"
    type: "text-normaliser"
    configuration:
      make-lowercase: true
      trim-spaces: true
```
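
For illustration, the behaviour requested in this issue could be sketched outside LangStream roughly as follows. This is a minimal Python sketch, not the `text-normaliser` implementation: the function name `collapse_whitespace` and the sample string are hypothetical, and it assumes that whitespace-only lines and runs of spaces carry no meaning in the crawled text.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Hypothetical helper (not LangStream code): drop blank lines and squeeze spaces."""
    # Trim every line, then discard the lines that are empty after trimming.
    lines = [line.strip() for line in text.splitlines()]
    non_empty = [line for line in lines if line]
    # Join the surviving lines with one space and squeeze any remaining whitespace runs.
    return re.sub(r"\s+", " ", " ".join(non_empty))


# A short sample shaped like the crawled output quoted earlier in this issue.
sample = "text-normaliser\n \n\n \n\n  text-extractor\n \n\n  text-splitter\n"
print(collapse_whitespace(sample))
```

Joining with a single space is one possible choice. Joining the non-empty lines with `"\n"` instead would keep one line break per original line, which may matter if a later splitting step relies on line boundaries.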

cdbartholomew commented on Aug 30 '23 21:08

Can you please try again? We committed a few improvements in this area.

eolivelli commented on Sep 20 '23 10:09