langstream
Text normalizer should remove line breaks
When doing an HTML crawl, the text ends up with a lot of line breaks in it. These have no semantic value and waste tokens in both the text chunks and in the LLM calls. Here is an example of what we get back when doing a similarity search:
```
[{\"role\":\"system\",\"content\":\"You are a helpful assistant for the LangStream project. \\nDo not answer questions not related to the LangStream project.\\n\\nA user is going to ask a questions. Refer to these documents \\nwhen answering to their questions. Use them as much as possible\\nwhen answering the question. If you do not know the answer, say so.\\n\\ntext-normaliser\\n \\n\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n text-extractor\\n \\n\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n text-splitter\\n \\n\\n [...]\\n this is an agent that applies specific transformations on text.\\n [...]\\n an example that lowercases the provided contents:\\n [...]\\n - name: "normalise text"\\n [...]\\n type: "text-normaliser"\\n [...]\\n configuration:\\n [...]\\n make-lowercase: true\\n [...]\\n trim-spaces: true\\n [...]\\n with the configuration above and an input of "hi there with a trailing space ", the output is hi there with a trailing space.\\n [...]\"},
```

(The sections elided with `[...]` above are almost entirely runs of escaped whitespace plus GitBook navigation residue: "previous", "next", "on this page", "last modified 8d ago", "powered by gitbook", and so on.)
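To make the token waste concrete, here is a quick way to measure how much of such a chunk is pure whitespace. The `chunk` value is a short hypothetical stand-in for the dump above, not the real stored document:

```python
# Rough check of how much of a retrieved chunk is pure whitespace.
# `chunk` is a short stand-in for the crawl output shown above.
chunk = "text-normaliser \n \n\n \n\n \n \n \n \n \n text-extractor \n \n\n"
ws = sum(c.isspace() for c in chunk)
print(f"{ws}/{len(chunk)} characters are whitespace")
```

On the real chunks the ratio is far worse, since entire navigation sections survive as whitespace-only lines.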
This is the configuration of my web crawler and normalizer:
```
  - name: "Crawl the WebSite"
    type: "webcrawler-source"
    configuration:
      seed-urls: "{{{globals.seedUrls}}}"
      allowed-domains: "{{globals.allowedUrls}}"
      min-time-between-requests: 500
      max-unflushed-pages: 100
      user-agent: "langstream.ai-webcrawler/1.0"
      bucketName: "{{{secrets.s3-credentials.bucket-name}}}"
      endpoint: "{{{secrets.s3-credentials.endpoint}}}"
      access-key: "{{{secrets.s3-credentials.access-key}}}"
      secret-key: "{{{secrets.s3-credentials.secret}}}"
      region: "{{{secrets.s3-credentials.region}}}"
      idle-time: 5
  - name: "Extract text"
    type: "text-extractor"
  - name: "Normalise text"
    type: "text-normaliser"
    configuration:
      make-lowercase: true
      trim-spaces: true
```
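The behavior I would expect from the normaliser can be sketched in a few lines of Python. This is only an illustration of the desired effect (collapsing any run of whitespace, including line breaks, to a single space), not the actual agent implementation:

```python
import re

def collapse_whitespace(text: str) -> str:
    """Collapse runs of whitespace (including newlines) to a single space.

    Line breaks in crawled HTML carry no semantic value, so reducing
    them to single spaces saves tokens without losing meaning.
    """
    return re.sub(r"\s+", " ", text).strip()

print(collapse_whitespace("text-normaliser \n\n \n\n   text-extractor\n"))
```

Applied to the dump above, this would shrink each chunk dramatically before it is split and embedded.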
Can you please try again? We committed a few improvements in this area.