ask-astro
Need to specify tokenization for content
https://github.com/astronomer/ask-astro/blob/c45487c7f12a9424dbe885580c687e35e30b7de4/airflow/include/data/schema.json#L54
Without a tokenization scheme specified, ingest defaults to word, as per https://weaviate.io/developers/weaviate/config-refs/schema#property-tokenization. This splits snake_case configuration parameters and environment variables, because the underscore is treated like whitespace.
Example as per https://github.com/weaviate/weaviate/blob/764935fe4b576c87750d6a16ea20fd6c349b20b8/adapters/repos/db/helpers/tokenizer.go#L67
package main

import "fmt"

// The tokenize* functions below are copied from the tokenizer.go linked above.
func main() {
	in := "THIS is my_env_variable"
	fmt.Print("\nwhitespace")
	fmt.Print(tokenizeWhitespace(in))
	fmt.Print("\nlowercase")
	fmt.Print(tokenizeLowercase(in))
	fmt.Print("\nword")
	fmt.Print(tokenizeWord(in))
	fmt.Print("\nwildcards")
	fmt.Print(tokenizeWordWithWildcards(in))
}
Results in...
whitespace[THIS is my_env_variable]
lowercase[this is my_env_variable]
word[this is my env variable]
wildcards[this is my env variable]
To avoid splitting snake_case words and losing the case of camelCase parameters, we need to switch the property to whitespace tokenization.