
Need to specify tokenization for content

Open mpgreg opened this issue 1 year ago • 1 comments

https://github.com/astronomer/ask-astro/blob/c45487c7f12a9424dbe885580c687e35e30b7de4/airflow/include/data/schema.json#L54

Without an explicit tokenization scheme, ingest defaults to word, as per https://weaviate.io/developers/weaviate/config-refs/schema#property-tokenization. The word scheme lowercases the text and splits on every non-alphanumeric character, so snake-case configuration parameters and environment variables are split at each underscore, as if it were whitespace.

Example as per https://github.com/weaviate/weaviate/blob/764935fe4b576c87750d6a16ea20fd6c349b20b8/adapters/repos/db/helpers/tokenizer.go#L67

package main

import "fmt"

// The tokenize* functions are assumed to be copied from the tokenizer.go linked above.
func main() {
	in := "THIS is my_env_variable"

	fmt.Print("\nwhitespace")
	fmt.Print(tokenizeWhitespace(in))
	fmt.Print("\nlowercase")
	fmt.Print(tokenizeLowercase(in))
	fmt.Print("\nword")
	fmt.Print(tokenizeWord(in))
	fmt.Print("\nwildcards")
	fmt.Print(tokenizeWordWithWildcards(in))
}

Results in...

whitespace[THIS is my_env_variable]
lowercase[this is my_env_variable]
word[this is my env variable]
wildcards[this is my env variable]

To avoid splitting snake-case words and to preserve the casing of camel-case parameters, we need to switch to whitespace tokenization.

mpgreg avatar Nov 13 '23 12:11 mpgreg