contextive FR: Umlauts support

Usually you do not use umlauts in code, but in the descriptions of terms they might be appropriate.

So in code you could have something like Ausloeser but, similar to the plural detection, this should refer to Auslöser in the glossary.

Mar 27 '25 09:03 sschneider-ihre-pvs

Interesting idea - I can see the value. I will take a look at how oe could be considered equivalent to ö for the purposes of displaying a hover.

In the meantime, as a workaround, you could define an alias, e.g.

   - name: Auslöser
     aliases:
       - Ausloeser

which would, I think, give the effect you're looking for.

Mar 27 '25 10:03 chrissimon-au

Here is how you would do it in Java:

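import java.text.Normalizer;
import java.util.List;

// Decomposes each character (NFKD), then strips all combining marks.
// ß and ẞ have no decomposition, so they are mapped to "s" explicitly.
String normalize(String s) {
  return Normalizer.normalize(s, Normalizer.Form.NFKD)
      .replaceAll("\\p{M}", "")
      .replaceAll("ß", "s")
      .replaceAll("ẞ", "s");
}

// Instance main method in an implicit class (a preview feature before JDK 25).
void main() {
  var tests = List.of("Hallo", "Übergrößengeschäft", "mañana es sábado", "kir à l’aÿ", "kočka");
  for (var t : tests) {
    System.out.println(String.format("%s -> %s", t, normalize(t)));
  }
}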

and in Go:

package main

import (
	"fmt"
	"unicode"

	"golang.org/x/text/runes"
	"golang.org/x/text/transform"
	"golang.org/x/text/unicode/norm"
)

func main() {
	tests := []string{"Hallo", "Übergrößengeschäft", "mañana es sábado", "kir à l’aÿ", "kočka"}
	for _, t := range tests {
		n, _ := normalize(t)
		fmt.Printf("%s -> %s\n", t, n)
	}
}

func normalize(s string) (string, error) {
	t := transform.Chain(norm.NFD,
		runes.Remove(runes.In(unicode.Mn)),
		runes.Map(func(r rune) rune {
			switch r {
			case 'ß':
				return 's'
			case 'ẞ':
				return 's'
			}
			return r
		}),
		norm.NFC)
	r, _, err := transform.String(t, s)
	if err != nil {
		return "", err
	}
	return r, nil
}

Their output:

Hallo -> Hallo
Übergrößengeschäft -> Ubergrosengeschaft
mañana es sábado -> manana es sabado
kir à l’aÿ -> kir a l’ay
kočka -> kocka

Note the special case for ß and ẞ: They are proper letters instead of diacritical marks on a letter.

Apr 10 '25 11:04 sdavids

Thanks @sdavids for the pointers, they were very helpful. In this case, however, I do need to handle it slightly differently, as @sschneider-ihre-pvs's request was for Ausloeser in code to match Auslöser as a defined term. The scripts above remove the combining mark, resulting in Ausloser.

However, in languages other than German, simply removing the combining mark works fine.

To that end, I've been working on a solution that tests both: it will match either the form with the combining mark removed, or the specially handled German umlaut form where the mark is replaced with an e, e.g. ö becoming oe. (The Contextive philosophy is generally to match more loosely than strictly.)

So with this change, if the code contained ausloser, ausloeser, OR auslöser they would all show the definitions of a term defined with Auslöser.

And in other languages, e.g. French, pere OR père would match a term defined as Père

I hope this suits your needs, @sschneider-ihre-pvs ?

Regarding the ß character, I understand it is commonly replaced with ss. Is that your experience? To that end it's now only matching with ss, e.g. strasse or straße would match a term defined as Straße, but strase would not. Is that acceptable?
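The matching rules described in this comment can be sketched roughly as follows. This is a minimal Java illustration, not Contextive's actual implementation (which is written in F#); the class and method names are hypothetical, and the assumption is that a glossary term is expanded into a set of accepted spellings while the token from the code is compared as written, ignoring case. The capital ẞ is treated like ß here. Non-ASCII letters are written as Unicode escapes so that the sketch compiles regardless of the source file encoding.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of loose term matching; not Contextive's real code.
class LooseMatch {

    // Spellings under which a glossary term is expected to appear in code.
    static Set<String> candidates(String term) {
        // NFD splits o-umlaut into "o" + U+0308 (combining diaeresis).
        String nfd = Normalizer.normalize(term, Normalizer.Form.NFD);
        // Sharp s (U+00DF) and capital sharp s (U+1E9E) become "ss".
        String ss = nfd.replace("\u00DF", "ss").replace("\u1E9E", "ss");

        Set<String> result = new LinkedHashSet<>();
        result.add(fold(nfd));                          // as written
        result.add(fold(ss));                           // sharp s replaced
        result.add(fold(ss.replace("\u0308", "e")));    // diaeresis becomes "e"
        result.add(fold(ss.replaceAll("\\p{M}", "")));  // all combining marks dropped
        return result;
    }

    // Recomposes the string and lower-cases it for a case-insensitive comparison.
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC).toLowerCase(Locale.ROOT);
    }

    // A token from the code matches if it equals one of the term's spellings.
    static boolean matches(String term, String token) {
        return candidates(term).contains(fold(token));
    }

    public static void main(String[] args) {
        String term = "Ausl\u00F6ser";
        for (String token : new String[] {"ausloser", "ausloeser", "ausl\u00F6ser"}) {
            System.out.println(token + " matches: " + matches(term, token));
        }
        System.out.println("strase matches: " + matches("Stra\u00DFe", "strase"));
    }
}
```

Expanding only the glossary side keeps the check to a set lookup per token. Under this assumption, strase is rejected simply because no rule produces a single s from ß.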

You can see a preview of the proposed documentation change to describe this feature here: https://docs.test.contextive.tech/community/v/c961ccc/guides/defining-terminology/#unicode-and-diacritics - feedback welcome!

The test cases are here: https://github.com/dev-cycles/contextive/blob/main/src/language-server/Contextive.LanguageServer.Tests/E2e/HoverTests.fs#L84

May 17 '25 06:05 chrissimon-au

Note: You forgot the capital ẞ:

https://github.com/dev-cycles/contextive/blob/c961ccc9bffe95954a37cbfae4a85f5ee7a3a18e/src/core/Contextive.Core/GlossaryFile.fs#L39

May 17 '25 17:05 sdavids

A German test case:

Noun "Größe" (size)

Größe - correct spelling
Groesse - someone typing on a keyboard w/out German letters or someone being lazy
groesse - someone typing on a keyboard w/out German letters and not caring about capitalization or someone being lazy
größe - lower case or not caring about capitalization
GRÖSSE - old upper case spelling
GRÖẞE - new (since 2017) upper case spelling with ẞ
GRÖßE - incorrect spelling with lower case ß

All 7 variants would be considered the same word by a German (most Germans do not know about the capital ẞ though 😂).

It gets hairy:

Noun "Masse" (mass)
Noun "Maße" (dimensions)

with the old upper case spelling:

Masse ⇒ MASSE
Maße ⇒ MASSE

with the new upper case spelling:

Masse ⇒ MASSE
Maße ⇒ MAẞE

Uppercase conversion with the old rules is irreversible in German—MASSE could have been derived from two distinct words "Masse" or "Maße".

In this case one cannot use ss and ß interchangeably because that would change the meaning.
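The collision can be reproduced with Java's standard library, whose `String.toUpperCase` follows the old rule and expands ß to SS. The following is only a small sketch; the class and method names are made up for illustration, and ß is written as the escape `\u00DF` so the file compiles regardless of its encoding.

```java
import java.util.Locale;

// Hypothetical demo class; only String.toUpperCase/toLowerCase are real API.
class UpperCaseCollision {

    // True if two different words become identical once upper-cased.
    static boolean collide(String a, String b) {
        return !a.equals(b)
                && a.toUpperCase(Locale.GERMAN).equals(b.toUpperCase(Locale.GERMAN));
    }

    public static void main(String[] args) {
        String mass = "Masse";
        String dimensions = "Ma\u00DFe"; // the word for "dimensions", with sharp s

        // Upper-casing expands the sharp s, so both words end up as the same string.
        System.out.println(mass.toUpperCase(Locale.GERMAN));
        System.out.println(dimensions.toUpperCase(Locale.GERMAN));
        System.out.println("collide: " + collide(mass, dimensions));

        // Lower-casing the result cannot tell which word it came from.
        System.out.println("MASSE".toLowerCase(Locale.GERMAN));
    }
}
```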

Other languages have similar quirks.

I favor a Pareto solution instead of a 100% solution.

May 17 '25 17:05 sdavids

> Regarding the ß character, I understand it is commonly replaced with ss.

German orthography has changed quite a bit in recent years; especially with the Reform der deutschen Rechtschreibung 1996—ss vs. ß was an important (and at that time contentious) part of it.

> To that end it's now only matching with ss, e.g. strasse or straße would match a term defined as Straße, but strase would not. Is that acceptable?

See above.

A German would know what strase means though.

People with German as a second language might also write it that way because they are not familiar with the intricacies of ss vs. ß yet.

Some use s, ss, and ß interchangeably in colloquial text—to the horror of German Grammatiknazis.

May 17 '25 18:05 sdavids

> Note: You forgot the capital ẞ:

contextive/src/core/Contextive.Core/GlossaryFile.fs, line 39 in c961ccc:

s.Replace("\u0308", "e").Replace("ß", "ss")

Thanks @sdavids - I initially excluded it because we do an IgnoreCase comparison so thought it wouldn't matter, but your comment helped me realise that the IgnoreCase comparison is after the normalisation, so it does need doing explicitly. Added a test case and support for this now.
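Why the order matters can be shown with a short Java sketch. It is an analogy to the F# line quoted above, not a port of it; the class and method names are hypothetical. A case-insensitive comparison done after the replacement never gets the chance to treat ẞ as ß, because by then the other side already contains ss.

```java
// Hypothetical sketch: replacement first, case-insensitive comparison second.
class NormalizeThenCompare {

    // Mirrors the replacement quoted above: only the lower-case sharp s is handled.
    static String lowerOnly(String s) {
        return s.replace("\u00DF", "ss");
    }

    // Also replaces the capital sharp s (U+1E9E) before any comparison happens.
    static String bothCases(String s) {
        return s.replace("\u00DF", "ss").replace("\u1E9E", "ss");
    }

    public static void main(String[] args) {
        String inCode = "STRA\u1E9EE"; // upper-case spelling with capital sharp s

        // The capital sharp s survives lowerOnly, so the strings differ in length.
        System.out.println(lowerOnly(inCode).equalsIgnoreCase("strasse"));

        // With both replacements the comparison sees "STRAssE" and "strasse".
        System.out.println(bothCases(inCode).equalsIgnoreCase("strasse"));
    }
}
```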

May 17 '25 22:05 chrissimon-au

Having experimented a bit with the new implementation, this is how it behaves in a few scenarios. Hopefully it makes sense:

`ß` in the glossary file

If the glossary file contains:

contexts:
  - terms:
      - name: Masse
        definition: Mass (English)
      - name: Maße
        definition: dimensions (English)

And if the code contains masse, the hover shows both options, as we can't disambiguate.

If the code contains maße, the hover shows only the Maße option.

`SS` in the glossary file

If the glossary file contains:

contexts:
  - terms:
      - name: MASSE
        definition: Mass (English)
      - name: MASSE
        definition: dimensions (English)

And if the code contains masse, then again both definitions are shown.

If the code contains maße then nothing is shown. We don't reverse SS from the glossary file to match ß or ẞ in the code. From everything above, it seems unlikely that this would happen - if ß or ẞ are in the code, then they would also be used in the definitions file.

`ẞ` in the glossary file

If the glossary file contains:

contexts:
  - terms:
      - name: MASSE
        definition: Mass (English)
      - name: MAẞE
        definition: dimensions (English)

And if the code contains masse, both options are shown.

If the code contains maße, only MAẞE is shown.
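The three scenarios can be reproduced with a small in-memory index. This is a hedged Java sketch under the assumption that each glossary name is indexed under its original spelling plus its ss, oe-style and mark-free variants, and that tokens from the code are looked up as written. `GlossaryIndex`, `add` and `lookup` are invented names, not part of Contextive.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory glossary; structure and names are illustrative only.
class GlossaryIndex {
    // Maps each accepted spelling to the definitions of all terms sharing it.
    private final Map<String, Set<String>> index = new LinkedHashMap<>();

    void add(String name, String definition) {
        String nfd = Normalizer.normalize(name, Normalizer.Form.NFD);
        String ss = nfd.replace("\u00DF", "ss").replace("\u1E9E", "ss");
        String[] variants = {
            nfd,                          // as written in the glossary
            ss,                           // sharp s (either case) replaced by ss
            ss.replace("\u0308", "e"),    // combining diaeresis replaced by e
            ss.replaceAll("\\p{M}", "")   // all combining marks dropped
        };
        for (String v : variants) {
            index.computeIfAbsent(key(v), k -> new LinkedHashSet<>()).add(definition);
        }
    }

    // The token is only case-folded; ss in a token is never turned back into sharp s.
    List<String> lookup(String token) {
        return new ArrayList<>(index.getOrDefault(key(token), Set.of()));
    }

    private static String key(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        GlossaryIndex glossary = new GlossaryIndex();
        glossary.add("Masse", "Mass (English)");
        glossary.add("Ma\u00DFe", "dimensions (English)");
        System.out.println(glossary.lookup("masse"));      // ambiguous spelling
        System.out.println(glossary.lookup("ma\u00DFe"));  // spelling with sharp s
    }
}
```

Because only glossary names are expanded, a glossary that spells both terms MASSE has no entry under the ß spelling, which corresponds to the second scenario.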

Summary

Does this follow the most common expectations in your experience?

May 17 '25 22:05 chrissimon-au

In a German software project…

Decide on the language used in source code/config files

everything in German

“Autofabrik”, “Personenlager” - correct spelling
“AutoFabrik”, “PersonenLager” - incorrect spelling but better IDE and find/replace DX
"FabrikAuto", "LagerPersonen" - incorrect spelling but better sort DX

everything in German with Anglicisms

“Autofactory”, “Personenrepository” - correct spelling but feels weird to a German
“AutoFactory”, “PersonenRepository” - incorrect spelling but better IDE and find/replace DX
“FactoryAuto”, “RepositoryPersonen” - incorrect spelling but better sorting DX

everything in English—open to non-German speaking team members and a possible company/project merger in the future

"CarFactory", "PersonRepository"
"FactoryCar", "RepositoryPerson" - better sort DX

Another consideration is the programming language one is using.

Some languages do not support non-ASCII identifiers (Ruby), or did not in the past (Rust, Python). In that case one has to decide how to handle äöüßÄÖÜẞ: either äöüßÄÖÜẞ ⇒ ae, oe, ue, ss, Ae, Oe, Ue, SS, or äöüßÄÖÜẞ ⇒ a, o, u, s, A, O, U, S, or go with the “everything in English” option for that programming language only.
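The two transliteration options could look like this in Java. It is a minimal sketch with hypothetical names (`AsciiIdentifiers`, `toDigraphs`, `toBaseLetters`), and it assumes the input uses precomposed characters (NFC). The letters are written as Unicode escapes so the file itself stays ASCII.

```java
import java.util.Map;

// Hypothetical helper for turning German words into ASCII-only identifiers.
class AsciiIdentifiers {

    // Option 1: ae, oe, ue, ss, Ae, Oe, Ue, SS
    static final Map<Character, String> DIGRAPHS = Map.of(
            '\u00E4', "ae", '\u00F6', "oe", '\u00FC', "ue", '\u00DF', "ss",
            '\u00C4', "Ae", '\u00D6', "Oe", '\u00DC', "Ue", '\u1E9E', "SS");

    // Option 2: a, o, u, s, A, O, U, S
    static final Map<Character, String> BASE_LETTERS = Map.of(
            '\u00E4', "a", '\u00F6', "o", '\u00FC', "u", '\u00DF', "s",
            '\u00C4', "A", '\u00D6', "O", '\u00DC', "U", '\u1E9E', "S");

    static String toDigraphs(String word) {
        return transliterate(word, DIGRAPHS);
    }

    static String toBaseLetters(String word) {
        return transliterate(word, BASE_LETTERS);
    }

    // Replaces every mapped character and copies all others unchanged.
    private static String transliterate(String word, Map<Character, String> table) {
        StringBuilder out = new StringBuilder();
        for (char c : word.toCharArray()) {
            out.append(table.getOrDefault(c, String.valueOf(c)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String word = "Ma\u00DFe"; // "dimensions"
        System.out.println(toDigraphs(word));
        System.out.println(toBaseLetters(word));
    }
}
```

Note that the first option maps Maße to Masse, which is itself a German word (mass), so neither option is free of trade-offs.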

Decide on the language used in acceptance/BDD tests

everything in German
source code in English, specifications in German
everything in English

Decide on the language used in documentation

everything in German
- if the decision was made to use English in source code/config files, then usually a document (wiki page) with canonical translations is created, i.e. German “Auto” is always translated as “Car” and not “Auto”, and source code snippets are exempt from the "everything in German" rule
documentation in English

Decide on the language used in stakeholder communication

everything in German
- if any other step above was not "everything in German" a canonical translation document (wiki page) is necessary
everything in English - unrealistic: although most Germans know some English, most of them are not at a "native speaker" level and cannot tell you the English translation of their terms

In my experience the most common combination is:

source code/config files in English
source code of acceptance/BDD tests in English, specifications in German
documentation in English
- canonical translation (English ⇔ German) document/wiki page
stakeholder communication in German
- canonical translation (English ⇔ German) document/wiki page

I have heard of projects where everything is in German (legacy systems, banking, and insurance).

In a DDD context one might also have a ubiquitous language document and a canonical translation document

German synonyms ⇒ a single ubiquitous term ⇒ canonical translation

"Auto", "Karre", "Wagen" ⇒ "Kraftfahrzeug" ⇒ "motor vehicle"

May 18 '25 09:05 sdavids

Thanks @sdavids for your thorough analysis. You might also find the discussion here interesting - https://github.com/dev-cycles/contextive/discussions/88

This ticket is primarily about the appropriate handling of Unicode text for likely mismatches between glossary terminology and code.

That discussion explores a proposal for a more thorough handling of multi-language projects and could help with the scenarios you explore in this comment.

May 18 '25 10:05 chrissimon-au

Example

Let’s say we are in the box storing domain…

We might come up with BoxRepository and Box.

{
  "id": 1,
  "mass": 5.5,
  "dimensions": "1x3x5"
}

public record Box(int id, double mass, String dimensions) {}

class Box
  attr_reader :id, :mass, :dimensions

  def initialize(id, mass, dimensions)
    @id = id
    @mass = mass
    @dimensions = dimensions
  end
end

Note: This is a really bad Box model!

This is how it would translate to “everything in German”:

Kistenlager and Kiste (note the additional n—Kistelager would be incorrect)

{
  "Nummer": 1,
  "Masse": 5.5,
  "Maße": "1x3x5"
}

public record Kiste(@JsonProperty("Nummer") int nummer, @JsonProperty("Masse") double masse, @JsonProperty("Maße") String maße) {}

Does not work (illegal Ruby identifier):

class Kiste
  def initialize(nummer, masse, maße)
    @nummer = nummer
    @masse = masse
    @maße = maße
  end
end

Does not work either (duplicate identifier):

class Kiste
  def initialize(nummer, masse, masse)
    @nummer = nummer
    @masse = masse
    @masse = masse
  end
end

Reach for a synonym, e.g.:

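require 'active_model'

class Kiste
  include ActiveModel::Serializers::JSON

  attr_reader :nummer, :masse, :format

  def initialize(nummer, masse, format)
    @nummer = nummer
    @masse = masse
    @format = format
  end

  # ActiveModel serialization needs the list of attribute names.
  def attributes
    { 'nummer' => nil, 'masse' => nil, 'format' => nil }
  end

  # Rename the ASCII attribute names to the German JSON property names.
  def as_json(options = {})
    h = super(options)
    h.store('Nummer', h.delete('nummer'))
    h.store('Masse', h.delete('masse'))
    h.store('Maße', h.delete('format'))
    h # return the hash, not the value of the last store call
  end
end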

Most developers would also reach for a synonym in the Java case: German special letters break easily when some team members use Windows and others use macOS/Linux with JDK < 18 (before JDK 18, the default charset depended on the platform):

public record Kiste(@JsonProperty("Nummer") int nummer, @JsonProperty("Masse") double masse, @JsonProperty("Maße") String format) {}

May 18 '25 10:05 sdavids

:tada: This issue has been resolved in version 1.17.0 :tada:

The release is available on GitHub release

Your semantic-release bot :package::rocket:

May 19 '25 10:05 chrissimon-au

Closing this issue now as the original intent is satisfied. Further conversation about multi-language support to take place on discussion #88 .

May 19 '25 10:05 chrissimon-au
