FR: Umlauts support
Usually in programming you do not use umlauts in code, though they may be appropriate in the descriptions of terms.
So in code you could have something like Ausloeser but, similar to the plural detection, this should refer to Auslöser in the glossary.
Interesting idea - I can see the value. I will take a look at how oe could be considered equivalent to ö for the purposes of displaying a hover.
In the meantime, as a workaround, you could define an alias, e.g.
- name: Auslöser
  aliases:
    - Ausloeser
which would, I think, give the effect you're looking for.
Here is how you would do it in Java:
String normalize(String s) {
  return Normalizer.normalize(s, Normalizer.Form.NFKD)
      .replaceAll("\\p{M}", "")
      .replaceAll("ß", "s")
      .replaceAll("ẞ", "s");
}

void main() {
  var tests = List.of("Hallo", "Übergrößengeschäft", "mañana es sábado", "kir à l’aÿ", "kočka");
  for (var t : tests) {
    println(String.format("%s -> %s", t, normalize(t)));
  }
}
and in Go:
package main

import (
	"fmt"
	"unicode"

	"golang.org/x/text/runes"
	"golang.org/x/text/transform"
	"golang.org/x/text/unicode/norm"
)

func main() {
	tests := []string{"Hallo", "Übergrößengeschäft", "mañana es sábado", "kir à l’aÿ", "kočka"}
	for _, t := range tests {
		n, _ := normalize(t)
		fmt.Printf("%s -> %s\n", t, n)
	}
}

func normalize(s string) (string, error) {
	t := transform.Chain(norm.NFD,
		runes.Remove(runes.In(unicode.Mn)),
		runes.Map(func(r rune) rune {
			switch r {
			case 'ß':
				return 's'
			case 'ẞ':
				return 's'
			}
			return r
		}),
		norm.NFC)
	r, _, err := transform.String(t, s)
	if err != nil {
		return "", err
	}
	return r, nil
}
Their output:
Hallo -> Hallo
Übergrößengeschäft -> Ubergrosengeschaft
mañana es sábado -> manana es sabado
kir à l’aÿ -> kir a l’ay
kočka -> kocka
Note the special case for ß and ẞ: They are proper letters instead of diacritical marks on a letter.
Thanks @sdavids for the pointers, they were very helpful. In this case, however, I do need to handle it slightly differently, as @sschneider-ihre-pvs's request was for Ausloeser in code to match Auslöser as a defined term. The scripts above remove the combining mark, resulting in Ausloser.
However, in languages other than German, simply removing the combining mark works fine.
To that end, I've been working on a solution that tests both: it will match either the form with the combining mark removed, or the specially handled case of the German umlaut, where the mark is replaced with an e, e.g. ö becoming oe. (The Contextive philosophy is generally to match loosely rather than strictly.)
So with this change, if the code contained ausloser, ausloeser, OR auslöser they would all show the definitions of a term defined with Auslöser.
And in other languages, e.g. French, pere OR père would match a term defined as Père.
I hope this suits your needs, @sschneider-ihre-pvs ?
Regarding the ß character, I understand it is commonly replaced with ss. Is that your experience? To that end it's now only matching with ss, e.g. strasse or straße would match a term defined as Straße, but strase would not. Is that acceptable?
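To make the matching rule concrete, here is a rough sketch in Java. It is not the actual F# implementation: the class `TermVariants` and its methods `variants` and `matches` are hypothetical names, and the sketch assumes that only the glossary term is folded while the word found in code is compared as written, ignoring case.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch; names are illustrative, not Contextive's API.
final class TermVariants {

    // Spellings under which a glossary term may be matched.
    static Set<String> variants(String term) {
        // Decompose so that o-umlaut becomes 'o' + U+0308 (combining diaeresis).
        String nfd = Normalizer.normalize(term, Normalizer.Form.NFD);
        // U+00DF (sharp s) and U+1E9E (capital sharp s) are letters, not marks.
        String ss = nfd.replace("\u00DF", "ss").replace("\u1E9E", "ss");

        Set<String> out = new LinkedHashSet<>();
        out.add(term);                                  // as written
        out.add(ss.replaceAll("\\p{M}", ""));           // all marks removed
        out.add(ss.replace("\u0308", "e")               // diaeresis becomes "e"
                  .replaceAll("\\p{M}", ""));           // other marks removed
        return out;
    }

    // True if the word found in code equals any variant, ignoring case.
    static boolean matches(String term, String word) {
        for (String v : variants(term)) {
            if (v.equalsIgnoreCase(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(variants("Ausl\u00F6ser"));
        System.out.println(matches("Stra\u00DFe", "strasse"));
        System.out.println(matches("Stra\u00DFe", "strase"));
    }
}
```

The string literals use Unicode escapes so the example does not depend on the source file encoding.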
You can see a preview of the proposed documentation change to describe this feature here: https://docs.test.contextive.tech/community/v/c961ccc/guides/defining-terminology/#unicode-and-diacritics - feedback welcome!
The test cases are here: https://github.com/dev-cycles/contextive/blob/main/src/language-server/Contextive.LanguageServer.Tests/E2e/HoverTests.fs#L84
Note: You forgot the capital ẞ:
https://github.com/dev-cycles/contextive/blob/c961ccc9bffe95954a37cbfae4a85f5ee7a3a18e/src/core/Contextive.Core/GlossaryFile.fs#L39
A German test case:
Noun "Größe" (size)
Größe - correct spelling
Groesse - someone typing on a keyboard w/out German letters or someone being lazy
groesse - someone typing on a keyboard w/out German letters and not caring about capitalization or someone being lazy
größe - lower case or not caring about capitalization
GRÖSSE - old upper case spelling
GRÖẞE - new (since 2017) upper case spelling with ẞ
GRÖßE - incorrect spelling with lower case ß
All 7 variants would be considered the same word by a German (most Germans do not know about the capital ẞ though 😂).
It gets hairy:
Noun "Masse" (mass)
Noun "Maße" (dimensions)
with the old upper case spelling:
Masse ⇒ MASSE
Maße ⇒ MASSE
with the new upper case spelling:
Masse ⇒ MASSE
Maße ⇒ MAẞE
Uppercase conversion with the old rules is irreversible in German—MASSE could have been derived from two distinct words "Masse" or "Maße".
In this case one cannot use ss and ß interchangeably because that would change the meaning.
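The collision can be reproduced in Java, whose `String.toUpperCase` applies the Unicode special-casing rule that turns ß into SS. A minimal sketch (the class name `UpperCaseCollision` and the helper `upper` are only for illustration):

```java
import java.util.Locale;

final class UpperCaseCollision {

    // Upper-case with locale-independent Unicode rules; U+00DF becomes "SS".
    static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String mass = "Masse";            // mass
        String dimensions = "Ma\u00DFe";  // dimensions, written with U+00DF (sharp s)

        // Two distinct words collapse into the same upper-case string.
        System.out.println(upper(mass).equals(upper(dimensions)));
        // Lower-casing the result cannot bring the sharp s back.
        System.out.println(upper(dimensions).toLowerCase(Locale.ROOT));
        // The capital sharp s (U+1E9E) keeps the two words apart.
        System.out.println("MA\u1E9EE".toLowerCase(Locale.ROOT).equals("ma\u00DFe"));
    }
}
```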
Other languages have similar quirks.
I favor a Pareto solution instead of a 100% solution.
> Regarding the ß character, I understand it is commonly replaced with ss.
German orthography has changed quite a bit in recent years; especially with the Reform der deutschen Rechtschreibung 1996—ss vs. ß was an important (and at that time contentious) part of it.
> To that end it's now only matching with ss, e.g. strasse or straße would match a term defined as Straße, but strase would not. Is that acceptable?
See above.
A German would know what strase means though.
People with German as a second language might also write it that way because they are not familiar with the intricacies of ss vs. ß yet.
Some use s, ss, and ß interchangeably in colloquial text—to the horror of German Grammatiknazis.
> Note: You forgot the capital ẞ:
> contextive/src/core/Contextive.Core/GlossaryFile.fs, line 39 in c961ccc:
> s.Replace("\u0308", "e").Replace("ß", "ss")
Thanks @sdavids - I initially excluded it because we do an IgnoreCase comparison so thought it wouldn't matter, but your comment helped me realise that the IgnoreCase comparison is after the normalisation, so it does need doing explicitly. Added a test case and support for this now.
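To illustrate why the order matters, here is a rough Java sketch rather than the actual F# code. `foldLowerOnly` and `foldBoth` are hypothetical stand-ins for the normalisation step before and after the fix, and `String.equalsIgnoreCase` stands in for the IgnoreCase comparison.

```java
final class SharpSOrder {

    // Before the fix: only the lower-case sharp s (U+00DF) is expanded.
    static String foldLowerOnly(String s) {
        return s.replace("\u00DF", "ss");
    }

    // After the fix: the capital sharp s (U+1E9E) is expanded as well.
    static String foldBoth(String s) {
        return s.replace("\u00DF", "ss").replace("\u1E9E", "ss");
    }

    public static void main(String[] args) {
        String term = "MA\u1E9EE"; // glossary term written with U+1E9E

        // The case-insensitive comparison runs after folding, so the capital
        // sharp s is still a single character when it is compared with "ss".
        System.out.println(foldLowerOnly(term).equalsIgnoreCase("masse"));
        System.out.println(foldBoth(term).equalsIgnoreCase("masse"));
    }
}
```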
With the new implementation, having experimented a bit, this is how it works over a few scenarios, hopefully it makes sense:
ß in the glossary file
If the glossary file contains:
contexts:
  - terms:
      - name: Masse
        definition: Mass (English)
      - name: Maße
        definition: dimensions (English)
If the code contains masse, the hover shows both options, as we can't disambiguate.
If the code contains maße, the hover shows only the Maße option.
SS in the glossary file
If the glossary file contains:
contexts:
  - terms:
      - name: MASSE
        definition: Mass (English)
      - name: MASSE
        definition: dimensions (English)
If the code contains masse then, again, both definitions are shown.
If the code contains maße then nothing is shown. We don't reverse SS from the glossary file to match ß or ẞ in the code. From everything above, it seems unlikely that this would happen: if ß or ẞ is used in the code, it would also be used in the definitions file.
ẞ in the glossary file
If the glossary file contains:
contexts:
  - terms:
      - name: MASSE
        definition: Mass (English)
      - name: MAẞE
        definition: dimensions (English)
If the code contains masse then both options are shown.
If the code contains maße then only MAẞE is shown.
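Taken together, the three scenarios behave roughly like a lookup table in which each term is registered under its own spelling and under its folded spelling, while the word from the code is only lower-cased. The Java sketch below is an illustration of that behaviour, not the real implementation; `GlossaryIndex`, `add` and `lookup` are made-up names, and the umlaut-to-e variant is left out for brevity.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the lookup behaviour; not Contextive's implementation.
final class GlossaryIndex {
    private final Map<String, List<String>> byKey = new HashMap<>();

    // Register a term under its own spelling and under its folded spelling.
    void add(String name, String definition) {
        String folded = Normalizer.normalize(name, Normalizer.Form.NFD)
                .replace("\u00DF", "ss")    // sharp s
                .replace("\u1E9E", "ss")    // capital sharp s
                .replaceAll("\\p{M}", "");  // combining marks
        put(name, definition);
        if (!folded.equalsIgnoreCase(name)) {
            put(folded, definition);
        }
    }

    private void put(String key, String definition) {
        byKey.computeIfAbsent(key.toLowerCase(Locale.ROOT), k -> new ArrayList<>())
             .add(definition);
    }

    // The word from the code is only lower-cased, never folded.
    List<String> lookup(String word) {
        return byKey.getOrDefault(word.toLowerCase(Locale.ROOT), List.of());
    }

    public static void main(String[] args) {
        GlossaryIndex index = new GlossaryIndex();
        index.add("Masse", "Mass (English)");
        index.add("Ma\u00DFe", "dimensions (English)");
        System.out.println(index.lookup("masse"));
        System.out.println(index.lookup("ma\u00DFe"));
    }
}
```

Because the code word is never folded, a glossary that only contains MASSE yields no result for maße, which mirrors the second scenario.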
Summary
Does this follow the most common expectations in your experience?
In a German software project…
Decide on the language used in source code/config files
- everything in German
  - “Autofabrik”, “Personenlager” - correct spelling
  - “AutoFabrik”, “PersonenLager” - incorrect spelling but better IDE and find/replace DX
  - “FabrikAuto”, “LagerPersonen” - incorrect spelling but better sort DX
- everything in German with Anglicisms
  - “Autofactory”, “Personenrepository” - correct spelling but feels weird to use for a German
  - “AutoFactory”, “PersonenRepository” - incorrect spelling but better IDE and find/replace DX
  - “FactoryAuto”, “RepositoryPersonen” - incorrect spelling but better sort DX
- everything in English, which keeps the project open to non-German-speaking team members and a possible company/project merger in the future
  - “CarFactory”, “PersonRepository”
  - “FactoryCar”, “RepositoryPerson” - better sort DX
Another consideration is the programming language one is using.
Some do not support non-ASCII identifiers (Ruby) or did not in the past (Rust, Python). In that case one has to decide how to handle äöüßÄÖÜẞ: either äöüßÄÖÜẞ ⇒ ae, oe, ue, ss, Ae, Oe, Ue, SS, or äöüßÄÖÜẞ ⇒ a, o, u, s, A, O, U, S, or go with the “everything in English” option for that programming language only.
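Whichever scheme a team picks, it helps to apply it mechanically. A small Java sketch of both mappings follows; the class `AsciiIdentifiers` and its members are hypothetical, and the input is assumed to use precomposed (NFC) characters.

```java
import java.util.Map;

// Hypothetical helper for projects that need ASCII-only identifiers.
final class AsciiIdentifiers {

    // Scheme 1: digraphs (a-umlaut becomes "ae", sharp s becomes "ss", ...).
    static final Map<Character, String> DIGRAPHS = Map.of(
            '\u00E4', "ae", '\u00F6', "oe", '\u00FC', "ue", '\u00DF', "ss",
            '\u00C4', "Ae", '\u00D6', "Oe", '\u00DC', "Ue", '\u1E9E', "SS");

    // Scheme 2: base letters only (a-umlaut becomes "a", sharp s becomes "s", ...).
    static final Map<Character, String> BASE_LETTERS = Map.of(
            '\u00E4', "a", '\u00F6', "o", '\u00FC', "u", '\u00DF', "s",
            '\u00C4', "A", '\u00D6', "O", '\u00DC', "U", '\u1E9E', "S");

    // Replace every mapped character; everything else is copied unchanged.
    static String transliterate(String name, Map<Character, String> scheme) {
        StringBuilder out = new StringBuilder();
        for (char c : name.toCharArray()) {
            String mapped = scheme.get(c);
            out.append(mapped != null ? mapped : String.valueOf(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String name = "Gr\u00F6\u00DFe"; // the noun from the test case above
        System.out.println(transliterate(name, DIGRAPHS));
        System.out.println(transliterate(name, BASE_LETTERS));
    }
}
```

Note that under the base-letter scheme more words collide than under the digraph scheme, so the choice affects how often a synonym is needed.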
Decide on the language used in acceptance/BDD tests
- everything in German
- source code in English, specifications in German
- everything in English
Decide on the language used in documentation
- everything in German
  - if the decision was made to use English in source code/config files, then usually a document (wiki page) with canonical translations is created, i.e. German “Auto” is always translated as “Car” and not “Auto”, and source code snippets are exempt from the "everything in German" rule
- documentation in English
Decide on the language used in stakeholder communication
- everything in German
  - if any other step above was not "everything in German", a canonical translation document (wiki page) is necessary
- everything in English - unrealistic because, even though most Germans have some knowledge of English, most of them are not at a "native speaker" level or able to tell you the English translation of their term
In my experience the most common combination is:
- source code/config files in English
- source code of acceptance/BDD tests in English, specifications in German
- documentation in English
  - canonical translation (English ⇔ German) document/wiki page
- stakeholder communication in German
  - canonical translation (English ⇔ German) document/wiki page
I have heard of projects where everything is in German (legacy systems, banking, and insurance).
In a DDD context one might also have a ubiquitous language document and a canonical translation document:
German synonyms ⇒ a single ubiquitous term ⇒ canonical translation
"Auto", "Karre", "Wagen" ⇒ "Kraftfahrzeug" ⇒ "motor vehicle"
Thanks @sdavids for your thorough analysis. You might also find the discussion here interesting - https://github.com/dev-cycles/contextive/discussions/88
This ticket is primarily just about the appropriate handling of Unicode text for the mismatches one is likely to see between glossary terminology and code.
That discussion explores a proposal for a more thorough handling of multi-language projects and could help with the scenarios you explore in this comment.
Example
Let’s say we are in the box storing domain…
We might come up with BoxRepository and Box.
{
  "id": 1,
  "mass": 5.5,
  "dimensions": "1x3x5"
}

public record Box(int id, double mass, String dimensions) {}

class Box
  attr_reader :id, :mass, :dimensions

  def initialize(id, mass, dimensions)
    @id = id
    @mass = mass
    @dimensions = dimensions
  end
end
Note: This is a really bad Box model!
This is how it would translate to “everything in German”:
Kistenlager and Kiste (note the additional n—Kistelager would be incorrect)
{
  "Nummer": 1,
  "Masse": 5.5,
  "Maße": "1x3x5"
}
public record Kiste(@JsonProperty("Nummer") int nummer, @JsonProperty("Masse") double masse, @JsonProperty("Maße") String maße) {}
Does not work (illegal Ruby identifier):
class Kiste
  def initialize(nummer, masse, maße)
    @nummer = nummer
    @masse = masse
    @maße = maße
  end
end
Does not work either (duplicate identifier):
class Kiste
  def initialize(nummer, masse, masse)
    @nummer = nummer
    @masse = masse
    @masse = masse
  end
end
Reach for a synonym, e.g.:
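require 'active_model'

class Kiste
  include ActiveModel::Serializers::JSON

  attr_reader :nummer, :masse, :format

  def initialize(nummer, masse, format)
    @nummer = nummer
    @masse = masse
    @format = format
  end

  # ActiveModel serialization needs to know which attributes to serialize.
  def attributes
    { 'nummer' => nil, 'masse' => nil, 'format' => nil }
  end

  # Map the ASCII attribute names back to the German JSON keys.
  def as_json(options = {})
    h = super(options)
    h['Nummer'] = h.delete('nummer')
    h['Masse'] = h.delete('masse')
    h['Maße'] = h.delete('format')
    h
  end
end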
Most developers would reach for a synonym in the Java case as well, because German special letters break easily when some team members use Windows and others use macOS/Linux with JDK < 18:
public record Kiste(@JsonProperty("Nummer") int nummer, @JsonProperty("Masse") double masse, @JsonProperty("Maße") String format) {}
This issue has been resolved in version 1.17.0. The release is available as a GitHub release.
Closing this issue now as the original intent is satisfied. Further conversation about multi-language support will take place on discussion #88.