swift-foundation

Swedish å, ä, ö are treated as diacritics

Open jhansbo opened this issue 1 year ago • 4 comments

The Scandinavian languages and the Finnish language, by contrast, treat the characters with diacritics å, ä, and ö as distinct letters of the alphabet, and sort them after z. Usually ä (a-umlaut) and ö (o-umlaut) [used in Swedish and Finnish] are sorted as equivalent to æ (ash) and ø (o-slash) [used in Danish and Norwegian]. Also, aa, when used as an alternative spelling to å, is sorted as such. Other letters modified by diacritics are treated as variants of the underlying letter, with the exception that ü is frequently sorted as y.

import Foundation

let symbol = "The Swedish letters Å, Ä, Ö"
let string = "a"
// Search ignoring both case and diacritics (no locale is passed).
let symbolRange = symbol.range(of: string, options: [.caseInsensitive, .diacriticInsensitive])

if symbolRange != nil {
    print("Found '\(string)' in '\(symbol)'")
} else {
    print("'\(string)' not found in '\(symbol)'")
}

Prints: Found 'a' in 'The Swedish letters Å, Ä, Ö'

Should print: 'a' not found in 'The Swedish letters Å, Ä, Ö'

Replacing the string with "o" gives the same issue.

jhansbo, Apr 28 '24 13:04

Right, as a native Swedish speaker I find the current behavior very strange: when matching without diacritics, the result we get is simply incorrect for Swedish.

hassila, Apr 28 '24 15:04

Also note that sorting will be incorrect. A, B, C, D, ..., X, Y, Z, Å, Ä, Ö is the correct sorting order for the Swedish alphabet. In a Swedish dictionary é is sorted along with e and ü is sorted along with u (both are true diacritics), but å and ä are not sorted along with a, and ö is not sorted along with o.
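The order described above can be made concrete with a hand-rolled sort key. This is only a sketch to illustrate the expected Swedish ordering, not how Foundation collates strings. The names `swedishAlphabet`, `swedishRank`, and `swedishSortKey` are hypothetical, and folding every other accented letter onto its base letter via canonical decomposition is a simplifying assumption.

```swift
import Foundation

// Hypothetical illustration: the Swedish alphabet in dictionary order,
// with å, ä, ö as separate letters after z.
let swedishAlphabet = Array("abcdefghijklmnopqrstuvwxyzåäö")
let swedishRank: [Character: Int] = Dictionary(
    uniqueKeysWithValues: swedishAlphabet.enumerated().map { pair in
        (pair.element, pair.offset)
    }
)

// Maps a word to a list of alphabet positions that can be compared lexicographically.
func swedishSortKey(_ word: String) -> [Int] {
    word.lowercased().map { letter -> Int in
        // å, ä, ö and plain a-z are looked up directly.
        if let rank = swedishRank[letter] { return rank }
        // Assumption: any other accented letter (é, ü, ...) sorts with its base letter,
        // taken as the first scalar of the canonical decomposition.
        let decomposed = String(letter).decomposedStringWithCanonicalMapping
        if let base = decomposed.unicodeScalars.first,
           let rank = swedishRank[Character(base)] {
            return rank
        }
        // Anything else sorts after the alphabet.
        return swedishAlphabet.count
    }
}

let words = ["öl", "zebra", "äpple", "ål", "apa"]
let sortedWords = words.sorted {
    swedishSortKey($0).lexicographicallyPrecedes(swedishSortKey($1))
}
print(sortedWords)  // ["apa", "zebra", "ål", "äpple", "öl"]
```

A real implementation would rely on locale-aware collation rather than a fixed table; the table is used here only because it makes the expected order explicit.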

As mentioned, this is also a problem for Norwegian and Danish. It's peculiar that only Å and Æ (the Danish equivalents of Å and Ä) are considered diacritics, while Ø (the Danish equivalent of Ö) is not.

jhansbo, Apr 29 '24 11:04

This API, range(of:options:), isn't locale- or language-aware. While these letters are distinct letters in Swedish, they are indeed letters with diacritics in other languages, so it's challenging to make that distinction here.
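For context, at the Unicode level Å, Ä, and Ö each have a canonical decomposition into a base letter plus a combining mark, and that decomposition is the same regardless of language. The sketch below just prints those decompositions; `canonicalScalars` is a hypothetical helper, and whether Foundation's diacritic-insensitive matching is actually built on this decomposition is an assumption, not something confirmed in this thread.

```swift
import Foundation

// Hypothetical helper: lists the code points of a string after
// canonical decomposition (Unicode normalization form D).
func canonicalScalars(_ text: String) -> [String] {
    text.decomposedStringWithCanonicalMapping.unicodeScalars.map { scalar in
        let hex = String(scalar.value, radix: 16, uppercase: true)
        // Left-pad to at least four hex digits, e.g. "41" -> "0041".
        let padding = String(repeating: "0", count: max(0, 4 - hex.count))
        return "U+" + padding + hex
    }
}

for letter in ["Å", "Ä", "Ö", "Ø"] {
    print(letter, canonicalScalars(letter))
}
// Å ["U+0041", "U+030A"]   A + combining ring above
// Ä ["U+0041", "U+0308"]   A + combining diaeresis
// Ö ["U+004F", "U+0308"]   O + combining diaeresis
// Ø ["U+00D8"]             no canonical decomposition
```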

That being said, I would definitely expect the localized version of this API, e.g. range(of: string, options: [.caseInsensitive, .diacriticInsensitive], locale: Locale(languageCode: .swedish)), to return what you described, but it currently doesn't. Would you agree that we should track that issue instead?
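A minimal sketch of the localized call, spelled out with the full `range(of:options:range:locale:)` signature. The helper name `containsLetter` is hypothetical, and `Locale(identifier: "sv_SE")` is used here as a stand-in for the Swedish locale. Per the comment above, the localized search does not yet give the expected Swedish result, so no output is claimed for it; only the plain case-insensitive baseline is certain.

```swift
import Foundation

let symbol = "The Swedish letters Å, Ä, Ö"
let swedish = Locale(identifier: "sv_SE")

// Hypothetical helper: true if `needle` is found in `text` with the given options and locale.
func containsLetter(
    _ needle: String,
    in text: String,
    options: String.CompareOptions,
    locale: Locale?
) -> Bool {
    text.range(of: needle, options: options, range: nil, locale: locale) != nil
}

// Baseline: case-insensitive only. The text contains no plain "a", so this is false.
let plainMatch = containsLetter("a", in: symbol, options: [.caseInsensitive], locale: nil)

// Localized, diacritic-insensitive search. For Swedish the desired answer is false,
// but the thread reports that this is not what the API currently returns.
let localizedMatch = containsLetter(
    "a",
    in: symbol,
    options: [.caseInsensitive, .diacriticInsensitive],
    locale: swedish
)

print("case-insensitive only:", plainMatch)
print("Swedish locale, diacritic-insensitive:", localizedMatch)
```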

itingliu, Jul 15 '24 22:07

It seems there are many more languages that treat them as separate letters, see e.g.:

https://en.wikipedia.org/wiki/Å#:~:text=It%20is%20a%20separate%20letter,Pamirian%20languages%2C%20and%20Greenlandic%20alphabets.

But as there is no single correct answer, moving this case to be for the Swedish locale would be ok I think. (Although I think the locale-unaware default is debatable, I guess it's been that way for some time...)

hassila, Jul 17 '24 17:07