wheelmap-classic

Umlauts lead to duplicate wkt-entries

Open thermann78 opened this issue 9 years ago • 6 comments

We have a problem with importing new regions. As of now, regions are imported as .wkt files into https://github.com/sozialhelden/wheelmap/tree/master/db/data/wkt/. Take a close look at regions with umlauts in their names: in the folders /europe/austria, /europe/germany and /europe/switzerland there are several duplicates like "baden_wuerttemberg.wkt" and "baden_wurttemberg.wkt", as well as duplicate folder names. This leads to problems with our Librato metrics, where we have to reassign these data sources to the corresponding dashboards. As we plan to expand and add new regions soon, this should be fixed very soon. @lennerd imported the .wkt files from a Dropbox folder and started the import task, so he can tell how exactly new regions are added to the server.

thermann78 avatar Dec 08 '15 15:12 thermann78

@holgerd: This should be fixed in one of the next milestones, as it causes problems every time we import new regions that are important for administrative purposes. It has just happened again: https://github.com/sozialhelden/wheelmap-privateissues/issues/29

thermann78 avatar May 11 '16 12:05 thermann78

Thank you for reporting this.

Here is our additional input: Mac OS X seems to have problems handling two directories with the same name but different encodings.

Example:

Mac OS

On Mac OS we see two Baden_Württemberg folders in the directory wheelmap/db/data/wkt/Europe/Germany/:

→ ls
Baden_Württemberg
Baden_Württemberg

When we try to access both of them, we can only reach one:

→ cd Baden_Württemberg/
Bodenseekreis/   Kreis_Konstanz/

Virtual Machine

On the virtual machine (OS: Ubuntu) it is possible to see and access both Baden_Württemberg folders by using the Tab key to switch between them.

First Baden_Württemberg folder is selected:

➜  Germany git:(add-new-regions-germany-mexico) cd Baden_Wu<0308>rttemberg/
Baden_Württemberg/  Baden_Württemberg/

Second Baden_Württemberg folder is selected:

➜  Germany git:(add-new-regions-germany-mexico) cd Baden_Württemberg/
Baden_Württemberg/  Baden_Württemberg/

Git

Untracked files:
  (use "git add <file>..." to include in what will be committed)

    "db/data/wkt/Europe/Germany/Baden_W\303\274rttemberg/Kreis_B\303\266blingen.wkt"
    "db/data/wkt/Europe/Germany/Baden_W\303\274rttemberg/Kreis_B\303\266blingen/"
    "db/data/wkt/Europe/Germany/Baden_W\303\274rttemberg/Kreis_Ludwigsburg.wkt"
    "db/data/wkt/Europe/Germany/Baden_W\303\274rttemberg/Kreis_Ludwigsburg/"
    "db/data/wkt/Europe/Germany/Baden_W\303\274rttemberg/Schwarzwald_Baar_Kreis.wkt"
    "db/data/wkt/Europe/Germany/Baden_W\303\274rttemberg/Schwarzwald_Baar_Kreis/"
    "db/data/wkt/Europe/Germany/Nordrhein_Westfalen/Kreis_D\303\274ren.wkt"
    "db/data/wkt/Europe/Germany/Nordrhein_Westfalen/Kreis_D\303\274ren/"

There is also an issue when using Git because it recognizes that there are 2 folders with the same name but with different encoding.

Example:

➜  wheelmap git:(add-new-regions-germany-mexico) ✗ git add db/data/wkt/Europe/Germany/Baden_Wu\314\210rttemberg/Kreis_Ludwigsburg.wkt
fatal: pathspec 'db/data/wkt/Europe/Germany/Baden_Wu314210rttemberg/Kreis_Ludwigsburg.wkt' did not match any files

Using the git add command with Tab completion works:

➜  wheelmap git:(add-new-regions-germany-mexico) ✗ git add db/data/wkt/Europe/Germany/Baden_Wu<0308>rttemberg/Kreis_Ludwigsburg.wkt

When we try to delete a file with umlauts in its path, the path we type does not match the file because of the encoding issue:

➜  wheelmap git:(add-new-regions-germany-mexico) ✗ rm db/data/wkt/Europe/Germany/Baden_Wu\314\210rttemberg/._Kreis_Ludwigsburg.wkt
rm: cannot remove `db/data/wkt/Europe/Germany/Baden_Wu314210rttemberg/._Kreis_Ludwigsburg.wkt': No such file or directory
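Duplicates like the ones above could be detected before an import with a small check. The following is only a sketch, not part of the wheelmap code base: the function names `colliding_names` and `scan` are made up for illustration, and the choice to compare names after NFC normalization and case folding is an assumption about which names should count as "the same".

```python
import os
import unicodedata
from collections import defaultdict


def colliding_names(names):
    """Group names that become identical after NFC normalization and case folding."""
    groups = defaultdict(list)
    for name in names:
        key = unicodedata.normalize("NFC", name).casefold()
        groups[key].append(name)
    # Only groups with more than one member are real collisions.
    return [sorted(group) for group in groups.values() if len(group) > 1]


def scan(root):
    """Walk a directory tree and report colliding entries per directory."""
    report = {}
    for dirpath, dirnames, filenames in os.walk(root):
        collisions = colliding_names(dirnames + filenames)
        if collisions:
            report[dirpath] = collisions
    return report


if __name__ == "__main__":
    # Path taken from the thread; adjust to the local checkout.
    for directory, collisions in scan("db/data/wkt").items():
        print(directory, collisions)
```

On a filesystem that normalizes names itself (such as HFS+), both variants may be reported as a single entry, so a check like this is most useful on the Linux side.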

1000miles avatar May 11 '16 15:05 1000miles

Is there any particular reason the folder needs to be named Baden_Württemberg/? Naming the folder baden_wuerttemberg would sidestep the issue. Generally speaking, the naming convention "all lowercase, no special chars" is a good one when working with data that crosses system boundaries.
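As a rough illustration of that convention, a renaming helper might look like the sketch below. The function name `safe_name` and the transliteration table are assumptions made for this example; the thread itself only shows ü becoming ue.

```python
import re
import unicodedata

# Assumed German transliterations; only "ü" -> "ue" appears in the thread.
GERMAN = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def safe_name(name: str) -> str:
    """Return an all-lowercase, ASCII-only version of a file or folder name."""
    # Normalize first so both representations of an umlaut are handled alike.
    name = unicodedata.normalize("NFC", name).lower()
    for src, dst in GERMAN.items():
        name = name.replace(src, dst)
    # Strip any remaining accents by decomposing and dropping non-ASCII marks.
    name = unicodedata.normalize("NFKD", name)
    name = name.encode("ascii", "ignore").decode("ascii")
    # Collapse everything outside a conservative character set to "_".
    return re.sub(r"[^a-z0-9._]+", "_", name).strip("_")


print(safe_name("Baden_Württemberg"))    # baden_wuerttemberg
print(safe_name("Kreis_Böblingen.wkt"))  # kreis_boeblingen.wkt
```

Because the input is normalized before transliteration, the precomposed and the decomposed spelling of a name map to the same result.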

Technical content for the interested:

The fundamental reason for this problem is that Unicode allows two representations of ü: the single precomposed character (U+00FC) and a plain u followed by a combining diaeresis (U+0308). Both are equivalent, but the filesystems handle them differently: HFS+ on the Mac stores file names in the decomposed form, while Linux filesystems such as ext4 store whatever byte sequence they are given, which is usually the precomposed form. That's why a phantom folder shows up. It can't be added to Git since it's already there, but it shows up in git status since the representation on disk differs from the one that Git tracks internally. Strictly speaking it's not even a bug, so I wouldn't expect a fix on the Git side, and we can't fix it at all. Similar problems occur with capital letters: for macOS, Baden and baden are equivalent, but Linux filesystems treat them as different files (HFS+ is case-preserving and case-insensitive; ext4 is case-preserving and case-sensitive). It's possible to create a Git repo that can't be checked out on a Mac.
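The two representations can be made visible with Python's standard `unicodedata` module. This is a minimal sketch; the helper name `same_name` is made up for the example. The UTF-8 bytes it prints correspond to the octal escapes \303\274 and u\314\210 in the Git and shell output earlier in the thread.

```python
import unicodedata

precomposed = "Baden_W\u00fcrttemberg"   # ü as one code point (U+00FC)
decomposed = "Baden_Wu\u0308rttemberg"   # u followed by combining diaeresis (U+0308)


def same_name(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The strings look identical on screen but differ code point by code point.
print(precomposed == decomposed)         # False
print(precomposed.encode("utf-8"))       # b'Baden_W\xc3\xbcrttemberg'
print(decomposed.encode("utf-8"))        # b'Baden_Wu\xcc\x88rttemberg'

# After normalization they compare equal.
print(same_name(precomposed, decomposed))  # True
```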

Xylakant avatar Sep 07 '16 16:09 Xylakant

@1000miles So to fix this issue, all we need to do is enforce the suggested naming convention, correct?

naming convention "all lowercase, no special chars"

holgerd avatar Oct 13 '16 09:10 holgerd

Related PR: #443

1000miles avatar Oct 19 '16 10:10 1000miles

Update, 31 Oct 2016:

All .wkt files that contain umlauts are renamed in PR https://github.com/sozialhelden/wheelmap/pull/443.

Question:

@holgerd

At the moment all regions are capitalized. Is there still a need to rename them all to lowercase?

1000miles avatar Oct 31 '16 09:10 1000miles