wheelmap-classic
Umlauts lead to duplicate wkt-entries
We have a problem with importing new regions. Currently, regions are imported as .wkt files into https://github.com/sozialhelden/wheelmap/tree/master/db/data/wkt/. Take a close look at regions with umlauts in their names: in the folders /europe/austria, /europe/germany and /europe/switzerland, there are several duplicates like "baden_wuerttemberg.wkt" and "baden_wurttemberg.wkt", as well as duplicate folder names. This causes problems with our Librato metrics, where we have to reassign these data sources to the corresponding dashboards. As we plan to expand and add new regions soon, this should be fixed very soon. @lennerd imported the .wkt files from a Dropbox folder and started the import task, so he can tell exactly how new regions are added to the server.
@holgerd: This should be fixed in one of the next milestones, as it causes problems every time we import new regions that are important for administrative purposes, as happened again now: https://github.com/sozialhelden/wheelmap-privateissues/issues/29
Thank you for reporting this.
Here is our additional input: Mac OS X seems to have trouble handling two directories that have the same name but different encodings.
Example:
Mac OS
On Mac OS we see two Baden-Württemberg folders in the directory wheelmap/db/data/wkt/Europe/Germany/:
→ ls
Baden_Württemberg
Baden_Württemberg
When we try to access them, we can reach only one of the two:
→ cd Baden_Württemberg/
Bodenseekreis/ Kreis_Konstanz/
Virtual Machine
On the virtual machine (OS: Ubuntu) it is possible to see and access both Baden-Württemberg folders by using the tab key to switch between them.
The first Baden_Württemberg folder is selected:
➜ Germany git:(add-new-regions-germany-mexico) cd Baden_Wu<0308>rttemberg/
Baden_Württemberg/ Baden_Württemberg/
The second Baden_Württemberg folder is selected:
➜ Germany git:(add-new-regions-germany-mexico) cd Baden_Württemberg/
Baden_Württemberg/ Baden_Württemberg/
Git
Untracked files:
(use "git add <file>..." to include in what will be committed)
"db/data/wkt/Europe/Germany/Baden_W\303\274rttemberg/Kreis_B\303\266blingen.wkt"
"db/data/wkt/Europe/Germany/Baden_W\303\274rttemberg/Kreis_B\303\266blingen/"
"db/data/wkt/Europe/Germany/Baden_W\303\274rttemberg/Kreis_Ludwigsburg.wkt"
"db/data/wkt/Europe/Germany/Baden_W\303\274rttemberg/Kreis_Ludwigsburg/"
"db/data/wkt/Europe/Germany/Baden_W\303\274rttemberg/Schwarzwald_Baar_Kreis.wkt"
"db/data/wkt/Europe/Germany/Baden_W\303\274rttemberg/Schwarzwald_Baar_Kreis/"
"db/data/wkt/Europe/Germany/Nordrhein_Westfalen/Kreis_D\303\274ren.wkt"
"db/data/wkt/Europe/Germany/Nordrhein_Westfalen/Kreis_D\303\274ren/"
There is also an issue when using Git because it recognizes that there are 2 folders with the same name but with different encoding.
Example:
➜ wheelmap git:(add-new-regions-germany-mexico) ✗ git add db/data/wkt/Europe/Germany/Baden_Wu\314\210rttemberg/Kreis_Ludwigsburg.wkt
fatal: pathspec 'db/data/wkt/Europe/Germany/Baden_Wu314210rttemberg/Kreis_Ludwigsburg.wkt' did not match any files
Using the git add command with tab completion works:
➜ wheelmap git:(add-new-regions-germany-mexico) ✗ git add db/data/wkt/Europe/Germany/Baden_Wu<0308>rttemberg/Kreis_Ludwigsburg.wkt
When we try to delete a file whose path contains an umlaut, we cannot address it because of the encoding issue:
➜ wheelmap git:(add-new-regions-germany-mexico) ✗ rm db/data/wkt/Europe/Germany/Baden_Wu\314\210rttemberg/._Kreis_Ludwigsburg.wkt
rm: cannot remove `db/data/wkt/Europe/Germany/Baden_Wu314210rttemberg/._Kreis_Ludwigsburg.wkt': No such file or directory
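The error output above shows the path as Baden_Wu314210rttemberg, which suggests the shell dropped the backslashes instead of turning \314\210 into bytes. As a minimal sketch of what those escapes stand for, the following Python decodes the octal-escaped paths that git status prints. The helper name decode_git_quoted is hypothetical, and it handles only three-digit octal escapes, not git's other quoting rules.

```python
import re
import unicodedata


def decode_git_quoted(path: str) -> str:
    """Decode \\NNN octal byte escapes, as printed by git status, into text.

    Hypothetical helper for illustration: other escapes git may emit
    (such as \\" or \\\\) are not handled here.
    """
    raw = re.sub(
        rb"\\([0-3][0-7]{2})",                   # a backslash plus three octal digits
        lambda m: bytes([int(m.group(1), 8)]),  # replace with the single byte it names
        path.encode("ascii"),
    )
    return raw.decode("utf-8")


# Path segment as listed under "Untracked files" (precomposed u-umlaut).
composed = decode_git_quoted(r"Baden_W\303\274rttemberg")
# Path segment as typed in the failing git add / rm commands (u + combining mark).
decomposed = decode_git_quoted(r"Baden_Wu\314\210rttemberg")

print([hex(ord(c)) for c in composed[7:8]])    # code point(s) after "Baden_W"
print([hex(ord(c)) for c in decomposed[7:9]])  # code point(s) after "Baden_W"
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)
```

The two names render the same on screen but are different strings, which is why a path typed in one form does not match a directory stored in the other.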
Is there any particular reason the folder needs to be named Baden_Württemberg/? Naming the folder baden_wuerttemberg would sidestep the issue. Generally speaking, the naming convention "all lowercase, no special chars" is good when working with data that crosses system boundaries.
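A rough sketch of how such a convention could be applied to a region name before writing it to disk. The function name safe_name and the transliteration table are assumptions for illustration; the thread itself only shows the mapping of "württemberg" to "wuerttemberg". It is meant for the name without its .wkt extension, since the dot would also be replaced.

```python
import re
import unicodedata

# Assumed transliteration table (not taken from the wheelmap code base).
GERMAN_MAP = {
    "\u00e4": "ae",  # a-umlaut
    "\u00f6": "oe",  # o-umlaut
    "\u00fc": "ue",  # u-umlaut
    "\u00df": "ss",  # sharp s
}


def safe_name(name: str) -> str:
    """Return an all-lowercase, ASCII-only version of a region name."""
    # Compose first, so that "u" + combining diaeresis becomes one character
    # and is found in the table regardless of how the input was encoded.
    text = unicodedata.normalize("NFC", name).lower()
    for src, dst in GERMAN_MAP.items():
        text = text.replace(src, dst)
    # Drop any remaining accents: decompose, then discard non-ASCII marks.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Collapse every run of other characters into a single underscore.
    return re.sub(r"[^a-z0-9]+", "_", text).strip("_")


print(safe_name("Baden_W\u00fcrttemberg"))   # precomposed input
print(safe_name("Baden_Wu\u0308rttemberg"))  # decomposed input
```

Both encodings of the folder name map to the same result, so the import would no longer produce two entries for one region.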
Technical content for the interested:
The fundamental reason for this problem is that Unicode allows two representations of the ü: the precomposed character ü (U+00FC) and a plain u followed by a combining diaeresis (U+0308) are equivalent. HFS+ on the Mac stores file names in the decomposed form, while Linux filesystems such as ext4 store whatever bytes they are given, which is typically the precomposed form. That's why a phantom folder shows up. It can't be added to git since it's already there, but it shows up in git status since the one on disk differs in its representation from the one that git tracks internally. Strictly speaking it's not even a bug, so I wouldn't expect a fix on the git side, and we can't fix it at all. Similar problems occur with capital letters: for macOS, Baden and baden are equivalent, but Linux filesystems treat them as different files (HFS+ is case preserving and case insensitive, ext4 is case preserving and case sensitive). It's possible to create a git repo that can't be checked out on a Mac.
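To make both pitfalls concrete, here is a small sketch that groups names which differ only in Unicode normalization or in letter case. The function find_collisions and the sample list are hypothetical; in practice the names could come from os.listdir on a Linux machine, where both variants can exist side by side.

```python
import unicodedata
from collections import defaultdict


def find_collisions(names):
    """Group names that are equal after NFC normalization and case folding."""
    groups = defaultdict(list)
    for name in names:
        # NFC merges the two encodings of the umlaut; casefold() merges case.
        key = unicodedata.normalize("NFC", name).casefold()
        groups[key].append(name)
    # Only groups with more than one member are potential clashes.
    return [group for group in groups.values() if len(group) > 1]


# Illustrative sample, not the actual repository contents.
names = [
    "Baden_W\u00fcrttemberg",   # precomposed u-umlaut
    "Baden_Wu\u0308rttemberg",  # u + combining diaeresis
    "Bayern",
    "bayern",
    "Hessen",
]

collisions = find_collisions(names)
for group in collisions:
    print([name.encode("utf-8") for name in group])
```

Running a check like this before committing new regions would flag names that look identical or differ only in case, before they reach a machine with a different filesystem.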
@1000miles So to fix this issue all we need to do is to ensure the suggested naming convention, correct?
naming convention "all lowercase, no special chars"
Related PR: #443
Update, 31 Oct 2016:
All .wkt files that contain umlauts were renamed in PR https://github.com/sozialhelden/wheelmap/pull/443.
Question:
@holgerd
At the moment all regions are capitalized. Is there still a need to rename them all to lowercase?