Wrong encoding
How can I fix the country names so that these strange characters do not appear?
In code:
In database:
I changed the encoding of the table & columns to UTF-16LE:
I'm having this issue too!
Same problem here. How can I fix it?
You can utf8_decode($countrie["name_en"])
Can it be that the encoding in the backend is wrong? Or double encoded?
Same issue here
You can utf8_decode($countrie["name_en"])
Yes, this works, but it doesn't make sense to me. Seems that something is double encoded, haven't checked the sources though yet.
I've made a check on the json countries file.
It seems that the double encoded values are the translated ones (Eg. the ones in the fields "name_XX").
For example, Österreich is encoded in name_de as "\u00c3\u0096sterreich", and utf8_decode returns the correct value of "\u00d6sterreich", which is the value under the "name->native->bar->common" field.
For example, Österreich is encoded in name_de as "\u00c3\u0096sterreich", and utf8_decode returns the correct value of "\u00d6sterreich".
Yes, exactly. utf8_decode fixes it for the moment. We'll have to keep an eye on when this gets fixed in the package, so we can remove our utf8_decode calls once it is.
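To make the exchange above concrete, here is a minimal sketch (assuming PHP 7+ and the exact JSON values quoted above) of how the two forms behave under utf8_decode:

// Double-encoded value as found in "name_de"
$doubleEncoded = json_decode('"\u00c3\u0096sterreich"');

// Correctly encoded value as found under "name->native->bar->common"
$correct = json_decode('"\u00d6sterreich"');

echo utf8_decode($doubleEncoded); // "Österreich": the resulting bytes happen to be valid UTF-8 again
echo $correct;                    // "Österreich": already fine, no decoding needed
echo utf8_decode($correct);       // Latin-1 bytes, no longer valid UTF-8, will display garbled

So utf8_decode only "works" here because the translated columns were effectively encoded to UTF-8 twice.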
I used the solution suggested here as a Laravel Collection macro, and it worked:
use Illuminate\Support\Collection;
use PragmaRX\Countries\Package\Countries as Country;

Collection::macro('decode', function () {
    return $this->map(function ($value) {
        return utf8_decode($value);
    });
});

return Country::all()->pluck('name_tr', 'cca3')->decode();
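If it helps anyone, one common place to register such a macro is a service provider's boot() method; a sketch assuming a default Laravel app layout (the provider name and location are framework defaults, not anything from this package):

// app/Providers/AppServiceProvider.php
use Illuminate\Support\Collection;

public function boot()
{
    Collection::macro('decode', function () {
        return $this->map(function ($value) {
            return utf8_decode($value);
        });
    });
}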
The reason it's not done is that it's not easy to decode/re-encode them all correctly. Something I always have to say: the data we have here was not produced by me; it's a collection from many other sources, people just choose what they want/can use, and I have zero control over this.
Unfortunately, utf8_decode() is not a solution either. While trying to insert all the cities into a PostgreSQL database, I got this myself:

So if someone can come up with a strong solution for correctly encoding everything to UTF-8, I'm more than pleased to merge a PR.
Cheers!
This is working for me:
protected function decode(?string $name): ?string
{
    if (blank($name) || mb_detect_encoding($name) !== 'UTF-8') {
        return $name;
    }

    return utf8_decode($name);
}
But I'm unsure whether we should do this in the package. I can't check that ALL encodings are good, and probably not every single one will be fixed. Also, generating all the files, which is already very slow, will take a lot more time. Any thoughts?
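For context, this is roughly how such a helper could be applied to one of the translated columns (a sketch: the pluck chain mirrors the earlier macro example, and it assumes the call happens inside whatever class defines decode()):

use PragmaRX\Countries\Package\Countries;

$names = Countries::all()
    ->pluck('name_de', 'cca3')
    ->map(function ($name) {
        return $this->decode($name);
    });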
It didn't really fix them all; there were still a lot of wrongly encoded strings, so I found this forceutf8 package, which solved it (not fully either, some are still wrong, but it's way better):
use ForceUTF8\Encoding;

protected function decode(?string $name): ?string
{
    if (blank($name)) {
        return $name;
    }

    if (mb_detect_encoding($name) !== 'UTF-8') {
        return Encoding::toUTF8($name);
    }

    return Encoding::fixUTF8($name);
}
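For anyone else landing here: the forceutf8 package mentioned above appears to be neitanod/forceutf8 (an assumption, based on the ForceUTF8\Encoding class used). A minimal sketch of using it on the name_de value from earlier:

// composer require neitanod/forceutf8
use ForceUTF8\Encoding;

$broken = json_decode('"\u00c3\u0096sterreich"'); // double-encoded "name_de" value

// fixUTF8() undoes the extra round of UTF-8 encoding, so this should print "Österreich"
echo Encoding::fixUTF8($broken);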
@antonioribeiro, where are you getting the countries data from, and how are you putting that data together? Regarding the data, I think it would make sense to fix it in the JSON files, even if it comes from many sources.
Not sure if it helps right now, but initially I thought I have a conversion issue, so I opened up this thread on Stack Overflow: https://stackoverflow.com/questions/65956182/php-unicode-to-character-conversion
@klodoma, here is the list of sources I'm using: https://github.com/antonioribeiro/countries#copyright. Sanitizing the encoding is not impossible, but it's a lot of data to sanitize, and the sources may require different strategies.
The issue is that part of the data is in Unicode codepoint notation ("common": "\u00d6sterreich"), while a few lines down the same name is encoded as a UTF-8 hex byte string ("name_de": "\u00c3\u0096sterreich"). I can't imagine how the decoder should know what to do here. The first string is translated correctly into Österreich, while the second is translated into \u00d6sterreich (that's why utf8_decode works for us in that case).
So, should we go with utf8_decode? Yes, but... be aware that if you are using one of the columns that are encoded differently (like name->common or name->native), you will end up with a binary string representation:

    utf8_decode(json_decode('"' . "\u00d6sterreich" . '"'))
    => b"Österreich"
No fun... I would suggest rebuilding the JSON files with consistent encoding; otherwise, for that matter, I am going to go back to using mledoze/countries, which was in better shape in that regard.
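In case rebuilding the files is the way to go, here is a rough sketch of a one-off re-encode pass (the directory, file pattern, and helper name are illustrative assumptions, not the package's actual layout, and it relies on the forceutf8 package mentioned above):

use ForceUTF8\Encoding;

// Walk every value of the decoded JSON and repair double-encoded strings.
function fixEncodingDeep($value)
{
    if (is_array($value)) {
        return array_map('fixEncodingDeep', $value);
    }

    return is_string($value) ? Encoding::fixUTF8($value) : $value;
}

foreach (glob(__DIR__ . '/data/*.json') as $file) { // illustrative path
    $data = json_decode(file_get_contents($file), true);

    file_put_contents(
        $file,
        json_encode(fixEncodingDeep($data), JSON_UNESCAPED_UNICODE | JSON_PRETTY_PRINT)
    );
}

As noted earlier in the thread, fixUTF8 doesn't catch every broken string, so the result would still need spot checks against values like name->native.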
Indeed, the encoding is inconsistent, and I do believe @lupinitylabs's suggestion makes sense. Is there any solution brewing for this, @antonioribeiro? Regardless, I'd suggest adding technical information to your README to help developers handle those inconsistencies properly.
Any update here, @antonioribeiro?