
Wrong encoding

Open art-es opened this issue 5 years ago • 17 comments

How can I fix the country names so that these strange characters do not appear?

In code: (screenshot)

In the database: (screenshot)

I changed the encoding of the table & columns to UTF-16LE: (screenshots)

art-es avatar Jan 06 '20 08:01 art-es

I'm running into this issue too!

VictorPulzz avatar Jan 14 '20 17:01 VictorPulzz

Same problem here. How can I fix it?

remif25 avatar Apr 30 '20 10:04 remif25

You can use utf8_decode($countrie["name_en"])

Marivint avatar Aug 29 '20 15:08 Marivint
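For context on why this works: the broken values appear to be UTF-8 strings that were encoded a second time, and utf8_decode() strips one layer. Note that utf8_decode() is deprecated since PHP 8.2; mb_convert_encoding() is the usual replacement. A minimal sketch:

```php
<?php
// "Österreich" double-encoded: the UTF-8 bytes of "Ö" (0xC3 0x96) were
// re-encoded as if they were Latin-1 characters, becoming U+00C3 U+0096.
$broken = "\u{00C3}\u{0096}sterreich";

// utf8_decode() maps code points U+0000..U+00FF back to single bytes,
// restoring the original UTF-8 byte sequence.
$fixed = utf8_decode($broken);

// utf8_decode() is deprecated since PHP 8.2; this call is equivalent:
$also = mb_convert_encoding($broken, 'ISO-8859-1', 'UTF-8');

var_dump($fixed === "Österreich", $fixed === $also); // bool(true) bool(true)
```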

Can it be that the encoding in the backend is wrong? Or double encoded?

klodoma avatar Jan 29 '21 16:01 klodoma

Same issue here

devoncmather avatar Feb 08 '21 11:02 devoncmather

> You can utf8_decode($countrie["name_en"])

Yes, this works, but it doesn't make sense to me. Seems that something is double encoded, haven't checked the sources though yet.

klodoma avatar Feb 08 '21 12:02 klodoma

I've checked the countries JSON file.

It seems that the double-encoded values are the translated ones (e.g. those in the "name_XX" fields). For example, Österreich is encoded in name_de as "\u00c3\u0096sterreich", and a utf8_decode returns the correct value "\u00d6sterreich", which is the value under the "name->native->bar->common" field.
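This can be checked directly; a sketch using a hypothetical JSON fragment mirroring the fields described above:

```php
<?php
// Hypothetical fragment mirroring the countries JSON fields described above.
$json = '{"name_de": "\u00c3\u0096sterreich", "common": "\u00d6sterreich"}';
$data = json_decode($json, true);

// The translated field carries one extra layer of UTF-8 encoding;
// stripping it with utf8_decode() yields the correctly encoded value.
var_dump(utf8_decode($data['name_de']) === $data['common']); // bool(true)
```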

giannicic avatar Feb 11 '21 17:02 giannicic

> For example, Österreich is encoded in name_de as "\u00c3\u0096sterreich", and a utf8_decode returns the correct value "\u00d6sterreich"

Yes, exactly. utf8_decode fixes it for the moment.

We'll have to watch for when this gets fixed in the package and remove our utf8_decode once it is.

klodoma avatar Feb 11 '21 18:02 klodoma

I used the solution suggested here as a Laravel Collection macro and it worked:

use Illuminate\Support\Collection;
use PragmaRX\Countries\Package\Countries as Country;

// Add a decode() macro that runs utf8_decode over every value in the collection.
Collection::macro('decode', function () {
    return $this->map(function ($value) {
        return utf8_decode($value);
    });
});

return Country::all()->pluck('name_tr', 'cca3')->decode();

ademtepe avatar May 07 '21 09:05 ademtepe

The reason this hasn't been done is that it's not easy to decode/re-encode everything correctly. Something I always have to say: the data we have here was not produced by me; it's a collection from many other sources, and people just pick whatever they want or can use. I have zero control over this.

Unfortunately, utf8_decode() is not a solution either. While trying to insert all cities into a PostgreSQL database, I ran into this myself:

(screenshot)

So if someone can come up with a robust solution for correctly encoding everything to UTF-8, I'd be more than pleased to merge a PR.
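One way to surface such rows before PostgreSQL rejects the insert is to validate each string first; a sketch using mb_check_encoding() (the $names array is a placeholder for the package's city data):

```php
<?php
// Placeholder values standing in for city names from the package.
$names = [
    "Z\u{FC}rich",   // "Zürich", valid UTF-8
    "Z\xFCrich",     // raw Latin-1 byte 0xFC: not well-formed UTF-8
];

$invalid = [];
foreach ($names as $i => $name) {
    // mb_check_encoding() returns false for byte sequences that are not
    // well-formed UTF-8, exactly what makes PostgreSQL reject the row.
    if (!mb_check_encoding($name, 'UTF-8')) {
        $invalid[] = $i;
    }
}

var_dump($invalid); // array(1) { [0]=> int(1) }
```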

Cheers!

antonioribeiro avatar May 16 '21 22:05 antonioribeiro

This is working for me:

protected function decode(?string $name): ?string
{
    if (blank($name) || mb_detect_encoding($name) !== 'UTF-8') {
        return $name;
    }

    return utf8_decode($name);
}

But I'm unsure whether we should do this in the package. I can't verify that ALL encodings are good, and probably not every single one will be fixed. Also, it will make generating all the files, which is already very slow, take a lot more time. Any thoughts?

antonioribeiro avatar May 17 '21 18:05 antonioribeiro

It didn't really fix them all; a lot of strings were still wrongly encoded. So I found the forceutf8 package, which solved it (not fully either, some strings are still wrong, but it's way better):

use ForceUTF8\Encoding;

protected function decode(?string $name): ?string
{
    if (blank($name)) {
        return $name;
    }

    // String doesn't look like UTF-8 at all: convert it to UTF-8.
    if (mb_detect_encoding($name) !== 'UTF-8') {
        return Encoding::toUTF8($name);
    }

    // Already UTF-8: repair double encodings and other garbling.
    return Encoding::fixUTF8($name);
}

antonioribeiro avatar May 18 '21 01:05 antonioribeiro

@antonioribeiro where are you getting the countries data from, and how are you putting it together? I think it would make sense to fix the data in the JSON files themselves, even if it comes from many sources.

Not sure if it helps right now, but I initially thought I had a conversion issue, so I opened this thread on Stack Overflow: https://stackoverflow.com/questions/65956182/php-unicode-to-character-conversion

klodoma avatar May 18 '21 07:05 klodoma

@klodoma, here is the list of sources I'm using: https://github.com/antonioribeiro/countries#copyright. Sanitizing the encoding is not impossible, but it's a lot of data to sanitize, and the sources may require different strategies.

antonioribeiro avatar May 19 '21 22:05 antonioribeiro

The issue is that part of the data uses Unicode code point notation ("common": "\u00d6sterreich"), while a few lines down the same name is stored as per-byte escapes of its UTF-8 encoding ("name_de": "\u00c3\u0096sterreich"). I can't imagine how a decoder could know what to do here. The first string decodes correctly into Österreich, while the second decodes into \u00d6sterreich (that's why utf8_decode works for us in that case).

So, should we go with utf8_decode? Yes, but be aware that if you use it on one of the fields that is encoded correctly (like name->common or name->native), you will end up with a binary string representation:

utf8_decode(json_decode('"'. "\u00d6sterreich" . '"'))
=> b"Österreich"

No fun... I would suggest rebuilding the JSON files with consistent encoding; otherwise, I am going to go back to using mledoze/countries, which was in better shape in that regard.
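The asymmetry is easy to reproduce (a sketch; json_decode() turns the \uXXXX escapes into UTF-8 in both cases, but only one of them needs the extra decode):

```php
<?php
// Code point notation: already correct after json_decode().
$common = json_decode('"\u00d6sterreich"'); // "Österreich"

// Per-byte escapes: double-encoded, fixed by one utf8_decode() pass.
$nameDe = json_decode('"\u00c3\u0096sterreich"');
var_dump(utf8_decode($nameDe) === $common); // bool(true)

// Applying utf8_decode() to the already-correct value destroys it:
// "Ö" (U+00D6) collapses to the lone byte 0xD6, which is invalid UTF-8.
$mangled = utf8_decode($common);
var_dump(mb_check_encoding($mangled, 'UTF-8')); // bool(false)
```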

lupinitylabs avatar Sep 04 '21 00:09 lupinitylabs

Indeed, the encoding is inconsistent, and I believe @lupinitylabs's suggestion makes sense. Is there any solution brewing for this, @antonioribeiro? Regardless, I'd suggest adding technical information to the README to help developers handle these inconsistencies properly.

ftrudeau-pelcro avatar Oct 01 '21 17:10 ftrudeau-pelcro

Any update here, @antonioribeiro?

ftrudeau-pelcro avatar Oct 27 '21 01:10 ftrudeau-pelcro