mysql-convert-latin1-to-utf8 icon indicating copy to clipboard operation
mysql-convert-latin1-to-utf8 copied to clipboard

utf8 string cut off

Open Keno-Chile opened this issue 7 years ago • 4 comments

After runnig the script words like: Américo, año, agüita, and other spanish words become: Am, a, ag (the words got sliced. I tryed changing ther encoding to utf8mb4, but got the same result.

Keno-Chile avatar Aug 07 '18 16:08 Keno-Chile

@Keno-Chile can you share a partial dump of your input data you're trying to convert?

nicjansma avatar Sep 29 '18 23:09 nicjansma

I have the same issue. For example "DO BI¯UTERII" is changed to "DO BI"

dszyfelb avatar Oct 19 '19 20:10 dszyfelb

@dszyfelb a shared partial data dump might be helpful to debug!

nicjansma avatar Dec 07 '19 14:12 nicjansma

Thanks again for sharing your work! I've encountered the same issue.

I tested thoroughly on a staging server loaded with a copy of production data, first executing on the structure only (no errors), and then on the full database. All seemed to go well. Knowing many of the records in our 10 year old MySQL database are non-critical, I executed on the production database after confirming all current/sensitive records were intact. Here's the collation map I used:

// TODO: The collation you want to convert the overall database to
$defaultCollation = 'utf8mb4_0900_ai_ci';

// TODO Convert column collations and table defaults using this mapping
// latin1_swedish_ci is included since that's the MySQL default
$collationMap =
 array(
  'latin1_bin'        => 'utf8_bin',
  'latin1_general_ci' => 'utf8mb4_0900_ai_ci',
  'latin1_swedish_ci' => 'utf8mb4_0900_ai_ci',
 );
  1. All instances of FedEx International Economy® became FedEx International Economy
  2. Most German was dropped: Schönfliess become Sch. The same happened to words containing characters ü and ß, etc.
  3. Działdowska remained Działdowska

PHP version: 7.4.3 MySQL: 8.0.22-0ubuntu0.20.04.3 (Ubuntu)

Although the lost (old) data is unimportant to our needs, I have full backups of the original data in latin1_swedish_ci and have pulled a few of the problematic records. I'll be glad to experiment further if you have any ideas.

Edit: Looking through my SQL backups, all backed up tables were dumped without explicit declaration of the collation, i.e.: ENGINE=InnoDB DEFAULT CHARSET=latin1;

After running the script, all tables now show an explicit collation, even though this matches the MySQL 8.0 server default: ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

Is it possible the script was unable to read the collation?

jseaber avatar Jan 27 '21 03:01 jseaber