BR failed to backup stats
Please answer these questions before submitting your issue. Thanks!

- What did you do? If possible, provide a recipe for reproducing the error.
  Upgraded a TiDB cluster from v4.0.0 to v4.0.9, then used BR v4.0.9 to back up.
- What did you expect to see?
  The backup succeeds.
- What did you see instead?
```
[ERROR] [backup.go:25] ["failed to backup"] [error="[types:1406]Data Too Long, field len 5, data len 6"] [errorVerbose="[types:1406]Data Too Long, field len 5, data len 6
github.com/pingcap/errors.AddStack
    github.com/pingcap/[email protected]/errors.go:174
github.com/pingcap/errors.(*Error).GenWithStack
    github.com/pingcap/[email protected]/normalize.go:147
github.com/pingcap/tidb/types.ProduceStrWithSpecifiedTp
    github.com/pingcap/[email protected]/types/datum.go:995
github.com/pingcap/tidb/types.(*Datum).convertToString
    github.com/pingcap/[email protected]/types/datum.go:942
github.com/pingcap/tidb/types.(*Datum).ConvertTo
    github.com/pingcap/[email protected]/types/datum.go:821
github.com/pingcap/tidb/statistics/handle.(*Handle).histogramFromStorage
    github.com/pingcap/[email protected]/statistics/handle/handle.go:667
github.com/pingcap/tidb/statistics/handle.(*Handle).columnStatsFromStorage
    github.com/pingcap/[email protected]/statistics/handle/handle.go:460
github.com/pingcap/tidb/statistics/handle.(*Handle).tableStatsFromStorage
    github.com/pingcap/[email protected]/statistics/handle/handle.go:539
github.com/pingcap/tidb/statistics/handle.(*Handle).tableStatsToJSON
    github.com/pingcap/[email protected]/statistics/handle/dump.go:89
github.com/pingcap/tidb/statistics/handle.(*Handle).DumpStatsToJSON
    github.com/pingcap/[email protected]/statistics/handle/dump.go:76
github.com/pingcap/br/pkg/backup.BuildBackupRangeAndSchema
    github.com/pingcap/br@/pkg/backup/client.go:340
github.com/pingcap/br/pkg/task.RunBackup
    github.com/pingcap/br@/pkg/task/backup.go:257
github.com/pingcap/br/cmd.runBackupCommand
    github.com/pingcap/br@/cmd/backup.go:24
github.com/pingcap/br/cmd.newFullBackupCommand.func1
    github.com/pingcap/br@/cmd/backup.go:85
github.com/spf13/cobra.(*Command).execute
    github.com/spf13/[email protected]/command.go:842
github.com/spf13/cobra.(*Command).ExecuteC
    github.com/spf13/[email protected]/command.go:950
github.com/spf13/cobra.(*Command).Execute
    github.com/spf13/[email protected]/command.go:887
main.main
    github.com/pingcap/br@/main.go:58
runtime.main
    runtime/proc.go:203
runtime.goexit
    runtime/asm_amd64.s:1357"]
```
- What version of BR and TiDB/TiKV/PD are you using?
  BR v4.0.9, TiDB v4.0.9 (upgraded from v4.0.0).
The reason for the failure is that `DumpStatsToJSON` is not compatible when TiDB is upgraded from v4.0.0.
As a workaround, users can add `--ignore-stats=true` to the command to skip backing up stats. We need to find a better way to back up stats that also handles compatibility.
This is a bug in TiDB. Still, such a failure probably shouldn't abort backup/restore, since stats aren't essential.
Also, the dump-stats error should log which table is the cause.
https://github.com/pingcap/br/blob/458d542b19ca8563cac4322888063476cd050c4d/pkg/backup/client.go#L348-L358
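A hedged sketch of how the dump loop could tolerate per-table failures, logging the failing table instead of aborting the whole backup. All names here (`dumpTableStats`, `collectStats`) are hypothetical stand-ins, not BR's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// dumpTableStats stands in for handle.DumpStatsToJSON; it fails for one
// table to simulate the incompatible-stats error from the issue.
func dumpTableStats(table string) ([]byte, error) {
	if table == "db.bad" {
		return nil, errors.New("[types:1406]Data Too Long, field len 5, data len 6")
	}
	return []byte(`{"stats":"ok"}`), nil
}

// collectStats keeps going when a single table's stats dump fails,
// recording the failing table name instead of aborting the backup.
func collectStats(tables []string) (ok map[string][]byte, failed []string) {
	ok = make(map[string][]byte)
	for _, t := range tables {
		data, err := dumpTableStats(t)
		if err != nil {
			// Log which table caused the failure, then continue.
			fmt.Printf("[WARN] skip stats for %s: %v\n", t, err)
			failed = append(failed, t)
			continue
		}
		ok[t] = data
	}
	return ok, failed
}

func main() {
	ok, failed := collectStats([]string{"db.good", "db.bad"})
	fmt.Println(len(ok), failed) // 1 [db.bad]
}
```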
What is the trigger mechanism of the dumpStats failure?
Stats shouldn't cause the backup to fail. If the stats can be backed up and restored, that's good. If BR can't back up stats, it needs to output a backup summary log indicating that backing up stats failed; and if stats are missing when restoring, BR needs to be able to support running `analyze table`.
Stats information is stored in the backupmeta. What if the backup covers an entire large-scale cluster? How do we deal with a very large backupmeta? Does this design still make sense?
if the stats field is nil during restore, it will just be skipped.
https://github.com/pingcap/br/blob/458d542b19ca8563cac4322888063476cd050c4d/pkg/restore/client.go#L816-L826
Missing stats won't prevent users from accessing the data; the queries just won't be optimized. `ANALYZE TABLE` can be performed separately.
OK, that can serve as a workaround.
For a better user experience, restore needs to support `analyze table` for tables without stats. Many users may not know about `analyze table`, and restore should be a complete operation: the cluster should be in a good state after restoring.
Running `analyze table` requires scanning the tables again though, which is slow (especially when restoring from an old backup where the stats are all missing).
That is true, but `analyze table` is necessary, so we should make sure that backing up statistics succeeds in most situations.
As discussed with @kennytm, some consensuses are as follows:

- The most important thing is that BR needs to ensure the success rate of backing up `stats`: most statistics should be backed up successfully, with a success rate reaching 99.9%.
- A table restored without `stats` can be considered unavailable, and requires human or `restore` program intervention.
- The `ignore-stats` option of backup is mainly for testing. If someone does `backup` with `ignore-stats`, `restore` does not need to execute `analyze table` automatically.
- BR restore should provide an option `skip-analyze` whose default value is false.
- Backups from TiDB (version < 4.0.9) can skip `analyze table`.
- If there are table backups that do not contain `stats` information, the user should be reminded that `restore` would take a long time.
- If `restore` fails to execute the `analyze table` statement, the log will prompt the user to manually execute `analyze table`, and restore continues with the other tables.
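The last consensus item (log a prompt and keep restoring when `analyze table` fails) could be sketched roughly as below. `execSQL` and `analyzeRestoredTables` are invented names for illustration; real BR code would go through its own session/glue layer:

```go
package main

import (
	"errors"
	"fmt"
)

// execSQL is a stand-in for running a statement through the TiDB
// connection. One table fails to simulate an interrupted ANALYZE.
func execSQL(stmt string) error {
	if stmt == "ANALYZE TABLE `db`.`flaky`" {
		return errors.New("analyze interrupted")
	}
	return nil
}

// analyzeRestoredTables runs ANALYZE for each table restored without
// stats. A failure only logs a prompt for manual ANALYZE and moves on,
// so the restore itself still succeeds.
func analyzeRestoredTables(tables []string) (manual []string) {
	for _, t := range tables {
		stmt := fmt.Sprintf("ANALYZE TABLE %s", t)
		if err := execSQL(stmt); err != nil {
			fmt.Printf("[WARN] %s failed: %v; please run it manually\n", stmt, err)
			manual = append(manual, t)
		}
	}
	return manual
}

func main() {
	manual := analyzeRestoredTables([]string{"`db`.`ok`", "`db`.`flaky`"})
	fmt.Println(manual)
}
```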
My preferences are:
4. The option should be `--analyze` (default true), which can be set to false for those who really can't wait.
5. No need to do a version check; we run ANALYZE even when restoring pre-4.0.9 backup archives.
8. We should clearly document this ANALYZE behavior.
I have no problem with these preferences ⏩
For versions < 4.0.9, I have no special requirements. In order to restore quickly, we can allow restores of backups from versions below 4.0.9 to skip `analyze table`; that can be optional. But I don't want to see many stats backup failures after v4.0.9, to avoid overly long restore times.
I think we made some wrong decisions before, solving the immediate problem out of inertia. I have a new proposal. The product factors considered are as follows: first, users should not need to know about statistics at all; second, we should ensure a high success rate for backing up and restoring statistics.
I suggest not introducing more complicated logic. Since `ignore-stats` already exists, we can temporarily shelve it, but I recommend not introducing `skip-analyze`; we do not want this parameter in the product long term, as it is only a temporary solution. What we really need is a fast, highly reliable way to back up and restore statistics, unless we simply can't build one.
The recommended processing logic is as follows:
- If backing up statistics fails, the backup fails.
- If restoring statistics fails, the restore fails, but it can prompt the user that the data has been successfully restored and that they can manually execute `analyze table`; we could even support restoring statistics separately.
@kennytm @3pointer
So we'll be changing how we back up stats, and also add support for system tables like RBAC info and global variables.
- We will back up all relevant `mysql.*` tables via normal SST backup. This includes user information, statistics, etc. The schema is first renamed to `tidb_br_temp_mysql.*`. Since `mysql.*` tables are treated like a normal backup, it should never trigger processing logic (1), as they are now equal.
- After restoring the `tidb_br_temp_mysql.*` tables, we will use the SQL `REPLACE INTO mysql.xxx SELECT ... FROM tidb_br_temp_mysql.xxx WHERE ...` to insert the data back into the real `mysql.*` tables. These SQL statements may fail, but they won't impact the overall data, as logic (2) suggests.
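A minimal sketch of building that `REPLACE INTO ... SELECT` statement, assuming the `tidb_br_temp_mysql` prefix from the proposal. `replaceStmt` and the WHERE filter are illustrative, not the actual implementation:

```go
package main

import "fmt"

// tempDB mirrors the proposed temporary schema name for restored
// system tables.
const tempDB = "tidb_br_temp_mysql"

// replaceStmt builds the SQL that copies one restored system table back
// into the real mysql schema. The WHERE clause lets restore filter out
// rows that must not overwrite the target cluster (e.g. the current
// root user); here it is just passed through as-is.
func replaceStmt(table, where string) string {
	stmt := fmt.Sprintf("REPLACE INTO mysql.%s SELECT * FROM %s.%s", table, tempDB, table)
	if where != "" {
		stmt += " WHERE " + where
	}
	return stmt
}

func main() {
	fmt.Println(replaceStmt("user", "user <> 'root'"))
	// REPLACE INTO mysql.user SELECT * FROM tidb_br_temp_mysql.user WHERE user <> 'root'
}
```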
To further support logic (2), and to support backward/forward compatibility:
- BR will never execute ANALYZE.
- We will add a flag `--system-tables='minimal,privileges,statistics'` to backup and restore to allow skipping part of the system tables when they exist.
- Version compatibility:
  - 4.0.8 backup → New restore: If the system table is completely absent from the backup archive but `--system-tables` specifies it, a non-fatal warning is issued about what should be done. For `--system-tables=statistics` specifically, we will list all selected tables for restore as potential targets of ANALYZE.
  - 4.0.9 backup → New restore: The `Stats` field will just be ignored. BR will ask the user to run ANALYZE manually. Also, the `--ignore-stats` flag becomes a no-op.
  - New backup → 4.0.9 restore: We must never back up the system tables under the name `mysql.*` directly, since this would corrupt older clusters that don't understand how to work with them. Therefore, during backup we must already rename `mysql.*` to `tidb_br_temp_mysql.*`.
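A possible shape for parsing the proposed `--system-tables` flag value into a lookup set, so restore can check whether e.g. the statistics group was requested. The function name and behavior here are an assumption for illustration, not the actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// parseSystemTables splits a comma-separated flag value such as
// "minimal,privileges,statistics" into a set, ignoring empty entries
// and surrounding whitespace.
func parseSystemTables(flag string) map[string]bool {
	set := make(map[string]bool)
	for _, part := range strings.Split(flag, ",") {
		if p := strings.TrimSpace(part); p != "" {
			set[p] = true
		}
	}
	return set
}

func main() {
	set := parseSystemTables("minimal,privileges,statistics")
	fmt.Println(set["statistics"], set["raw"]) // true false
}
```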
- Feature compatibility:
  - We haven't investigated whether incremental backup can handle system tables. There are two things to consider:
    - The incrementally backed-up system tables do not contain complete information.
    - If a version upgrade happened between last-backupts and backupts, there may be DDLs operating on `mysql.*` specifically. They should either be skipped (the current situation) or renamed.
  - Raw backup/restore is simply irrelevant.
  - Online restore: the restore-SST part is the same as a normal restore. The execute-SQL part will be done directly in the BR node. We may use `REPLACE LOW_PRIORITY INTO …` if it makes any difference. Each group of SQL execution should happen inside a transaction. There may be a problem when the system table is too large to fit in a single transaction. That is, without large-transaction support, we may have to split those `REPLACE` statements into multiple self-consistent batches, and this could complicate the logic quite a lot.
- `mysql.user` may contain account information that the customer may have forgotten.
- TiDB allows a customer to create a table in the `mysql` schema.
I suggest we back up the system tables, and allow opting out when restoring.
@kennytm @IANTHEREAL
The second reason is very important and critical.
As a matter of principle, we should back up everything and restore selectively, instead of backing up selectively.