
BR failed to backup stats

Open 3pointer opened this issue 5 years ago • 14 comments

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error. Upgrade the TiDB cluster from v4.0.0 to v4.0.9, then use BR v4.0.9 to back up.

  2. What did you expect to see? Backup succeeds.

  3. What did you see instead?

```
[ERROR] [backup.go:25] ["failed to backup"] [error="[types:1406]Data Too Long, field len 5, data len 6"] [errorVerbose="[types:1406]Data Too Long, field len 5, data len 6
github.com/pingcap/errors.AddStack
	github.com/pingcap/[email protected]/errors.go:174
github.com/pingcap/errors.(*Error).GenWithStack
	github.com/pingcap/[email protected]/normalize.go:147
github.com/pingcap/tidb/types.ProduceStrWithSpecifiedTp
	github.com/pingcap/[email protected]/types/datum.go:995
github.com/pingcap/tidb/types.(*Datum).convertToString
	github.com/pingcap/[email protected]/types/datum.go:942
github.com/pingcap/tidb/types.(*Datum).ConvertTo
	github.com/pingcap/[email protected]/types/datum.go:821
github.com/pingcap/tidb/statistics/handle.(*Handle).histogramFromStorage
	github.com/pingcap/[email protected]/statistics/handle/handle.go:667
github.com/pingcap/tidb/statistics/handle.(*Handle).columnStatsFromStorage
	github.com/pingcap/[email protected]/statistics/handle/handle.go:460
github.com/pingcap/tidb/statistics/handle.(*Handle).tableStatsFromStorage
	github.com/pingcap/[email protected]/statistics/handle/handle.go:539
github.com/pingcap/tidb/statistics/handle.(*Handle).tableStatsToJSON
	github.com/pingcap/[email protected]/statistics/handle/dump.go:89
github.com/pingcap/tidb/statistics/handle.(*Handle).DumpStatsToJSON
	github.com/pingcap/[email protected]/statistics/handle/dump.go:76
github.com/pingcap/br/pkg/backup.BuildBackupRangeAndSchema
	github.com/pingcap/br@/pkg/backup/client.go:340
github.com/pingcap/br/pkg/task.RunBackup
	github.com/pingcap/br@/pkg/task/backup.go:257
github.com/pingcap/br/cmd.runBackupCommand
	github.com/pingcap/br@/cmd/backup.go:24
github.com/pingcap/br/cmd.newFullBackupCommand.func1
	github.com/pingcap/br@/cmd/backup.go:85
github.com/spf13/cobra.(*Command).execute
	github.com/spf13/[email protected]/command.go:842
github.com/spf13/cobra.(*Command).ExecuteC
	github.com/spf13/[email protected]/command.go:950
github.com/spf13/cobra.(*Command).Execute
	github.com/spf13/[email protected]/command.go:887
main.main
	github.com/pingcap/br@/main.go:58
runtime.main
	runtime/proc.go:203
runtime.goexit
	runtime/asm_amd64.s:1357"] [
```
  4. What version of BR and TiDB/TiKV/PD are you using? BR v4.0.9, TiDB v4.0.9 (upgraded from v4.0.0)

3pointer avatar Jan 06 '21 02:01 3pointer

The failure is because `DumpStatsToJSON` is not compatible when TiDB has been upgraded from v4.0.0. As a workaround, users can add `--ignore-stats=true` to the command to skip backing up stats. We need to find a better way to back up stats, with compatibility in mind.
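The workaround would look something like this; the PD address and storage path are placeholders, not values from this report:

```shell
# Illustrative invocation: --ignore-stats=true skips dumping table
# statistics, so the backup can proceed despite the DumpStatsToJSON error.
br backup full \
    --pd "127.0.0.1:2379" \
    --storage "local:///tmp/backup" \
    --ignore-stats=true
```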

3pointer avatar Jan 06 '21 02:01 3pointer

This is a bug in TiDB. Still, such a failure probably shouldn't abort backup/restore, since stats aren't essential.

Also, the dump-stats error should log which table is the cause.

https://github.com/pingcap/br/blob/458d542b19ca8563cac4322888063476cd050c4d/pkg/backup/client.go#L348-L358

kennytm avatar Jan 06 '21 09:01 kennytm

What is the trigger mechanism of the dumpStats failure?

Stats shouldn't cause the backup to fail. If the stats can be backed up and restored, that's good. If BR can't back up stats, it needs to output a backup summary log indicating that backing up stats failed; and if stats are missing when restoring, BR needs to be able to run `ANALYZE TABLE`.

Stats information is stored in the backupmeta. What if the entire cluster is backed up in a large-scale cluster? How do we deal with a huge backupmeta — does that still make sense?

IANTHEREAL avatar Jan 06 '21 10:01 IANTHEREAL

If the stats field is nil during restore, it will simply be skipped.

https://github.com/pingcap/br/blob/458d542b19ca8563cac4322888063476cd050c4d/pkg/restore/client.go#L816-L826

Missing stats won't prevent users from accessing the data; the queries just won't be well optimized. `ANALYZE TABLE` can be performed separately.

kennytm avatar Jan 06 '21 10:01 kennytm

OK, that can serve as a workaround.

For a better user experience, restore should support `ANALYZE TABLE` for tables without stats. Many users may not know about `ANALYZE TABLE`, and restore should be a complete operation: the cluster should be in a good state after restoring.

IANTHEREAL avatar Jan 06 '21 13:01 IANTHEREAL

Running `ANALYZE TABLE` requires scanning the tables again, though, which is slow (especially when restoring from an old backup where the stats are all missing).

kennytm avatar Jan 07 '21 07:01 kennytm

That's true, but `ANALYZE TABLE` is necessary. So we should make sure that backing up statistics succeeds in the vast majority of situations.

IANTHEREAL avatar Jan 07 '21 09:01 IANTHEREAL

As discussed with @kennytm, the consensus is as follows:

  1. The most important thing is that BR ensures a high success rate for backing up stats, so that most statistics can be backed up successfully; the success rate should reach 99.9%.
  2. A table restored without stats can be considered unavailable, and requires human or restore-program intervention.
  3. The `--ignore-stats` option of backup is mainly for testing. If someone backs up with `--ignore-stats`, restore does not need to execute `ANALYZE TABLE` automatically.
  4. BR restore should provide a `--skip-analyze` option whose default value is false.
  5. Backups from TiDB versions < 4.0.9 can skip `ANALYZE TABLE`.
  6. If some table backups do not contain stats information, the user should be warned that restore may take a long time.
  7. If restore fails to execute an `ANALYZE TABLE` statement, the log should prompt the user to run `ANALYZE TABLE` manually, and restore should continue with the other tables.

IANTHEREAL avatar Jan 07 '21 13:01 IANTHEREAL

My preference is:

4. The option should be `--analyze` (default true), which can be set to false by those who really can't wait.
5. No need for a version check; we run `ANALYZE` even when restoring pre-4.0.9 backup archives.
8. We should clearly document this analyze behavior.

kennytm avatar Jan 07 '21 13:01 kennytm

I have no problem with these preferences ⏩

For versions < 4.0.9 I have no special requirements. To restore quickly, we can allow restores of backups from versions below 4.0.9 to skip `ANALYZE TABLE`; that's your choice. But I don't want to see many stats backup failures after v4.0.9, to avoid overly long restore times.

IANTHEREAL avatar Jan 07 '21 14:01 IANTHEREAL

I think we made some wrong decisions before, solving the immediate problem out of inertia. I have a new proposal. The product considerations are: first, users should not need to know about statistics at all; second, we should ensure a high success rate for backing up and restoring statistics.

I suggest not introducing more complicated logic. Since `--ignore-stats` already exists we can temporarily shelve it, but I recommend not introducing `--skip-analyze`, especially since we don't want this parameter in the product long term; it would only be a temporary solution. What we really need is a fast, highly reliable stats backup and restore mechanism, unless we truly can't build one.

The recommended processing logic is as follows:

  1. If backing up statistics fails, the backup fails.
  2. If restoring statistics fails, the restore fails, but it can tell the user that the data has been restored successfully and that they can manually run `ANALYZE TABLE`; we could even support restoring statistics separately.

@kennytm @3pointer

IANTHEREAL avatar Jan 18 '21 02:01 IANTHEREAL

So we'll be changing how we back up stats, and also supporting system tables such as RBAC info and global variables.

  1. We will back up all relevant `mysql.*` tables via normal SST backup. This includes user information, statistics, etc. The schema is first renamed to `tidb_br_temp_mysql.*`. Since `mysql.*` tables are treated like a normal backup, this should never trigger processing logic (1), as they are now equal.

  2. After restoring `tidb_br_temp_mysql.*`, we will use SQL of the form `REPLACE INTO mysql.xxx SELECT ... FROM tidb_br_temp_mysql.xxx WHERE ...` to insert the data back into the real `mysql.*` tables. These SQL statements may fail, but won't impact the overall data, as logic (2) suggests.

To further support logic (2), and support backward/forward compatibility,

  1. BR will never execute `ANALYZE`.
  2. We will add a flag `--system-tables='minimal,privileges,statistics'` to backup and restore to allow skipping part of the system tables where they exist.
  3. Version compatibility:
    • 4.0.8 backup → New restore: If a system table is entirely absent from the backup archive but `--system-tables` specifies it, a non-fatal warning is issued about what should be done. For `--system-tables=statistics` specifically, we will list all tables selected for restore as potential targets of `ANALYZE`.
    • 4.0.9 backup → New restore: The Stats field will just be ignored. BR will ask the user to run `ANALYZE` manually. Also, the `--ignore-stats` flag becomes a no-op.
    • New backup → 4.0.9 restore: We must never back up the system tables under the name `mysql.*` directly, since this would corrupt older clusters that don't know how to handle them. Therefore, during backup we must already rename `mysql.*` to `tidb_br_temp_mysql.*`.
  4. Feature compatibility:
    • We haven't investigated whether incremental backup can handle system tables. There are two things to consider:
      1. incrementally backed-up system tables do not contain complete information;
      2. if a version upgrade happened between last-backupts and backupts, there may be DDLs operating on `mysql.*` specifically. They should either be skipped (the current situation) or renamed.
    • Raw backup/restore is simply irrelevant.
    • Online restore: the restore-SST part is the same as a normal restore. The execute-SQL part will be done directly on the BR node. We may use `REPLACE LOW_PRIORITY INTO …` if it makes any difference. Each group of SQL statements should run inside a transaction. There may be a problem when a system table is too large for a single transaction; that is, without large-transaction support we may have to split those `REPLACE` statements into multiple self-consistent batches, which could complicate the logic quite a lot.
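The last point — splitting one large `REPLACE` into self-consistent batches — might be sketched like this; the helper and batch size are made up for illustration (not BR's actual code), with each resulting range intended to run as its own statement inside its own transaction:

```go
package main

import "fmt"

// batchRanges splits total rows into half-open [start, end) ranges of at
// most size rows each, so every batched REPLACE INTO can run in its own
// transaction and no single statement exceeds the transaction size limit.
// (The helper and batch size are illustrative only.)
func batchRanges(total, size int) [][2]int {
	var out [][2]int
	for start := 0; start < total; start += size {
		end := start + size
		if end > total {
			end = total
		}
		out = append(out, [2]int{start, end})
	}
	return out
}

func main() {
	// e.g. 10 rows in batches of 4 → three ranges: [0,4) [4,8) [8,10)
	for _, r := range batchRanges(10, 4) {
		fmt.Printf("[%d,%d) ", r[0], r[1])
	}
	fmt.Println()
}
```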

kennytm avatar Jan 19 '21 04:01 kennytm

  1. `mysql.user` may contain account information that the customer has forgotten.
  2. TiDB allows a customer to create their own tables in the `mysql` schema.

I suggest we back up the system tables, and let users opt out when restoring.

@kennytm @IANTHEREAL

SunRunAway avatar Apr 21 '21 00:04 SunRunAway

The second reason is particularly critical.

As a matter of principle, we should back up everything and restore selectively, instead of backing up selectively.

SunRunAway avatar Apr 21 '21 00:04 SunRunAway