docs Feedback: SQL Performance Best Practices - incorrect info on normalization and denormalization pros and cons

Steven Hand (hand-crdb) commented:

Page: https://cockroachlabs.com/docs/v22.1/performance-best-practices-overview.html

What is the reason for your feedback?

[ ] Missing the information I need

[ ] Too complicated

[ ] Out of date

[ ] Something is broken

[x] Other

Additional details

The page on SQL Performance Best Practices for v22.1 has incorrect information on the pros and cons of greater normalization. It says:

Place parts of the records that are modified by different transactions in different tables. That is, increase normalization. However, there are benefits and drawbacks to increasing normalization.

Benefits:

Allows separate transactions to modify related underlying data without causing contention.

Can improve performance for read-heavy workloads.

Drawbacks:

More complex data model.

Increases the chance of data inconsistency.

Increases data redundancy.

Can degrade performance for write-heavy workloads.

I think this discussion gets some things backwards:

Under "Benefits", while it says that increasing normalization "Can improve performance for read-heavy workloads", increasing normalization more often reduces performance for read-heavy workloads. This is because increasing normalization results in more joins and can make the SQL more complicated in other ways. While less normalization generally uses more disk space, increases redundancy, and increases the risk of data inconsistency, it generally speeds reads. This is why less normalization is a standard technique for reporting databases. That said, I would agree that increasing normalization can improve performance for some reads, but that is generally limited to situations in which the data being queried comes from a single table, and that table has fewer columns as a consequence of increased normalization.
Under "Drawbacks", while it says that increasing normalization "Increases the chance of data inconsistency", increasing normalization actually reduces the chance of data inconsistency. This is because a goal of increasing normalization is to store a given business fact in one and only one place, so that it is impossible for multiple copies of the same business fact to become inconsistent with each other.
Under "Drawbacks", while it says that increasing normalization "Increases data redundancy", the reverse is actually true. As mentioned above, a goal of increasing normalization is to reduce or eliminate data redundancy.
Under "Drawbacks", while it says that increasing normalization "Can degrade performance for write-heavy workloads", the reverse is more often true. This is because, with increased normalization, a given business fact needs to be written to one and only one place instead of having to be written to multiple places. Another benefit of increasing normalization for write-heavy workloads is that the rows being written generally have fewer columns. (This is a complicated topic, because increased normalization also increases the number of foreign keys, and thus increases the work needed to validate foreign key constraints. This validation can be more expensive in a distributed database than in a monolithic database. Similarly, increased normalization increases the number of tables, so a storing a given business fact may involve writes to more tables (even though those writes may be smaller), while decreased normalization may involve writes to fewer tables (though those writes may be larger). So performance testing early in the database schema design process is a best practice.)

As generalizations,

More normalization is generally recommended for write-intensive and read/write-intensive transactional workloads.
Less normalization is generally recommended for read-intensive reporting workloads.

Jira Issue: DOC-5447

Aug 23 '22 01:08 hand-crdb

shannonb (shannonbradshaw) commented: Kevin Ngo would it make sense to include these changes as part of the work for https://cockroachlabs.atlassian.net/browse/DOC-6652

Feb 23 '23 14:02 exalate-issue-sync[bot]

shannonb (shannonbradshaw) commented: Ryan Kuo thanks for picking this up and adding to the March sprint.

Feb 24 '23 13:02 exalate-issue-sync[bot]

docs docs copied to clipboard

Feedback: SQL Performance Best Practices - incorrect info on normalization and denormalization pros and cons

What is the reason for your feedback?

Additional details

docs
docs copied to clipboard