database_consistency icon indicating copy to clipboard operation
database_consistency copied to clipboard

Recommend citext for `case_sensitive: false` validated fields

Open pirj opened this issue 4 years ago • 2 comments

Does it make sense to do so?

https://www.postgresql.org/docs/13/citext.html has a good rationale for using citext. It's a useful feature for columns like, e.g. emails.

Regarding performance - should be faster compared to LOWER() indexes:

citext is not as efficient as text because the operator functions and the B-tree comparison functions must make copies of the data and convert it to lower case for comparisons. Also, only text can support B-Tree deduplication. However, citext is slightly more efficient than using lower to get case-insensitive matching.

:question: How to detect when a column is a good candidate for citext?

pirj avatar Oct 28 '21 14:10 pirj

Hey @pirj,

I think that citext is a good and interesting type that should be used carefully as it has some benefits and drawbacks. Currently, I don't have any ideas how we can catch when it's better to use it instead text right now.

The only thing we can do right now is to throw an error if an index LOWER(citext) was created but maybe it isn't allowed by PostgreSQL already.

Do you have any ideas what we can do here?

djezzzl avatar Nov 01 '21 12:11 djezzzl

It feels that there needs to be two checks:

  1. Check if there's a case_sensitive: false uniqueness validation on the model, and check if either the column type is citext or there's a unique index with a LOWER() on this column. Fail the check otherwise.
  2. If there is a LOWER() unique index, and suggest using citext instead. But this kind of goes against the responsibility of consistency checker, it's more like a suggestion for more extensive and effective Postgres usage.

pirj avatar Nov 01 '21 15:11 pirj

Hi @pirj ,

Sorry for taking so long time to finally come back to this.

Regarding point #1. I think we should not do that based on the conclusion here: https://github.com/djezzzl/database_consistency/issues/

About point #2. I like the suggestion. I think this should be implemented as a separate checker. Do you think we should consider only unique indexes? I believe any column that is wrapped with LOWER() is assumed to be queried with LOWER() sometimes, so we may say that it's preferable to have citext anyway.

The only reason to not have citext is when you need both case-sensitive and case-insensitive comparisons. Then we may assume that if there is a LOWER() index and no simple index, we may suggest citext.

djezzzl avatar Nov 20 '22 13:11 djezzzl

Hey @djezzzl !

All good points.

We should check non-unique indexes, too.

According to the latest Posrgres docs, citext is not a panacea, and to specifically make both case-sensitive and case-insensitive queries:

Consider using nondeterministic collations (see Section 24.2.2.4) instead of this module. They can be used for case-insensitive comparisons, accent-insensitive comparisons, and other combinations, and they handle more Unicode special cases correctly.

I recall there are case-insensitive per-column collations in MySQL, too.

pirj avatar Nov 20 '22 20:11 pirj

I was researching the promotion to use citext. Unfortunately, I couldn't prove it's worth that.

These are my findings:

CREATE EXTENSION citext;

CREATE TABLE text_table(id bigserial primary key, email text, full_name text);
CREATE TABLE citext_table(id bigserial primary key, email citext, full_name text);

INSERT INTO text_table(email) (SELECT MD5(random()::text) FROM generate_series(1,1000000));
INSERT INTO citext_table(email) (SELECT email FROM text_table);

CREATE INDEX text_lower_index ON text_table(LOWER(email));
CREATE INDEX citext_index ON citext_table(email);

ANALYZE text_table;
ANALYZE citext_table;

EXPLAIN ANALYZE SELECT * FROM text_table WHERE LOWER(email) = LOWER('26BB868a515b4B19fe020092339256ca');
-- Index Scan using text_lower_index on text_table  (cost=0.42..8.44 rows=1 width=73) (actual time=0.017..0.017 rows=0 loops=1)
--   Index Cond: (lower(email) = '26bb868a515b4b19fe020092339256ca'::text)
-- Planning Time: 0.063 ms
-- Execution Time: 0.027 ms

EXPLAIN ANALYZE SELECT * FROM citext_table WHERE email = '26BB868a515b4B19fe020092339256ca';
-- Index Scan using citext_index on citext_table  (cost=0.42..8.44 rows=1 width=73) (actual time=0.043..0.043 rows=0 loops=1)
--   Index Cond: (email = '26BB868a515b4B19fe020092339256ca'::citext)
-- Planning Time: 0.045 ms
-- Execution Time: 0.055 ms

-- I would consider querying is the same.

SELECT pg_size_pretty(pg_relation_size('citext_index')), pg_size_pretty(pg_relation_size('text_lower_index'));
-- both are 56 MB

SELECT pg_size_pretty(pg_table_size('text_table')), pg_size_pretty(pg_table_size('citext_table'));
-- both are 73 MB

I know it's also about simpler querying (no need to specify LOWER) and better comparison for some specific characters but IMHO it isn't enough to promote citext over LOWER().

What do you think?

P.S. Meanwhile, I will do a check to ensure case_sensitive: false for citext fields on uniqueness validations.

P.P.S I'm closing the issue for now. Feel free to reopen it if needed.

djezzzl avatar Dec 03 '22 14:12 djezzzl

Postgres docs mention that the performance difference is barely noticeable.

One reason for recommending I can think of is to have the right index in place without the need to remember it has to include ‘lower()’. But this can be a job for another (existing?) checker without the need to change the column type.

pirj avatar Dec 03 '22 18:12 pirj

One reason for recommending I can think of is to have the right index in place without the need to remember it has to include ‘lower()’. But this can be a job for another (existing?) checker without the need to change the column type.

I'm not sure I get this.

We have a checker if there is a uniqueness validation with case_sensitive: false, we will ensure the index has lower.

djezzzl avatar Dec 04 '22 08:12 djezzzl