database_consistency
database_consistency copied to clipboard
Recommend citext for `case_sensitive: false` validated fields
Does it make sense to do so?
https://www.postgresql.org/docs/13/citext.html has a good rationale for using citext. It's a useful feature for columns like, e.g. emails.
Regarding performance - should be faster compared to LOWER() indexes:
citext is not as efficient as text because the operator functions and the B-tree comparison functions must make copies of the data and convert it to lower case for comparisons. Also, only text can support B-Tree deduplication. However, citext is slightly more efficient than using lower to get case-insensitive matching.
:question: How to detect when a column is a good candidate for citext?
Hey @pirj,
I think that citext is a good and interesting type that should be used carefully as it has some benefits and drawbacks. Currently, I don't have any ideas how we can catch when it's better to use it instead text right now.
The only thing we can do right now is to throw an error if an index LOWER(citext) was created but maybe it isn't allowed by PostgreSQL already.
Do you have any ideas what we can do here?
It feels that there needs to be two checks:
- Check if there's a
case_sensitive: falseuniqueness validation on the model, and check if either the column type iscitextor there's a unique index with aLOWER()on this column. Fail the check otherwise. - If there is a
LOWER()unique index, and suggest usingcitextinstead. But this kind of goes against the responsibility of consistency checker, it's more like a suggestion for more extensive and effective Postgres usage.
Hi @pirj ,
Sorry for taking so long time to finally come back to this.
Regarding point #1. I think we should not do that based on the conclusion here: https://github.com/djezzzl/database_consistency/issues/
About point #2. I like the suggestion. I think this should be implemented as a separate checker. Do you think we should consider only unique indexes? I believe any column that is wrapped with LOWER() is assumed to be queried with LOWER() sometimes, so we may say that it's preferable to have citext anyway.
The only reason to not have citext is when you need both case-sensitive and case-insensitive comparisons. Then we may assume that if there is a LOWER() index and no simple index, we may suggest citext.
Hey @djezzzl !
All good points.
We should check non-unique indexes, too.
According to the latest Posrgres docs, citext is not a panacea, and to specifically make both case-sensitive and case-insensitive queries:
Consider using nondeterministic collations (see Section 24.2.2.4) instead of this module. They can be used for case-insensitive comparisons, accent-insensitive comparisons, and other combinations, and they handle more Unicode special cases correctly.
I recall there are case-insensitive per-column collations in MySQL, too.
I was researching the promotion to use citext. Unfortunately, I couldn't prove it's worth that.
These are my findings:
CREATE EXTENSION citext;
CREATE TABLE text_table(id bigserial primary key, email text, full_name text);
CREATE TABLE citext_table(id bigserial primary key, email citext, full_name text);
INSERT INTO text_table(email) (SELECT MD5(random()::text) FROM generate_series(1,1000000));
INSERT INTO citext_table(email) (SELECT email FROM text_table);
CREATE INDEX text_lower_index ON text_table(LOWER(email));
CREATE INDEX citext_index ON citext_table(email);
ANALYZE text_table;
ANALYZE citext_table;
EXPLAIN ANALYZE SELECT * FROM text_table WHERE LOWER(email) = LOWER('26BB868a515b4B19fe020092339256ca');
-- Index Scan using text_lower_index on text_table (cost=0.42..8.44 rows=1 width=73) (actual time=0.017..0.017 rows=0 loops=1)
-- Index Cond: (lower(email) = '26bb868a515b4b19fe020092339256ca'::text)
-- Planning Time: 0.063 ms
-- Execution Time: 0.027 ms
EXPLAIN ANALYZE SELECT * FROM citext_table WHERE email = '26BB868a515b4B19fe020092339256ca';
-- Index Scan using citext_index on citext_table (cost=0.42..8.44 rows=1 width=73) (actual time=0.043..0.043 rows=0 loops=1)
-- Index Cond: (email = '26BB868a515b4B19fe020092339256ca'::citext)
-- Planning Time: 0.045 ms
-- Execution Time: 0.055 ms
-- I would consider querying is the same.
SELECT pg_size_pretty(pg_relation_size('citext_index')), pg_size_pretty(pg_relation_size('text_lower_index'));
-- both are 56 MB
SELECT pg_size_pretty(pg_table_size('text_table')), pg_size_pretty(pg_table_size('citext_table'));
-- both are 73 MB
I know it's also about simpler querying (no need to specify LOWER) and better comparison for some specific characters but IMHO it isn't enough to promote citext over LOWER().
What do you think?
P.S. Meanwhile, I will do a check to ensure case_sensitive: false for citext fields on uniqueness validations.
P.P.S I'm closing the issue for now. Feel free to reopen it if needed.
Postgres docs mention that the performance difference is barely noticeable.
One reason for recommending I can think of is to have the right index in place without the need to remember it has to include ‘lower()’. But this can be a job for another (existing?) checker without the need to change the column type.
One reason for recommending I can think of is to have the right index in place without the need to remember it has to include ‘lower()’. But this can be a job for another (existing?) checker without the need to change the column type.
I'm not sure I get this.
We have a checker if there is a uniqueness validation with case_sensitive: false, we will ensure the index has lower.