intelmq icon indicating copy to clipboard operation
intelmq copied to clipboard

output/postgresql allow for faster queries in large databases

Open bernhardreiter opened this issue 4 years ago • 6 comments

As described in https://github.com/Intevation/intelmq-fody-backend/issues/8 some events tables grow so big in PostgreSQL that fast queries are not possible with simple indexes that are created by intelmq/bin/intelmq_psql_initdb.py.

Considerations

Want substring and case-insensitivity for string values

Some queries would like to be case insensitive and may also include a substring search for usability reasons. See https://github.com/Intevation/intelmq-fody-backend/issues/26#issuecomment-600728146

Technically this could be made faster by a using a lookup table for the string values, so that the events table itself only holds the a number as index in the lookup table. Doing substring and case-insensitive searches first go into the lookup table and then only a group of integers will be used for the real table.

This would only be worth while if using an integer is an advantage and if the number of unique strings is not high.

How to implement a lookup table

It is possible to chance the postgresql output bot to use a different schema, but a lookup table can also be implemented in the database itself, so that triggers which take care of the necessary indirection and a virtual table can be kept that behaves like the full table.

bernhardreiter avatar Apr 22 '20 13:04 bernhardreiter

You mean normalizing the events table? Make sense, I was also considering this some time ago. That is an easier change to the postgresql output.

However, I also highly recommend looking at something which can easily rotate out data based on the timestamp fields. Especially w.r.t to GDPR.

Will add it to the architecture 3.0 list (as many other good ideas). Thanks!

aaronkaplan avatar May 22 '20 11:05 aaronkaplan

You mean normalizing the events table?

We are not sure what a good, general improvement towards speed could be. We are still in the process of analysing the situation. The installation we are looking at has about 830M entries in the events table and some request are quite slow.

bernhardreiter avatar May 25 '20 08:05 bernhardreiter

Could you tell me which queries are slow?

aaronkaplan avatar May 25 '20 12:05 aaronkaplan

See https://github.com/Intevation/intelmq-fody-backend/issues/8 for some hints upon the queries, Most queries start like the fody Stats page does it, with a range for time.observation if then one or two additional restrictions come in, a query may take several minutes (which is too slow). However it depends a lot if the index is already loaded. As written before, we still have to analyse to understand more deeply what keeps performance back.

bernhardreiter avatar May 25 '20 14:05 bernhardreiter

I added this to the intelmq 3.0 architecture change list.

aaronkaplan avatar Sep 08 '20 17:09 aaronkaplan

@aaronkaplan You are assigned to this issue, are you working on this one or planning so?

ghost avatar Jun 17 '21 19:06 ghost