
Unexpectedly bad performance

Open Spindel opened this issue 7 years ago • 8 comments

cstore_fdw 1.6.0 on postgresql-10

So, I've got a lot of time-series data in a table that looks like this:

spider=> create foreign table history_y2017 (
    itemid bigint not null,
    clock integer not null,
    value numeric(20,0) not null,
    ns integer default 0 not null,
    constraint check_clock CHECK ( clock >=  1483228800 AND clock < 1514764800)
)
SERVER cstore_server OPTIONS(compression 'pglz');
insert into history_y2017 select * from history where clock >= 1483228800 AND clock < 1514764800 order by itemid,clock,ns;
analyze history_y2017;
 spider=> explain analyze verbose select * from history_y2017 where itemid=1464 and clock > 1497582092 and clock <= 1497668492;
                                                              QUERY PLAN                                                             
 ------------------------------------------------------------------------------------------------------------------------------------
  Foreign Scan on public.history_y2017  (cost=0.00..12087188.56 rows=851 width=20) (actual time=995.497..1276.525 rows=1440 loops=1)
    Output: itemid, clock, value, ns
    Filter: ((history_y2017.clock > 1497582092) AND (history_y2017.clock <= 1497668492) AND (history_y2017.itemid = 1464))
    Rows Removed by Filter: 28560
    CStore File: /var/lib/pgsql/10/data/cstore_fdw/16385/17191
    CStore File Size: 3549773603
  Planning time: 0.408 ms
  Execution time: 1278.834 ms
 

Benchmarking it with a select-only benchmark gives me ~3 tps...

Changing block_row_count from the default to '1000' only makes matters four times worse.
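
For reference, the option is set at table-creation time; this is roughly what the '1000' variant looked like (a minimal sketch with an illustrative table name; block_row_count defaults to 10000 in cstore_fdw):

create foreign table history_y2017_b1000 (
    itemid bigint not null,
    clock integer not null,
    value numeric(20,0) not null,
    ns integer default 0 not null
)
SERVER cstore_server
OPTIONS (compression 'pglz', block_row_count '1000');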

Spindel avatar Jan 04 '18 15:01 Spindel

Some slowdown is expected with such a low block row count: reducing it causes more disk accesses when writing and reading.

Why do you need to change this?

mtuncer avatar Jan 04 '18 15:01 mtuncer

That was an experiment; I'm trying to get this query to at least approximate the time it takes on a normal table with a BRIN index.

Perhaps the question I should ask is this:

  • A normal table partition with a BRIN index (pages_per_range = 16, sketched below) on this dataset gives ~130 tps. Not too great, but acceptable.
  • The same data in cstore_fdw (default settings) gives 3.1 tps.

So, is my data layout "wrong" for this problem/query style? Should I simply look at some other solution for efficiently storing time-series data, or is there something I have missed in my data layout (sorting, etc.) that would improve this?
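
For context, the row-store baseline mentioned above looks roughly like this (a minimal sketch; table and index names are illustrative, and indexing (itemid, clock) is my assumption based on the query shape):

create table history_y2017_heap (
    itemid bigint not null,
    clock integer not null,
    value numeric(20,0) not null,
    ns integer default 0 not null
);
-- BRIN stores min/max summaries per range of heap pages; it only helps
-- when the physical row order correlates with the indexed columns.
create index history_y2017_heap_brin
    on history_y2017_heap
    using brin (itemid, clock) with (pages_per_range = 16);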

Spindel avatar Jan 04 '18 15:01 Spindel

That table is only about 3.3 GB, correct? For us, the performance benefit comes into play when you have a really big table (at least bigger than available RAM), and when you have a wide table where you are only interested in a few columns. Column-store tables perform better when you start hitting disk really hard.

Another thought: the table is mostly numeric, and pglz may not compress it enough to offset the CPU required to perform the compression. You may try it without compression.
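
A minimal way to test that (a sketch; the table name is illustrative, and 'none' is cstore_fdw's default compression setting):

create foreign table history_y2017_nocomp (
    itemid bigint not null,
    clock integer not null,
    value numeric(20,0) not null,
    ns integer default 0 not null
)
SERVER cstore_server
OPTIONS (compression 'none');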

HTH, Adam

acscott avatar Jan 04 '18 16:01 acscott

The data table is approximately 25 GB uncompressed and compresses down to a rather respectable 3-5 GB. This is a small subset: one year of data from 2000 sensors at variable time slices.

Basically, we have a "hot" section of data (the last few weeks/months) and then colder storage data, with different datatypes in different tables, etc. Currently I'm measuring integers, since those are the most common (by count) and the largest (by total size).

Spindel avatar Jan 04 '18 16:01 Spindel

That is really good compression. At ~3 GB compressed, all of the data can fit into the file system cache, so your bottleneck is probably at the CPU level. Watching top while running the query should confirm my guess.

If this is the case, you may consider increasing stripe_row_count for further experimenting. But I would suggest you need bigger data to experiment with before you start to see performance benefits; I believe it is when you become disk bound that you will see the biggest bang.
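
For example (a sketch; the table name is illustrative, and stripe_row_count defaults to 150000 in cstore_fdw):

create foreign table history_y2017_bigstripe (
    itemid bigint not null,
    clock integer not null,
    value numeric(20,0) not null,
    ns integer default 0 not null
)
SERVER cstore_server
OPTIONS (compression 'pglz', stripe_row_count '500000');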

HTH, Adam

acscott avatar Jan 04 '18 17:01 acscott

Well, we are already disk-IO bound in production, and hitting massive slowdowns there, because a single seek will blow much of the dcache, etc.

However, the current comparison for cstore when it comes to compression on this dataset is a btrfs partition with compression enabled plus a BRIN index (pages_per_range = 16).

This combination uses ~4% more disk than cstore for the same dataset, while being approximately 700x faster for queries.

It's that difference that's really causing me some confusion on the performance here.

Spindel avatar Jan 04 '18 17:01 Spindel

So, testing shows that for my workload:

  • Order by the most common selector when inserting (itemid, clock) rather than the natural insert order (clock, itemid).

This reduces the rows removed by the filter (for a 24-hour query) from 1828560 to 28560.

This change, together with increasing stripe_row_count to 500000, brings the query time down to <0.3 seconds; a concrete rebuild sketch follows.
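
The rebuild looked roughly like this (a sketch; assuming the history_y2017_bigstripe table from the earlier example, with stripe_row_count '500000', and the same source table as in the first comment):

insert into history_y2017_bigstripe
select * from history
where clock >= 1483228800 and clock < 1514764800
-- itemid first, so cstore_fdw's per-block min/max ranges become
-- selective for itemid filters and blocks can be skipped
order by itemid, clock, ns;
analyze history_y2017_bigstripe;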

Spindel avatar Jan 04 '18 18:01 Spindel

So, summarizing the tests:

type         data   compressed   block_row_count   stripe_row_count   sorting                        select-tps
-----------  -----  -----------  ----------------  -----------------  -----------------------------  ----------
cstore_fdw   75GB   9.4GB        50000             5000000            itemid,clock                   11.29
btrfs+brin   75GB   11.9GB       --                --                 itemid,clock (insert order)    166.52
xfs+brin     75GB   75GB         --                --                 itemid,clock                   187

Benchmark ran (pgbench custom script):

\set itemid random(1, 5000)
\set stop random(1445802188, 1514922128)
\set interval random(1, 24)
\set start :stop - (3600 * :interval)
\set peek 1514828632
SELECT clock,ns,value from history where itemid = :itemid and clock > :start and clock <= :stop;
SELECT clock,ns,value from history where itemid = :itemid and clock > :peek  order by clock desc limit 1;

So, is there something obvious other than stripe_row_count that I've missed? I ran tests with it at 750000 as well, but there was no discernible difference other than slightly better compression.

Bouncing block_row_count up to 50k increased select performance but let more rows through to the filters, so I think that's a mis-optimization for this workload (we don't usually paginate more than 24-hour blocks due to current design issues).

These are still synthetic benchmarks, on datasets that aren't on the same scale as our production sets, where a single query of this type takes > 60 seconds and completely thrashes the cache, binding up the system with disk page-in/out.

And honestly, since cstore_fdw is available for us without a kernel upgrade, that may be a more useful step anyhow, but I'm still trying to figure out why the performance is below where I'd expect it to be.

Spindel avatar Jan 05 '18 15:01 Spindel