rum icon indicating copy to clipboard operation
rum copied to clipboard

Wrong results returned from RUM index with order_by_attach=TRUE

Open michaelwu0505 opened this issue 5 years ago • 0 comments

Below are two cases showing wrong results returned from RUM index with order_by_attach=TRUE. In both test cases, if order_by_attach=FALSE, correct results will be returned.

Tested with PostgreSQL 12.2 & newest commit e34375aa36692003cc3a3bc7ec84c252831128c7.

CASE 1:

CREATE TABLE test (
	id bigint NOT NULL,
	folder bigint NOT NULL,
	time bigint NOT NULL,
	tsv tsvector NOT NULL
)
INSERT INTO test (id, folder, time, tsv) VALUES (1, 10, 100, to_tsvector('wordA'));
INSERT INTO test (id, folder, time, tsv) VALUES (2, 20, 200, to_tsvector('wordB'));
INSERT INTO test (id, folder, time, tsv) VALUES (3, 10, 300, to_tsvector('wordA'));
INSERT INTO test (id, folder, time, tsv) VALUES (4, 20, 400, to_tsvector('wordB'));

Below shows expected results when select without index:

SELECT * FROM test WHERE tsv @@ (to_tsquery('wordA')) AND (folder = 10::bigint)

Returns rows with id 1 and 3.

SELECT * FROM test WHERE tsv @@ (to_tsquery('wordA')) AND (folder = 10::bigint) ORDER BY time <=| 500::bigint

Returns rows with id 3 and 1.

After creating the following index, wrong results will be returned:

CREATE INDEX test_idx ON test USING rum(folder, tsv rum_tsvector_addon_ops, time) with (attach = 'time', to = 'tsv', order_by_attach=TRUE);
SET enable_seqscan = OFF;
SELECT * FROM test WHERE tsv @@ (to_tsquery('wordA')) AND (folder = 10::bigint)

Wrong result: Returns only row with id 1. (Expects 1 and 3)

SET enable_seqscan = OFF;
SELECT * FROM test WHERE tsv @@ (to_tsquery('wordA')) AND (folder = 10::bigint) ORDER BY time <=| 500::bigint

Wrong result: Returns nothing. (Expects rows with id 3 and 1)

CASE 2:

CREATE TABLE test2 (
	id bigint NOT NULL,
	time bigint NOT NULL,
	tsv tsvector NOT NULL
)
CREATE OR REPLACE PROCEDURE test2_init()
AS $$
DECLARE
	counter INTEGER := 1; 
	str TEXT;
	time BIGINT;
BEGIN
	WHILE counter <= 1000 LOOP
	
		IF counter % 10 = 0 THEN
			str := 'wordA wordB';
		ELSEIF counter % 11 = 0 THEN
			str := 'wordA wordB wordC';
		ELSE
			str := 'wordA wordD';
		END IF;
		
		-- insert rows with alternating time
		IF counter % 2 = 0 THEN
			time := counter;
		ELSE
			time := -counter;
		END IF;
		
		
		INSERT INTO test2 (id, time, tsv) VALUES (counter, time, to_tsvector(str));
		
		counter := counter + 1;
	END LOOP;
END
$$
LANGUAGE plpgsql;

CALL test2_init();

Below shows expected results when select without index:

SELECT * from test2 where tsv @@ (to_tsquery('wordA & wordB')) ORDER BY time <=| 1001::bigint 

Returned 181 rows.

After creating the following index, wrong results will be returned:

CREATE INDEX test2_idx ON test2 USING rum(tsv rum_tsvector_addon_ops, time) with (attach = 'time', to = 'tsv', order_by_attach=TRUE);
SET enable_seqscan = OFF;
SELECT * from test2 where tsv @@ (to_tsquery('wordA & wordB')) ORDER BY time <=| 1001::bigint 

Wrong result: Returns only 153 rows. (Expects 181 rows)

If the above query changes from ORDER BY time <=| 1001::bigint to ORDER BY <=> 1001::bigint, then correct number of rows will be returned.

michaelwu0505 avatar Feb 22 '20 12:02 michaelwu0505