Vertica-Extension-Packages icon indicating copy to clipboard operation
Vertica-Extension-Packages copied to clipboard

compatlib_function - ConnectBy supports VARCHAR inputs

Open h-serizawa opened this issue 2 years ago • 3 comments

Currently, ConnectBy supports INTEGER inputs as parent ID and child ID. Now, it supports VARCHAR inputs as well. Close #69.

h-serizawa avatar Mar 24 '22 07:03 h-serizawa

[Test Data]

=> CREATE TABLE company (id INT, supervisor_id INT, name VARCHAR(20));
=> INSERT INTO company VALUES (1, null, 'Patrick');
=> INSERT INTO company VALUES (2, 1, 'Jim');
=> INSERT INTO company VALUES (3, 1, 'Sandy');
=> INSERT INTO company VALUES (4, 3, 'Brian');
=> INSERT INTO company VALUES (4, 3, 'Otto');
=> COMMIT;

[Test Result with INTEGER inputs]

=> SELECT connect_by_path(supervisor_id, id, name, ' >> ') OVER () FROM company;

 identifier | depth |           path
------------+-------+--------------------------
          1 |     0 | Patrick
          2 |     1 | Patrick >> Jim
          3 |     1 | Patrick >> Sandy
          4 |     2 | Patrick >> Sandy >> Otto
(4 rows)

[Test Result with VARCHAR inputs]

=> SELECT connect_by_path(supervisor_id::VARCHAR, id::VARCHAR, name, ' >> ') OVER () FROM company;

 identifier | depth |           path
------------+-------+--------------------------
 1          |     0 | Patrick
 2          |     1 | Patrick >> Jim
 3          |     1 | Patrick >> Sandy
 4          |     2 | Patrick >> Sandy >> Otto
(4 rows)

h-serizawa avatar Mar 25 '22 00:03 h-serizawa

Thank you!

I have one reservation.

You give two factories: ConnectByIntFactory and ConnectByVarcharFactory to provide two interfaces to the same function (by reading ints as strings). This means that someone using ints will pay a conversion price to strings. In addition, all the map operations will be done on string keys and not int keys, which I suspect is less efficient.

If one's tables are small, this is probably not such a big concern, but Vertica is used with tables that have billions of rows.

Given that many (possibly most) uses of this function will be using integer keys from a dimension table, I think it might be better to have separate ConnectByInt and ConnectByVarchar implementations which only do conversions when needed.

I wonder if you could implement a template that does the algorithmic work in processPartition to avoid duplication?

Does that seem reasonable to you?

dmankins avatar Apr 01 '22 16:04 dmankins

@dmankins Thank you for your valuable feedback! Yes, the data conversion from int to string will be a performance concern. I removed the conversion using the template. Can you review this request again?

h-serizawa avatar Apr 07 '22 07:04 h-serizawa