h2database Representation of rows

H2 uses the following ways to represent a row:

Value[] (almost everywhere).
SearchRow (indexes); it has different implementations to save some memory.
Row (tables, triggers, DML); it has only one implementation, which is inefficient for short rows, but the MVStore backend and some DML commands use its underlying array, so we currently can't replace it with optimized implementations.
ValueArray (as a data type and in MVStore).
ValueRow (as data type of some expressions and in temporary results).

I think the Value[] representation is not modified anywhere after initialization. SearchRow and Row are mutable, so they can't be cached. ValueArray and ValueRow are immutable by design (actually they can be modified, but H2 should never do that).
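
The caveat about objects that are immutable only by convention can be illustrated with a minimal sketch. `ConventionallyImmutableRow`, `getList()` and `copyOfList()` are hypothetical stand-ins written for this example, not H2 classes; the only point is that handing out the backing array lets any caller modify a supposedly immutable object.

```java
import java.util.Arrays;

// Hypothetical stand-in for an "immutable by design" row value (not an H2 class).
final class ConventionallyImmutableRow {
    private final Object[] values;

    ConventionallyImmutableRow(Object... values) {
        this.values = values; // no defensive copy: the caller's array is shared
    }

    // Exposes the backing array itself, so callers are able to modify it.
    Object[] getList() {
        return values;
    }

    // Safer alternative: hand out a copy, at the cost of one allocation per call.
    Object[] copyOfList() {
        return Arrays.copyOf(values, values.length);
    }
}
```

Writing to the array returned by `getList()` changes the row for every holder of the object, which is why caching such objects is only safe as long as nobody does that.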

ValueArray in MVStore is a problematic part. It really should have elements of the same data type when it is used as a normal data type, but MVStore needs elements of different types. @andreitokar Maybe you have some ideas on how to remove the usage of ValueArray for rows from MVStore without breaking compatibility with older databases?

We have too many representations of rows from my point of view.

Nov 06 '19 01:11 katzyn

We are a database, so it's natural to have multiple kinds of Row specialised for different tasks.

SearchRow, for example, needs to support "not searching for this" column entries.
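
A minimal sketch of what "not searching for this" entries could look like. `SparseSearchRow` is a hypothetical class for illustration, not the H2 SearchRow API, and it assumes that a Java null slot means "column not part of the search"; a SQL NULL would need a distinct marker object.

```java
// Hypothetical sketch of a search row (not the H2 SearchRow API).
final class SparseSearchRow {
    // A null slot means "not searching for this column" in this sketch.
    private final Object[] values;

    SparseSearchRow(int columnCount) {
        values = new Object[columnCount];
    }

    void setValue(int column, Object value) {
        values[column] = value;
    }

    // A candidate row matches when every searched column is equal;
    // columns that were never set are skipped.
    boolean matches(Object[] row) {
        for (int i = 0; i < values.length; i++) {
            if (values[i] != null && !values[i].equals(row[i])) {
                return false;
            }
        }
        return true;
    }
}
```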

But certainly, if we can simplify the current situation, that is a good thing.

Nov 06 '19 06:11 grandinj

I meant that we shouldn't use objects like ValueArray for rows.

I tried to replace ValueArray in primary and secondary indexes with Value[] and with ValueRow using a custom backward-compatible DataType. Read operations from their underlying maps work fine, but there is some obscure problem in the undo log during writes that results in exceptions like

An old transaction with the same id is still open: 1 [1.4.200/102]

Nov 06 '19 06:11 katzyn

I used the following code in that experiment:

public class MVIndexValueDataType extends ValueDataType {

    public MVIndexValueDataType(Database database, int[] sortTypes) {
        super(database, sortTypes);
    }

    @Override
    public int compare(Object a, Object b) {
        if (a == b) {
            return 0;
        }
        return compareArrays((Value[]) a, (Value[]) b);
    }

    @Override
    public int getMemory(Object obj) {
        Value[] values = (Value[]) obj;
        int memory = Constants.MEMORY_ARRAY + Constants.MEMORY_POINTER * values.length;
        for (Value v : values) {
            memory += v.getMemory();
        }
        return memory;
    }

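    @Override
    public Object read(ByteBuffer buff) {
        // The leading type tag is kept so that the stored format stays the same
        // as for rows that were written as ValueArray.
        int type = buff.get() & 255;
        if (type != ARRAY) {
            throw DbException.get(ErrorCode.FILE_CORRUPTED_1, "type: " + type);
        }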
        int len = readVarInt(buff);
        Value[] list = new Value[len];
        for (int i = 0; i < len; i++) {
            list[i] = (Value) readValue(buff);
        }
        return list;
    }

    @Override
    public void write(WriteBuffer buff, Object obj) {
        Value[] list = (Value[]) obj;
        buff.put(ARRAY).putVarInt(list.length);
        for (Value x : list) {
            writeValue(buff, x);
        }
    }

}

compareArrays() was extracted from the middle of the compare() method in the superclass. This version is for Value[]; the version for ValueRow was very similar, with the obvious differences.
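
The layout used by read() and write() above (a type tag, a variable-length element count, then the elements) can be tried out in isolation with a simplified sketch. `TaggedRowCodec` is hypothetical: the elements are plain long values rather than H2 Value objects, the tag value is chosen only for this example, and the varint helpers are a generic 7-bits-per-byte encoding written here, not H2's own utilities.

```java
import java.nio.ByteBuffer;

// Simplified sketch of a tagged, length-prefixed row layout (not H2's codec).
final class TaggedRowCodec {
    static final byte ARRAY_TAG = 17; // tag value assumed for this sketch

    static void write(ByteBuffer buff, long[] row) {
        buff.put(ARRAY_TAG);
        writeVarInt(buff, row.length);
        for (long v : row) {
            buff.putLong(v);
        }
    }

    static long[] read(ByteBuffer buff) {
        int type = buff.get() & 255;
        if (type != ARRAY_TAG) {
            throw new IllegalStateException("type: " + type);
        }
        int len = readVarInt(buff);
        long[] row = new long[len];
        for (int i = 0; i < len; i++) {
            row[i] = buff.getLong();
        }
        return row;
    }

    // Variable-length int: 7 payload bits per byte, high bit set on all but the last.
    static void writeVarInt(ByteBuffer buff, int x) {
        while ((x & ~0x7f) != 0) {
            buff.put((byte) (0x80 | (x & 0x7f)));
            x >>>= 7;
        }
        buff.put((byte) x);
    }

    static int readVarInt(ByteBuffer buff) {
        int result = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            int b = buff.get();
            result |= (b & 0x7f) << shift;
            if (b >= 0) { // high bit clear: this was the last byte
                return result;
            }
        }
        throw new IllegalStateException("malformed varint");
    }
}
```

Because the reader checks only the tag and the count, the in-memory type that is produced (an array, a row object) can change without changing the bytes on disk, which is the property the experiment relies on.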

Primary and secondary indexes were also adjusted. But, unfortunately, it's not enough.

@andreitokar Maybe you can remove usages of ValueArray from indexes? You know that area of H2 much better than me.

Nov 06 '19 13:11 katzyn

BTW, does MVSecondaryIndex.Source.Comparator intentionally ignore the ASC / DESC / NULLS FIRST / NULLS LAST clauses of the indexed columns, or is it just an omission?

Nov 06 '19 13:11 katzyn

Sure, I can try. Actually, I did exactly the same exercise two years ago, when I was playing with a compact representation for native values and db rows backed by generated classes. That branch is now hopelessly behind and cannot be merged easily, but I can use it as guidance. Anyway, the question is: what do we replace it with? IMHO, it should be Row/SearchRow, not Value[]. It better represents what the data actually is, won't need the hack with the row key as the last element, and gives more flexibility for custom implementations.
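
To make the "row key as last element" point concrete, here is a hedged sketch of the two shapes. `IndexEntryShapes` and `KeyedRow` are hypothetical names for illustration, not H2 classes, and plain Object values stand in for H2 Value objects.

```java
import java.util.Arrays;

// Hypothetical sketch (not H2 classes): two ways to carry the primary row key
// alongside the indexed column values of a secondary-index entry.
final class IndexEntryShapes {

    // Array form: indexed columns first, row key appended as the last element.
    static Object[] asArray(Object[] indexedColumns, long rowKey) {
        Object[] entry = Arrays.copyOf(indexedColumns, indexedColumns.length + 1);
        entry[indexedColumns.length] = rowKey;
        return entry;
    }

    // Every reader of the array form has to know the positional convention.
    static long rowKeyOf(Object[] entry) {
        return (Long) entry[entry.length - 1];
    }

    // Row form: the key is a named field, separate from the column values.
    static final class KeyedRow {
        final long key;
        final Object[] columns;

        KeyedRow(long key, Object... columns) {
            this.key = key;
            this.columns = columns;
        }
    }
}
```

In the array form the key is just one more element with a different meaning and type; in the row form it is explicit, so code that iterates over the columns cannot mistake it for data.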

Does MVSecondaryIndex.Source.Comparator intentionally ignore the ASC / DESC / NULLS FIRST / NULLS LAST clauses of the indexed columns, or is it just an omission?

Of course it is an omission. Comparison should be delegated to DataType retrieved from RowFactory, as in my "compact" branch, just never made it to master. Hopefully it will this time around.
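
For illustration, a comparator that does honor per-column sort clauses might look like the sketch below. `SortAwareRowComparator` and its flag values are made up for this example, the columns are nullable Integers rather than H2 Value objects, and the default of sorting NULLs first unless NULLS_LAST is set is an assumption of the sketch, not a statement about H2's defaults.

```java
import java.util.Comparator;

// Hypothetical sketch: per-column sort flags applied during key comparison.
final class SortAwareRowComparator implements Comparator<Integer[]> {
    static final int ASCENDING = 0;
    static final int DESCENDING = 1;
    static final int NULLS_LAST = 4;

    private final int[] sortTypes; // one flag set per indexed column

    SortAwareRowComparator(int... sortTypes) {
        this.sortTypes = sortTypes;
    }

    @Override
    public int compare(Integer[] a, Integer[] b) {
        for (int i = 0; i < sortTypes.length; i++) {
            int c = compareColumn(a[i], b[i], sortTypes[i]);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }

    private static int compareColumn(Integer x, Integer y, int sortType) {
        if (x == null || y == null) {
            if (x == y) {
                return 0; // both are null
            }
            // Assumed default: NULLs sort first unless NULLS_LAST is set.
            boolean nullsLast = (sortType & NULLS_LAST) != 0;
            return (x == null) == nullsLast ? 1 : -1;
        }
        int c = Integer.compare(x, y);
        return (sortType & DESCENDING) != 0 ? -c : c;
    }
}
```

A comparator that ignores these flags produces a different order from the one the index definition asks for whenever a column is DESC or uses a non-default NULL ordering.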

Nov 06 '19 15:11 andreitokar

Row (DefaultRow or some other implementation) looks like the reasonable choice.

Nov 06 '19 15:11 katzyn

If you want a memory compact representation then Value[] is probably the best

Nov 06 '19 16:11 grandinj

I would say that the most compact data representation is some kind of RowsList interface that is logically N rows x M columns and lets you access any primitive value directly, e.g. double getDouble(int rowNum, int columnNum). This would allow direct access to serialized data without creating tons of temporary Value and Row objects, and the data could be represented underneath in a very efficient binary form. But this would require lots of changes in the SQL engine.
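
A rough sketch of that idea, under heavy simplification. Only the `getDouble(int rowNum, int columnNum)` signature comes from the comment above; the rest of the `RowsList` interface and the `PackedDoubleRowsList` class are hypothetical, every column is assumed to be a double, and SQL NULL is not handled at all.

```java
// Hypothetical sketch of the idea (no such interface exists in H2).
interface RowsList {
    int rowCount();

    int columnCount();

    double getDouble(int rowNum, int columnNum);
}

// Row-major packing of N x M doubles into one primitive array:
// reading a value allocates no per-value or per-row objects.
final class PackedDoubleRowsList implements RowsList {
    private final double[] data;
    private final int columns;

    PackedDoubleRowsList(int columns, double... data) {
        if (columns <= 0 || data.length % columns != 0) {
            throw new IllegalArgumentException("data does not form whole rows");
        }
        this.columns = columns;
        this.data = data.clone();
    }

    @Override
    public int rowCount() {
        return data.length / columns;
    }

    @Override
    public int columnCount() {
        return columns;
    }

    @Override
    public double getDouble(int rowNum, int columnNum) {
        if (rowNum < 0 || rowNum >= rowCount() || columnNum < 0 || columnNum >= columns) {
            throw new IndexOutOfBoundsException(rowNum + "," + columnNum);
        }
        return data[rowNum * columns + columnNum];
    }
}
```

A realistic version would need per-column types, variable-length values and a NULL representation, which is where most of the complexity would come from.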

Nov 06 '19 19:11 svladykin

@svladykin, yes, it is good to be wealthy and healthy, but in reality data is mutable and we need rows as separate entities. How do you grow / shrink your RowList, especially in serialized form? You can chunk it, but it quickly becomes quite complex, considering persistence. We have a unified persistence layer in the form of key-value maps, and it forces us to have rows as individually serialized entities. I think at least for now we are settled on this.

Nov 06 '19 22:11 andreitokar

In SQL everything can be NULL, including columns with NOT NULL constraint (in outer joins or in a new row). And this issue is not about an internal representation of a table. This issue is about incorrect usage of array data type for rows; this data type needs to be changed and we need to remove its usages that rely on its old behavior.

And of course we process rows and values during query execution separately.

If you want a memory compact representation then Value[] is probably the best

A Row object will be constructed by the index anyway, so technically it doesn't matter: a custom data type can return either a Value[] or a Row directly. But the situation is not that simple. I think that a custom data type should work with objects that extend VersionedValue to avoid allocating a wrapper. Row can extend it (SearchRow, actually), but Value[] can't.
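
The wrapper argument can be sketched as follows. All class and method names here (`VersionedEntry`, `WrappedArrayEntry`, `VersionedRow`, `currentValue()`) are hypothetical and only loosely modeled on the idea of a versioned value; they are not H2's API. The underlying Java fact is that an array type cannot extend a class, while a row class can.

```java
// Hypothetical base type for whatever the transactional map stores.
abstract class VersionedEntry {
    abstract Object currentValue();

    boolean isCommitted() {
        return true;
    }
}

// Wrapper approach: an array cannot extend VersionedEntry,
// so every stored Object[] needs one extra object around it.
final class WrappedArrayEntry extends VersionedEntry {
    private final Object[] values;

    WrappedArrayEntry(Object[] values) {
        this.values = values;
    }

    @Override
    Object currentValue() {
        return values;
    }
}

// Subclass approach: the row is itself the versioned entry, so no wrapper
// object is allocated for it.
final class VersionedRow extends VersionedEntry {
    final Object[] values;

    VersionedRow(Object... values) {
        this.values = values;
    }

    @Override
    Object currentValue() {
        return this;
    }
}
```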

Nov 07 '19 01:11 katzyn

@andreitokar I would keep these RowLists immutable and have some kind of RowListBuilder that can join multiple RowLists together. There must also be a method on RowList to get a range of rows/columns, etc. Think of it like String and StringBuilder. Everything is possible if you are wealthy and healthy enough :)
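
A small sketch of the String / StringBuilder analogy, assuming rows of doubles only. `ImmutableRowList`, `RowListBuilder` and all of their methods are hypothetical names for this example; nothing like them exists in H2, and persistence is ignored entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Immutable list of rows: contents never change after construction.
final class ImmutableRowList {
    private final double[][] rows;

    ImmutableRowList(double[][] rows) {
        // Deep copy so later changes to the caller's arrays are not visible.
        this.rows = new double[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            this.rows[i] = rows[i].clone();
        }
    }

    int rowCount() {
        return rows.length;
    }

    double getDouble(int rowNum, int columnNum) {
        return rows[rowNum][columnNum];
    }

    double[] rowCopy(int rowNum) {
        return rows[rowNum].clone();
    }

    // Returns rows [from, to) as a new immutable list.
    ImmutableRowList range(int from, int to) {
        return new ImmutableRowList(Arrays.copyOfRange(rows, from, to));
    }
}

// Mutable builder that joins rows and whole lists, like StringBuilder.
final class RowListBuilder {
    private final List<double[]> rows = new ArrayList<>();

    RowListBuilder appendRow(double... row) {
        rows.add(row.clone());
        return this;
    }

    RowListBuilder append(ImmutableRowList list) {
        for (int i = 0; i < list.rowCount(); i++) {
            rows.add(list.rowCopy(i));
        }
        return this;
    }

    ImmutableRowList build() {
        return new ImmutableRowList(rows.toArray(new double[0][]));
    }
}
```

Growing or shrinking then means building a new list from ranges of old ones, which keeps each list safe to share but copies data on every change in this naive form.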

Nov 07 '19 05:11 svladykin

But again, I agree that it would be a very complex rewrite of the SQL engine and it is better to start with just unifying row representation across all parts of H2 as @katzyn suggested.

Nov 07 '19 05:11 svladykin