DRILL-8188: Convert HDF5 format to EVF2

Description

Use EVF V2 instead of old V1.

Includes two bug fixes for the V2 framework:

Projected an unprojected column error in array object
IndexOutOfBoundsException at add column

Documentation

N/A

Testing

Use the CI.
Add the tests for the bugfix.

Apr 04 '22 14:04 luocooong

This PR is getting a bit complex with the bug or two that it uncovered. Let me explain a bit about how EVF2 works. There are two cases: wildcard projection (SELECT *) and explicit projection (SELECT a, b, c). The way EVF2 works is different in these two cases.

Then, for each reader, there are three other cases. The reader might know all its columns before the file is even opened. The PCAP reader is an example: all PCAP files have the same schema, so we don't need to look at the file to know the schema. The second case is files where we can learn the schema when opening the file. Parquet and CSV are examples: we can learn the Parquet schema from the file metadata, and the CSV schema from the headers. The last case is where we don't know the schema until we read each row. JSON is the best example.

So, now we have six cases to consider. This is why EVF2 is so complex!

For the wildcard, EVF2 "discovers" columns as the reader creates them: either via the up-front schema, or as the reader reads data. In JSON, for example, we can discover a new column at any time. Once a column is added, EVF2 will automatically fill in null values if values are missing. In the extreme case, it can fill in nulls for an entire batch. Because of the wildcard, all discovered columns are materialized and added to the result set. If reading JSON, and a column does not appear until the third batch, then the first two won't contain that column, but the third batch will have a schema change and will include the column. This can cause a problem for operators such as joins, sorts, or aggregations that have to store a collection of rows: not all of them can handle a schema change.
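
As a rough illustration of the wildcard case, here is a toy Java model. It is not Drill or EVF2 code: the class `WildcardDiscoverySketch` and its methods are made up for this sketch, and a "batch" is reduced to the list of column names the reader happened to see. It only shows that discovered columns accumulate, and that a late column produces a schema change in the batch where it first appears.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model (hypothetical, not Drill code): tracks which columns a wildcard
// scan has discovered so far and reports the schema of each outgoing batch.
class WildcardDiscoverySketch {

  // Each inner list holds the column names the reader saw in one batch.
  // Once discovered, a column stays in every later batch (null-filled if absent).
  static List<List<String>> batchSchemas(List<List<String>> columnsSeenPerBatch) {
    Set<String> discovered = new LinkedHashSet<>();
    List<List<String>> schemas = new ArrayList<>();
    for (List<String> seen : columnsSeenPerBatch) {
      discovered.addAll(seen);
      schemas.add(new ArrayList<>(discovered));
    }
    return schemas;
  }

  // Returns the 1-based numbers of batches whose schema differs from the previous batch.
  static List<Integer> schemaChangeBatches(List<List<String>> schemas) {
    List<Integer> changes = new ArrayList<>();
    for (int i = 1; i < schemas.size(); i++) {
      if (!schemas.get(i).equals(schemas.get(i - 1))) {
        changes.add(i + 1);
      }
    }
    return changes;
  }

  public static void main(String[] args) {
    // Column "b" does not show up until the third batch.
    List<List<String>> seen = List.of(List.of("a"), List.of("a"), List.of("a", "b"));
    List<List<String>> schemas = batchSchemas(seen);
    System.out.println(schemas);
    System.out.println(schemaChangeBatches(schemas));
  }
}
```

With the input above, the first two batches carry only `a`, and the third batch is the one flagged as a schema change.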

Now, for the explicit schema case, EVF2 knows what columns the user wants: those in the list. EVF2 waits as long as it can, hoping the reader will provide the columns. Again, the reader can provide them up front, before the first record, or as the read proceeds (as in JSON.) As the reader provides each column, EVF2 has to decide: do we need that column? If so, we create a vector and a column writer: we materialize the column. If the column is not needed, EVF2 creates a dummy column writer. Now the interesting part. Suppose we get to the end of the first batch, the query wants column c, and the reader has never defined column c? What do we do? In this case, we have to make something up. Historically, Drill would make up a Nullable Int, with all-null values. EVF added the ability to specify the type for such columns, and we use that. If a provided schema is available, then the user tells us the type.
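
The per-column decision in the explicit case can be sketched roughly as follows. Again, this is a toy Java model and not the EVF2 implementation: `ExplicitProjectionSketch`, the `Action` enum, and `plan()` are hypothetical names, and real EVF2 makes these decisions incrementally as the reader defines columns rather than in one pass.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model (hypothetical, not Drill code) of what happens to each column
// under explicit projection.
class ExplicitProjectionSketch {

  enum Action {
    MATERIALIZE,     // projected and provided by the reader: real vector and writer
    DUMMY_WRITER,    // provided by the reader but not projected: writes are discarded
    FILL_WITH_NULLS  // projected but never provided: an all-null column is made up
  }

  static Map<String, Action> plan(List<String> projected, List<String> readerColumns) {
    Map<String, Action> plan = new LinkedHashMap<>();
    // Columns the reader defines: keep them only if the query asked for them.
    for (String col : readerColumns) {
      plan.put(col, projected.contains(col) ? Action.MATERIALIZE : Action.DUMMY_WRITER);
    }
    // Columns the query asked for that the reader never defined.
    for (String col : projected) {
      plan.putIfAbsent(col, Action.FILL_WITH_NULLS);
    }
    return plan;
  }

  public static void main(String[] args) {
    // SELECT a, c against a reader that only ever defines a and b.
    System.out.println(plan(List.of("a", "c"), List.of("a", "b")));
  }
}
```

In the example, `a` is materialized, `b` gets a dummy writer, and `c` is the made-up all-null column whose type has to be guessed.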

Now we get to another interesting part. What if we guessed, say, Varchar, but the column later shows up as a JSON array? We're stuck: we can't go back and redo the old batches. We end up with a "hard" schema change. Bad things happen unless the query is really simple. This is the fun of Drill's schemaless system.

With that background, we can try to answer your question. The answer is: it depends. If the reader says, "hey Mr. EVF2, here is the full schema I will read, I promise not to discover more columns", then EVF2 will throw an exception if later you say, "ha! just kidding. Actually, I discovered another one." I wonder if that's what is happening here.

If, however, the reader left the schema open, and said, "here are the columns I know about now, but I might find more later", then EVF2 will expect more columns, and will handle them as above: materialize them if they are projected or if we have a wildcard, provide a dummy writer if we have explicit projection and the column is not projected.
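
The difference between the two promises can be sketched like this. It is a toy Java model, not the EVF2 API: `SchemaContractSketch` and its `complete` flag are hypothetical stand-ins for "the reader declared the schema complete" versus "the reader left the schema open", and the exception type is an assumption made for the sketch.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model (hypothetical, not Drill code) of an open versus a closed reader schema.
class SchemaContractSketch {
  private final Set<String> columns = new LinkedHashSet<>();
  private final boolean complete;

  // complete == true:  "here is the full schema, I promise not to discover more"
  // complete == false: "here is what I know now, but I might find more later"
  SchemaContractSketch(List<String> declared, boolean complete) {
    this.columns.addAll(declared);
    this.complete = complete;
  }

  // Called when the reader asks for a column that may or may not be declared yet.
  void addColumn(String name) {
    if (columns.contains(name)) {
      return; // already known, nothing to do
    }
    if (complete) {
      throw new IllegalStateException(
          "Schema was declared complete; cannot add column: " + name);
    }
    columns.add(name);
  }

  Set<String> columns() {
    return columns;
  }

  public static void main(String[] args) {
    SchemaContractSketch open = new SchemaContractSketch(List.of("a"), false);
    open.addColumn("b"); // fine: the schema was left open
    System.out.println(open.columns());

    SchemaContractSketch closed = new SchemaContractSketch(List.of("a"), true);
    try {
      closed.addColumn("b"); // breaks the promise
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```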

In this PR, we have two separate cases in the reader constructor.

In the if path, we define a "reader schema", and reserve the right to add more columns later. That's what the false argument to tableSchema() means.
In the else path, we define no schema at all: we don't call tableSchema().

This means the reader is doing two entirely different things. In the if case, we define the schema and we just ask for column writers by name. In the else case, we don't define a schema, and we have to define the column when we ask for the column writers.

This seems horribly complicated! I wonder, are we missing logic in the then case? Or, should there be two distinct readers, each of which implements one of the above cases?

Apr 19 '22 05:04 paul-rogers

Found the bug. It is in ColumnBuilder which seems to be missing code to handle an unprojected repeated list. This bug then caused the other "bugs" that we discussed in the review: those bits of code are working as they should. The problem is that the result set loader is materializing a vector when it should not. It will take some time to remember how all this stuff works. Stay tuned.

Apr 25 '22 07:04 paul-rogers

@paul-rogers Great! Thank you for the quick work. After the most recent discussion, I completely discarded my previous code revision, so I guess the handling of the unprojected case might have been lost. Actually, I've added this function locally, but I'm not sure it's correct. Would you mind checking mine before you submit the new revision?

Apr 25 '22 08:04 luocooong

Hi @luocooong Thank you for this PR. Where are we in terms of getting it merged?

May 26 '22 01:05 cgivre

Hi @cgivre Thank you for paying attention to this PR. The pull request cannot be merged yet: Paul is going to re-review the V2 code, and we're going to fix the framework bugs mentioned above.

May 26 '22 01:05 luocooong

Converted to draft to prevent merging.

Jul 11 '22 08:07 jnturton

Hey @luocooong @paul-rogers I hope all is well. I wanted to check in on this PR to see where we are. At this point, nearly all the other format plugins have been converted to EVF V2.

The other outstanding ones are the image format and LTSV. I'd really like to see this merged so that we can remove the EVF V1 code.

Do you think we could get this ready to go soon?

Nov 02 '22 14:11 cgivre

I think I hosed the version control somehow.... This PR should only modify a few files in the HDF5 reader.

Jan 09 '24 02:01 cgivre

It seems you did this work on top of the master with my unsquashed commits. When you try to push, those commits come along for the ride. I think you should grab the latest master, then rebase your branch on it.

Plan B is to a) grab the latest master, and b) create a new branch that cherry-picks the commit(s) you meant to add.

If even this doesn't work, then I'll clean up this branch for you since I created the mess in the first place...

Jan 09 '24 02:01 paul-rogers

@paul-rogers I attempted to fix. I kind of suck at git, so I think it's more or less correct now, but there was probably a better way to do this.

Jan 09 '24 04:01 cgivre

I think you still want something like

git pull --rebase upstream master
git push --force-with-lease

Jan 09 '24 04:01 jnturton

I see Git's "patch contents already upstream" feature doesn't automatically clean up the unwanted commits. I've dropped them manually in a new branch in my fork and now suggest

git reset --hard origin/master
git pull --rebase https://github.com/jnturton/drill.git 8188-hdf5-evf2
git push --force # to luocooong's fork

Jan 09 '24 04:01 jnturton

@jnturton I did as you suggested. Would you mind please taking a look?

Jan 10 '24 15:01 cgivre

Just working through the review comments that @paul-rogers left (the ones unrelated to the needed functionality that was missing from EVF2).

Jan 11 '24 09:01 jnturton

Did the recent EVF revisions allow the tests for this PR to pass? Is there anything that is still missing? Also, did the excitement over my botched merge settle down and are we good now?

Jan 11 '24 20:01 paul-rogers

All the unit tests pass. Whether that means that everything is working is another question, but this plugin has a decent number of tests, so I'd feel pretty good.

Jan 11 '24 22:01 cgivre
