dolt
dolt copied to clipboard
`dolt diff` shows NULL values for virtual generated columns
Repro steps:
create table test(pk int primary key, c0 int generated always as (pk + 1));
insert into test(pk) values (1), (2);
dolt diff will produce the following output:
diff --dolt a/test b/test
added table
+CREATE TABLE `test` (
+ `pk` int NOT NULL,
+ `c0` int GENERATED ALWAYS AS ((`pk` + 1)),
+ PRIMARY KEY (`pk`)
+) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_bin;
+---+----+------+
| | pk | c0 |
+---+----+------+
| + | 1 | NULL |
| + | 2 | NULL |
+---+----+------+
This is likely because c0 is virtual, so the tuples in the table don't actually contain a value for it. Displaying the correct values would require evaluating the stored expression for that column, which requires a SQL engine.
As for the correct behavior here, I'm leaning towards excluding virtual generated columns from dolt diff entirely, since they aren't actually part of the data in the diff.
hi, I would like to take this issue. could you assign it to me?
Sure! Thank you for your time!
This likely happened because Dolt previously didn't support virtual columns. And adding virtual columns complicates things.
When there are no virtual columns, there's no difference between "table rows stored on disk" and "table rows seen by the engine". But once a table has virtual columns, that distinction matters, and any code that assumes that the two are identical may get incorrect results.
In this case, it's likely that the dolt diff code is reading the rows directly out of storage and assuming that they match the table schema. Due to the way we store rows, adding a NULL to the end of a row doesn't change the bytes on disk: so for example, the row (1, NULL) and the row (1,) have the exact same bytes. So dolt diff is likely reading the latter and treating it like it's the former.
I think that the best solution here is to simply omit any virtual generated columns from the output, since computing the correct value for the column would be more complicated and potentially slow.
The code for the diff command is in go/cmd/dolt/commands/diff.go. Most of the logic for the command works by writing SQL queries against system tables. Most likely this will involve changing the behavior of the dolt_diff() system table function to not have columns for virtual generated columns. That table function is defined in go/libraries/doltcore/sqle/dtablefunctions/dolt_diff_table_function.go.
Let me know if you have any questions! There's a lot going on in these files because they interact with the SQL engine, but stepping through a sample dolt diff command in the debugger should help you figure out what each part is doing.
Is anyone still working on this? I'm happy to work on this or #5861, whichever y'all think would be best to do first for dolt. Thanks.
I don't think that we can simply drop these columns from the table. There are use cases where you would want them, and users could be depending on them today.
One solution would be a session variable to indicate that the user does or does not want the table. We can make the default not include them, and then in the upgrade notes we can instruct users to set this variable if they want the columns.
@macneale4 raises a good point, and he's convinced me that my original suggestion was wrong: we shouldn't just be dropping the columns in the results.
What makes this issue tricky is that in Dolt, there's two different structs for accessing tables: one for accessing tables via a SQL engine (sqle.DoltTable) and one for interacting with the storage layer directly (doltdb.Table). The diff algorithm accesses the tables directly through the storage layer, but the method for resolving virtual columns exists entirely in the SQL engine. Both interfaces have access to the same underlying data, so incorporating both is possible, but it's non-obvious how to do it. I think I was wrong to label this was "good first issue", because getting this right is going to be complicated.
thank you both. let me think about this and circle back with some questions
@nicktobey, are you saying there is another way to accomplish the same thing or just that it was tad harder than you initially thought. I haven't thought much about it, but I've only really though about the session idea which would simply do as Neil suggested while keeping most of or the general gist the code in the PR I closed.
@codeaucafe I'm saying that I think the desired behavior is more complicated than I originally described in the issue, and this PR, while it matches what I proposed in the original writeup, doesn't fully solve the issue because it doesn't handle the case where we do want virtual columns in the diff and need to emit correct values for them.
I really appreciate you creating the PR for this. And if we go with Neil's suggestion to have a session variable control whether virtual columns are emitted in diffs, then your PR would be an excellent way to implement the behavior where we want to remove those columns from the diff.
But I'm concerned about the complexity of adding an additional session variable and code path when it ultimately doesn't let us close this issue. So I don't think we should merge this until we also fix the behavior for when virtual columns are displayed.
@nicktobey no problem, I'm happy to help. Thank you for clarifying.
I'd be more than happy to help out working on the correct fix showing and not showing virtual columns in the diff when they should be shown and not shown, respectively.
However, note that I'll be pretty busy all of September due to work and travel visiting family, so I may not really work much on it until the very end of Sept. If no one has started working on it by end of September, then I'd be happy to pick it back up on it. I'll sync with y'all here and/or discord.
In meantime, I'll try and debug it further and better understand engine and storage layers, and ask questions in discord as they pop up.
Btw I closed the corresponding PR last given what we've discussed.
Thanks again for your assistance as well as the team's help in general.
cheers