gitbase icon indicating copy to clipboard operation
gitbase copied to clipboard

Natural join seems to eliminate rows which it shouldn't

Open alexpdp7 opened this issue 6 years ago • 6 comments

MySQL [gitbase]> select blob_hash, repository_id from blobs natural join repositories where blob_hash in ('93ec5b4525363844ddb1981adf1586ebddbc21c1', 'aad34590345310fe813fd1d9eff868afc4cea10c', 'ed82eb69daf806e521840f4320ea80d4fe0af435');
+------------------------------------------+-------------------------------------+
| blob_hash                                | repository_id                       |
+------------------------------------------+-------------------------------------+
| aad34590345310fe813fd1d9eff868afc4cea10c | github.com/bblfsh/javascript-driver |
| ed82eb69daf806e521840f4320ea80d4fe0af435 | github.com/src-d/enry               |
| aad34590345310fe813fd1d9eff868afc4cea10c | github.com/bblfsh/python-driver     |
| 93ec5b4525363844ddb1981adf1586ebddbc21c1 | github.com/src-d/go-mysql-server    |
| aad34590345310fe813fd1d9eff868afc4cea10c | github.com/bblfsh/ruby-driver       |
| ed82eb69daf806e521840f4320ea80d4fe0af435 | github.com/src-d/gitbase            |
+------------------------------------------+-------------------------------------+
6 rows in set (14.90 sec)

MySQL [gitbase]> select blob_hash, repository_id from blobs where blob_hash in ('93ec5b4525363844ddb1981adf1586ebddbc21c1', 'aad34590345310fe813fd1d9eff868afc4cea10c', 'ed82eb69daf806e521840f4320ea80d4fe0af435');
+------------------------------------------+-------------------------------------+
| blob_hash                                | repository_id                       |
+------------------------------------------+-------------------------------------+
| aad34590345310fe813fd1d9eff868afc4cea10c | github.com/bblfsh/python-driver     |
| aad34590345310fe813fd1d9eff868afc4cea10c | github.com/bblfsh/javascript-driver |
| ed82eb69daf806e521840f4320ea80d4fe0af435 | github.com/src-d/enry               |
| aad34590345310fe813fd1d9eff868afc4cea10c | github.com/bblfsh/ruby-driver       |
| 93ec5b4525363844ddb1981adf1586ebddbc21c1 | github.com/src-d/gitbase            |
| ed82eb69daf806e521840f4320ea80d4fe0af435 | github.com/src-d/gitbase            |
| 93ec5b4525363844ddb1981adf1586ebddbc21c1 | github.com/src-d/go-mysql-server    |
| ed82eb69daf806e521840f4320ea80d4fe0af435 | github.com/src-d/go-mysql-server    |
+------------------------------------------+-------------------------------------+
8 rows in set (0.13 sec)

also note that removing the natural join makes things go much faster- it was my understanding that normally we want to join with repositories to benefit from some specific optimizations (although I'm guessing that filtering with blob_hash makes those optimizations moot).

alexpdp7 avatar Oct 15 '19 08:10 alexpdp7

Normally we don't want to join with repositories unless there are already joins involved. When querying a single table like blobs, they usually have other optimizations in place.

For example, blobs with a filter like blob_hash IN list only reads the given blobs in each repository. That's why no join it's faster.

As with everything: it depends on the query, depending on what you want, some optimizations may be better than others for performance.

In any case, I reproduced the bug and there's actually an issue. It seems to not return the repeated rows for some reason.

erizocosmico avatar Oct 15 '19 09:10 erizocosmico

Yeah, I suspected something like that. Anyway, for my use case lack of duplicated rows is not an issue, so for my this is not high priority.

alexpdp7 avatar Oct 15 '19 09:10 alexpdp7

This bug is really weird. The natural join is the one returning the correct result. If you remove the optimization in blobs table it returns the same. So, there something going on because repo.BlobObjects() doesn't return these blobs, but accessing them directly does

erizocosmico avatar Oct 15 '19 10:10 erizocosmico

@alexpdp7 are you using siva files got from gitcollector?

I tried with regular repositories and it didn't happen.

erizocosmico avatar Oct 15 '19 12:10 erizocosmico

Yup, it's using siva

alexpdp7 avatar Oct 15 '19 13:10 alexpdp7

Narrowed it down to a siva issue and reported it to go-borges: https://github.com/src-d/go-borges/issues/90, so leaving this as blocked until it's solved on their side.

erizocosmico avatar Oct 15 '19 14:10 erizocosmico