drill
drill copied to clipboard
DRILL-7745: Add storage plugin for IPFS
Add storage plugin for IPFS. See detailed introduction here.
TODOs:
- [x] Port to Drill 1.18.0
- [ ] Add more tests
- [x] Support JSON reader
Authors:
This storage plugin was contributed by Bowen Ding, Yuedong Xu and Liang Wang. The authors are affiliated with Fudan University.
@dbw9580 Thanks for contributing this. Do you want review comments now?
Yes, please.
Some major problems I'm working on are:
- support for different data formats. Currently only JSON files are supported, and CSV support is removed due to changes introduced in v1.17 and v1.18. The original implementation was a copy-paste of the easy format plugin (
org/apache/drill/exec/store/easy/text/TextFormatPlugin.java
). I wonder if there is a better way to reuse the code there. - support for
CREATE TABLE
statements. This will require changes to the Drill framework responsible for SQL parsing, query plan generation, etc. I would appreciate opinions from you.
Added small refactoring ideas. Did not check implementation details
@dbw9580 One more comment. You'll want to add your plugin to the distribution files so that it will be built when Drill is built.
You'll have to do that here: https://github.com/apache/drill/blob/7d5b6116ba524769f8ba43ff03291eff62de1205/distribution/pom.xml#L300-L304
and here: https://github.com/apache/drill/blob/7d5b6116ba524769f8ba43ff03291eff62de1205/distribution/src/assemble/component.xml#L28-L56
@dbw9580 We're looking to get a release of Drill out by the end of June. If you can get these revisions done soon, I can help expedite the review so that we can get this (or at least an MVP of this) into Drill 1.18.
Yes I'll try my best. I was stuck with CSV and writer support, and haven't had much progress so far. Can we settle with basic JSON and reader support for now, and maybe add those later?
@dbw9580 Why don't you break out CSV and the writer support as separate PRs. Then we can get this in, and work on those for the next release.
I've never done a writer, but I can assist with the CSV reader. Take a look here: https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpCSVBatchReader.java
The HTTP storage plugin can read either JSON
or CSV
. I attempted to use Drill's built in CSV reader but I would have had to do a lot of work on the CSV reader to get it to work... So... just used this simple version.
break out CSV and the writer support as separate PRs.
Great. Will do.
I attempted to use Drill's built in CSV reader but I would have had to do a lot of work on the CSV reader to get it to work... So... just used this simple version.
Ok. The text reader module from the easy format plugin framework looks good, but I couldn't figure out a way to reuse that part of code in this plugin. Copy-pasting code is not accepted, I assume?
@dbw9580 I had another thought about the design. I wonder if instead of "rolling your own" readers, if you could do what the file plugin does and make use of all the existing format plugins. Basically, that way any new file format for which there is a format plugin could also be read by IPFS. I don't know IPFS that well to know if that would even make sense, or not.
That also might simplify the code a lot.
@dbw9580 This is definitely making progress. Will testing require an IPFS installation?
Yes, a running IPFS daemon is required.
@dbw9580 This is definitely making progress. Will testing require an IPFS installation?
Yes, a running IPFS daemon is required.
Can you write some tests that do not require the IPFS daemon? I realize that we'll need it for some unit tests, but is it possible to test various components w/o the daemon?
some tests that do not require the IPFS daemon
I guess for the storage plugin config, scanSpec, etc, they are static and independent on a running IPFS daemon. Also for the json reader, I think I can supply test data from memory directly, w/o actually retrieving it from IPFS.
@cgivre I've added more tests. The tests are not passing, something about Error while applying rule DrillScanRule
. However, I was able to successfully execute the test queries through Drill web interface. I don't know how to fix these tests?
Edit: attach log file. org.apache.drill.exec.store.ipfs.TestIPFSQueries.txt
@cgivre I've added more tests. The tests are not passing, something about
Error while applying rule DrillScanRule
. However, I was able to successfully execute the test queries through Drill web interface. I don't know how to fix these tests?Edit: attach log file. org.apache.drill.exec.store.ipfs.TestIPFSQueries.txt
It appears that the query is not getting through the planning phase. My suggestion is to take a look at this tutorial about writing storage plugins: https://github.com/paul-rogers/drill/wiki/Storage-Plugin, and and specifically, follow the debugging procedures that Paul outlines. My hunch here is that something is going wrong with the schema resolution.
@cgivre it turned out what was blocking the tests was that the default number of providers in test config was too large, as a result IPFS could not find any other providers in time, thus the TimeoutException
s. I wish the test logs had included full stack traces, which could have saved me hours looking into the Drill planner internals... 😓
Cleaning up the PR. I was thinking about the unit tests and it might be good to include unit tests using Mockito to mock up some of the various components. That way we can test at least some of this without the IPFS daemon. I can post an example if you'd like.
Would appreciate that.
Cleaning up the PR. I was thinking about the unit tests and it might be good to include unit tests using Mockito to mock up some of the various components. That way we can test at least some of this without the IPFS daemon. I can post an example if you'd like.
Would appreciate that.
Take a look here for an example:
https://github.com/apache/drill/blob/5900cdfaae20e216d4b87795bd2efc8199e648e6/contrib/storage-elastic/src/test/java/org/apache/drill/exec/store/elasticsearch/ElasticSearchGroupScanTest.java#L42-L96
@dbw9580 I'll take a look over the weekend. Thanks for the contribution!
@dbw9580
Please verify that the project builds and passes all checkstyles. TestIPFQueries
fails the checkstyle due to unused imports.
TestIPFQueries
fails the checkstyle due to unused imports. @cgivre Hmm I don't see any unused imports in this file and my builds are passing.
@dbw9580 I redownloaded and it built for me. Please disregard previous comments.
I'm still having issues actually getting the unit tests that require the IPFS daemon to actually execute.
@cgivre Actually I am having trouble making that test run, too. I keep getting errors like "connection rejected: /127.0.0.1:31011" or "Protocol family unavailable: /0:0:0:0:0:0:0:1:31011". I can test successfully manually through the web ui with drill-embedded, though.
Can you try testing through the web ui, too? The simple dataset should be easy to add to IPFS and test:
ipfs object patch set-data $(ipfs object new) <path-to-simple-dataset.json>
This will return the hash of the simple dataset, which is QmcbeavnEofA6NjG7vkpe1yLJo6En6ML4JnDooDn1BbKmR
.
Then run a query through the web ui: select * from ipfs.`/ipfs/QmcbeavnEofA6NjG7vkpe1yLJo6En6ML4JnDooDn1BbKmR#json`
.
If the query takes too long to complete, try reducing the timeout values as well as the max-peers-per-leaf
value in the plugin config.
If I leave an instance of Drill running and then run the unit test (TestIPFSQueries
), then it passes. I think the unit test does not actually build and run a full Drill server, which is why the connections are rejected.
@dbw9580
The ClusterTest
class is supposed to start a Drill cluster so that you can execute queries. You should not need to have a Drill cluster running for the unit tests to complete.
I think the reason this isn't doing what you're expecting is that in the initIPFS
function in IPFSTestSuit
you are creating a plugin with a null configuration and hence isn't initializing correctly.
I stepped through testSimple()
with the debugger and the dataset
object is null
, hence the test fails. My suspicion is that there is one small step being missed here. Could you please take a look and step through this to make sure that Drill is being initialized correctly?
Thanks
@cgivre Does Drill support connections from IPv6 sockets? Is it enabled by default or do I have to toggle some configuration items? The "protocol family unavailable" error could be due to lack of support for IPv6.
@dbw9580 I believe Drill does support connections from IPv6 sockets. There was a recent PR for this in fact: (https://github.com/apache/drill/pull/1857) but I'm not sure if that is directly relevant. Were you able to get it working?
The connection rejected: /127.0.0.1:31011
failure was because sometimes Drill does not bind to the default ports (31010, 31011, 31012
). It can bind to later ports like 31013, 31014, 31015
, hence the connection was rejected.
I believe the reason Drill didn't bind to the default ports is that those ports was used by the process from the last test run and had not been recycled by the system. If I wait for a minute or two before starting another round of testing, it's likely the test will pass.
This is part of DRILL-7754, but I haven't come up with a plan to reliably store the ports info in IPFS.
@dbw9580 I believe Drill does support connections from IPv6 sockets. There was a recent PR for this in fact: (#1857) but I'm not sure if that is directly relevant. Were you able to get it working?
I don't see Drill binding to any IPv6 address in ss -6ltnp
. I blocked IPv6 addresses in 9494a30 and the tests are now passing (most of the time, due to https://github.com/apache/drill/pull/2084#issuecomment-674045895).
@dbw9580 The unit tests are passing now on my machine.
@cgivre I tried to set the ports to their default values in c090a43, but it did not seem to do the trick. Why is that?