drill icon indicating copy to clipboard operation
drill copied to clipboard

DRILL-7745: Add storage plugin for IPFS

Open dbw9580 opened this issue 4 years ago • 42 comments

Add storage plugin for IPFS. See detailed introduction here.

TODOs:

  • [x] Port to Drill 1.18.0
  • [ ] Add more tests
  • [x] Support JSON reader

Authors:

This storage plugin was contributed by Bowen Ding, Yuedong Xu and Liang Wang. The authors are affiliated with Fudan University.

dbw9580 avatar May 31 '20 17:05 dbw9580

@dbw9580 Thanks for contributing this. Do you want review comments now?

cgivre avatar May 31 '20 17:05 cgivre

Yes, please.

Some major problems I'm working on are:

  • support for different data formats. Currently only JSON files are supported, and CSV support is removed due to changes introduced in v1.17 and v1.18. The original implementation was a copy-paste of the easy format plugin (org/apache/drill/exec/store/easy/text/TextFormatPlugin.java). I wonder if there is a better way to reuse the code there.
  • support for CREATE TABLE statements. This will require changes to the Drill framework responsible for SQL parsing, query plan generation, etc. I would appreciate opinions from you.

dbw9580 avatar May 31 '20 17:05 dbw9580

Added small refactoring ideas. Did not check implementation details

sanel avatar May 31 '20 18:05 sanel

@dbw9580 One more comment. You'll want to add your plugin to the distribution files so that it will be built when Drill is built.

You'll have to do that here: https://github.com/apache/drill/blob/7d5b6116ba524769f8ba43ff03291eff62de1205/distribution/pom.xml#L300-L304

and here: https://github.com/apache/drill/blob/7d5b6116ba524769f8ba43ff03291eff62de1205/distribution/src/assemble/component.xml#L28-L56

cgivre avatar Jun 02 '20 17:06 cgivre

@dbw9580 We're looking to get a release of Drill out by the end of June. If you can get these revisions done soon, I can help expedite the review so that we can get this (or at least an MVP of this) into Drill 1.18.

cgivre avatar Jun 18 '20 15:06 cgivre

Yes I'll try my best. I was stuck with CSV and writer support, and haven't had much progress so far. Can we settle with basic JSON and reader support for now, and maybe add those later?

dbw9580 avatar Jun 18 '20 15:06 dbw9580

@dbw9580 Why don't you break out CSV and the writer support as separate PRs. Then we can get this in, and work on those for the next release.

I've never done a writer, but I can assist with the CSV reader. Take a look here: https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpCSVBatchReader.java

The HTTP storage plugin can read either JSON or CSV. I attempted to use Drill's built in CSV reader but I would have had to do a lot of work on the CSV reader to get it to work... So... just used this simple version.

cgivre avatar Jun 18 '20 15:06 cgivre

break out CSV and the writer support as separate PRs.

Great. Will do.

I attempted to use Drill's built in CSV reader but I would have had to do a lot of work on the CSV reader to get it to work... So... just used this simple version.

Ok. The text reader module from the easy format plugin framework looks good, but I couldn't figure out a way to reuse that part of code in this plugin. Copy-pasting code is not accepted, I assume?

dbw9580 avatar Jun 18 '20 16:06 dbw9580

@dbw9580 I had another thought about the design. I wonder if instead of "rolling your own" readers, if you could do what the file plugin does and make use of all the existing format plugins. Basically, that way any new file format for which there is a format plugin could also be read by IPFS. I don't know IPFS that well to know if that would even make sense, or not.

That also might simplify the code a lot.

cgivre avatar Jun 19 '20 19:06 cgivre

@dbw9580 This is definitely making progress. Will testing require an IPFS installation?

Yes, a running IPFS daemon is required.

dbw9580 avatar Jun 23 '20 01:06 dbw9580

@dbw9580 This is definitely making progress. Will testing require an IPFS installation?

Yes, a running IPFS daemon is required.

Can you write some tests that do not require the IPFS daemon? I realize that we'll need it for some unit tests, but is it possible to test various components w/o the daemon?

cgivre avatar Jun 23 '20 01:06 cgivre

some tests that do not require the IPFS daemon

I guess for the storage plugin config, scanSpec, etc, they are static and independent on a running IPFS daemon. Also for the json reader, I think I can supply test data from memory directly, w/o actually retrieving it from IPFS.

dbw9580 avatar Jun 23 '20 02:06 dbw9580

@cgivre I've added more tests. The tests are not passing, something about Error while applying rule DrillScanRule. However, I was able to successfully execute the test queries through Drill web interface. I don't know how to fix these tests?

Edit: attach log file. org.apache.drill.exec.store.ipfs.TestIPFSQueries.txt

dbw9580 avatar Jun 27 '20 15:06 dbw9580

@cgivre I've added more tests. The tests are not passing, something about Error while applying rule DrillScanRule. However, I was able to successfully execute the test queries through Drill web interface. I don't know how to fix these tests?

Edit: attach log file. org.apache.drill.exec.store.ipfs.TestIPFSQueries.txt

It appears that the query is not getting through the planning phase. My suggestion is to take a look at this tutorial about writing storage plugins: https://github.com/paul-rogers/drill/wiki/Storage-Plugin, and and specifically, follow the debugging procedures that Paul outlines. My hunch here is that something is going wrong with the schema resolution.

cgivre avatar Jun 28 '20 04:06 cgivre

@cgivre it turned out what was blocking the tests was that the default number of providers in test config was too large, as a result IPFS could not find any other providers in time, thus the TimeoutExceptions. I wish the test logs had included full stack traces, which could have saved me hours looking into the Drill planner internals... 😓

dbw9580 avatar Jun 28 '20 16:06 dbw9580

Cleaning up the PR. I was thinking about the unit tests and it might be good to include unit tests using Mockito to mock up some of the various components. That way we can test at least some of this without the IPFS daemon. I can post an example if you'd like.

Would appreciate that.

dbw9580 avatar Jul 09 '20 14:07 dbw9580

Cleaning up the PR. I was thinking about the unit tests and it might be good to include unit tests using Mockito to mock up some of the various components. That way we can test at least some of this without the IPFS daemon. I can post an example if you'd like.

Would appreciate that.

Take a look here for an example:

https://github.com/apache/drill/blob/5900cdfaae20e216d4b87795bd2efc8199e648e6/contrib/storage-elastic/src/test/java/org/apache/drill/exec/store/elasticsearch/ElasticSearchGroupScanTest.java#L42-L96

cgivre avatar Jul 10 '20 04:07 cgivre

@dbw9580 I'll take a look over the weekend. Thanks for the contribution!

cgivre avatar Aug 07 '20 04:08 cgivre

@dbw9580 Please verify that the project builds and passes all checkstyles. TestIPFQueries fails the checkstyle due to unused imports.

cgivre avatar Aug 13 '20 03:08 cgivre

TestIPFQueries fails the checkstyle due to unused imports. @cgivre Hmm I don't see any unused imports in this file and my builds are passing.

dbw9580 avatar Aug 13 '20 15:08 dbw9580

@dbw9580 I redownloaded and it built for me. Please disregard previous comments.

cgivre avatar Aug 13 '20 15:08 cgivre

I'm still having issues actually getting the unit tests that require the IPFS daemon to actually execute.

@cgivre Actually I am having trouble making that test run, too. I keep getting errors like "connection rejected: /127.0.0.1:31011" or "Protocol family unavailable: /0:0:0:0:0:0:0:1:31011". I can test successfully manually through the web ui with drill-embedded, though.

Can you try testing through the web ui, too? The simple dataset should be easy to add to IPFS and test:

ipfs object patch set-data $(ipfs object new) <path-to-simple-dataset.json>

This will return the hash of the simple dataset, which is QmcbeavnEofA6NjG7vkpe1yLJo6En6ML4JnDooDn1BbKmR.

Then run a query through the web ui: select * from ipfs.`/ipfs/QmcbeavnEofA6NjG7vkpe1yLJo6En6ML4JnDooDn1BbKmR#json` . If the query takes too long to complete, try reducing the timeout values as well as the max-peers-per-leaf value in the plugin config.

dbw9580 avatar Aug 13 '20 17:08 dbw9580

If I leave an instance of Drill running and then run the unit test (TestIPFSQueries), then it passes. I think the unit test does not actually build and run a full Drill server, which is why the connections are rejected.

dbw9580 avatar Aug 13 '20 17:08 dbw9580

@dbw9580 The ClusterTest class is supposed to start a Drill cluster so that you can execute queries. You should not need to have a Drill cluster running for the unit tests to complete.

I think the reason this isn't doing what you're expecting is that in the initIPFS function in IPFSTestSuit you are creating a plugin with a null configuration and hence isn't initializing correctly.

I stepped through testSimple() with the debugger and the dataset object is null, hence the test fails. My suspicion is that there is one small step being missed here. Could you please take a look and step through this to make sure that Drill is being initialized correctly? Thanks

cgivre avatar Aug 13 '20 18:08 cgivre

@cgivre Does Drill support connections from IPv6 sockets? Is it enabled by default or do I have to toggle some configuration items? The "protocol family unavailable" error could be due to lack of support for IPv6.

dbw9580 avatar Aug 14 '20 09:08 dbw9580

@dbw9580 I believe Drill does support connections from IPv6 sockets. There was a recent PR for this in fact: (https://github.com/apache/drill/pull/1857) but I'm not sure if that is directly relevant. Were you able to get it working?

cgivre avatar Aug 14 '20 12:08 cgivre

The connection rejected: /127.0.0.1:31011 failure was because sometimes Drill does not bind to the default ports (31010, 31011, 31012). It can bind to later ports like 31013, 31014, 31015, hence the connection was rejected.

I believe the reason Drill didn't bind to the default ports is that those ports was used by the process from the last test run and had not been recycled by the system. If I wait for a minute or two before starting another round of testing, it's likely the test will pass.

This is part of DRILL-7754, but I haven't come up with a plan to reliably store the ports info in IPFS.

dbw9580 avatar Aug 14 '20 12:08 dbw9580

@dbw9580 I believe Drill does support connections from IPv6 sockets. There was a recent PR for this in fact: (#1857) but I'm not sure if that is directly relevant. Were you able to get it working?

I don't see Drill binding to any IPv6 address in ss -6ltnp. I blocked IPv6 addresses in 9494a30 and the tests are now passing (most of the time, due to https://github.com/apache/drill/pull/2084#issuecomment-674045895).

dbw9580 avatar Aug 14 '20 12:08 dbw9580

@dbw9580 The unit tests are passing now on my machine.

cgivre avatar Aug 14 '20 13:08 cgivre

@cgivre I tried to set the ports to their default values in c090a43, but it did not seem to do the trick. Why is that?

dbw9580 avatar Aug 14 '20 13:08 dbw9580