Migrate dataset reader code from (scala) DeepQA Experiments to DeepQA

Open liyi193328 opened this issue 7 years ago • 9 comments

First, many thanks for this great project, which is exactly what I would like to work on; I'll keep watching and using it, and hopefully contribute to it.

However, when I tried to run some pipelines from scratch, I found that the data preprocessing steps live in another project, https://github.com/allenai/deep_qa_experiments, and that project's code is in scala.

I think having the preprocessing steps in a separate project makes things complicated for someone who wants to get started quickly.

liyi193328 avatar Apr 29 '17 14:04 liyi193328

Yes, as you can see in the README, the data processing code is currently in the scala library. That is for historical reasons. When we write new data processing code, it will almost certainly be in the python library.

However, it's not very high priority for us to migrate the data processing code, because we already have all of the data processed, we know how to use the scala library easily enough, and we have a lot of other things on our plate. This is a great place where contributions would be much appreciated.

For anyone who wants to contribute to this, it's as simple as taking a (scala) DatasetReader from the DeepQA Experiments library and converting it to a python script in the dataset_readers module in DeepQA. Most of these dataset readers are pretty simple, so it shouldn't take that much work to do this (the SquadSentenceSelectionReader script is complicated because it has fancy logic for mixing up the data in interesting ways. The corresponding reader for the standard SQuAD task is much simpler.)
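As a rough illustration of what such a conversion might look like, here is a minimal sketch of a python reader for the standard SQuAD task. It relies only on the published SQuAD v1.1 JSON layout (`data` → `paragraphs` → `qas` → `answers`). The function names (`iter_squad_instances`, `convert_squad_file`) and the JSON-lines output format are assumptions made for this example; they are not the format the scala reader produces or the API of the dataset_readers module.

```python
import json
from typing import Dict, Iterator


def iter_squad_instances(dataset: dict) -> Iterator[Dict[str, object]]:
    """Yield one flat instance per (question, answer) pair in a parsed SQuAD v1.1 dataset."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            passage = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield {
                        "id": qa["id"],
                        "question": qa["question"],
                        "passage": passage,
                        "answer": answer["text"],
                        # Character offset of the answer within the passage.
                        "answer_start": answer["answer_start"],
                    }


def convert_squad_file(in_path: str, out_path: str) -> int:
    """Read a SQuAD JSON file and write one JSON object per line (hypothetical output format).

    Returns the number of instances written.
    """
    with open(in_path, encoding="utf-8") as f:
        dataset = json.load(f)
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for instance in iter_squad_instances(dataset):
            out.write(json.dumps(instance) + "\n")
            count += 1
    return count
```

A real port would replace the JSON-lines writer with whatever instance format the corresponding DeepQA model expects.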

If you just want to use the DeepQA Experiments library to get the data for you, the easiest way to do so is probably like this (steps shown for SQuAD, but are similar for other datasets):

  1. Download the dataset from wherever it lives (for SQuAD, that's here). Extract it if it's some kind of archive file.
  2. Modify the path in the experiment code to point to where you downloaded the files.
  3. Run the following from a terminal, in the base directory for DeepQA Experiments:
sbt console
scala> import com.mattg.util.FileUtil
scala> import org.allenai.deep_qa.pipeline.DatasetStep
scala> import org.allenai.deep_qa.experiments.datasets.SquadDatasets
scala> DatasetStep.create(SquadDatasets.trainDataset, new FileUtil).runPipeline()

And you can repeat that last step for the dev set, or for any other dataset you want to process.

matt-gardner avatar Apr 29 '17 15:04 matt-gardner

@matt-gardner I'll try these steps. Thanks.

liyi193328 avatar Apr 30 '17 05:04 liyi193328

@matt-gardner when I do sbt console, it complains:

[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: com.clearnlp#clearnlp;2.0.3-allenai: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn]
[warn] Note: Unresolved dependencies path:
[warn] com.clearnlp:clearnlp:2.0.3-allenai
[warn] +- org.allenai.openie:openie_2.11:4.2.6 (E:\active_project\deep_qa_experiments\build.sbt#L33-54)
[warn] +- org.allenai:deep-qa_2.11:0.2.5
[trace] Stack trace suppressed: run last :update for the full output.
[error] (:update) sbt.ResolveException: unresolved dependency: com.clearnlp#clearnlp;2.0.3-allenai: not found
[error] Total time: 48 s, completed 2017-5-2 14:10:47

Is something missing? Thanks.

liyi193328 avatar May 02 '17 07:05 liyi193328

Oh, yeah, sorry about that. I forgot about that dependency. I just removed it, so it should work now. Can you update your repo and try again?

matt-gardner avatar May 02 '17 15:05 matt-gardner

@matt-gardner After updating, the dependency is resolved, but now there is another issue:

[warn] Credentials file C:\Users\liyi1.bintray.credentials does not exist
[info] Compiling 1 protobuf files to E:\active_project\deep_qa_experiments\target\scala-2.11\src_managed\main\compiled_protobuf
[info] Compiling schema E:\active_project\deep_qa_experiments\src\main\protobuf\message.proto
protoc-jar: protoc version: 300, detected platform: windows 10/amd64
protoc-jar: executing: [C:\Users\liyi1\AppData\Local\Temp\protoc7978892183318687115.exe, --plugin=protoc-gen-scala=C:\Users\liyi1\AppData\Local\Temp\scalapbgen3603827797058091070.bat, -IE:\active_project\deep_qa_experiments\src\main\protobuf, -IE:\active_project\deep_qa_experiments\target\protobuf_external, --scala_out=grpc:E:\active_project\deep_qa_experiments\target\scala-2.11\src_managed\main\compiled_protobuf, E:\active_project\deep_qa_experiments\src\main\protobuf\message.proto]
Traceback (most recent call last):
[trace] Stack trace suppressed: run last protobuf:protobufGenerate for the full output.
[error] (protobuf:protobufGenerate) protoc returned exit code: 1
File "C:\Users\liyi1\AppData\Local\Temp\scalapbgen6943038223903542987.py", line 6, in
[error] Total time: 1 s, completed 2017-5-3 17:02:24
s.sendall(content)

TypeError: a bytes-like object is required, not 'str'

I'm not familiar with these errors. Does the project require Linux or macOS? Is Windows not supported? Thanks.

liyi193328 avatar May 03 '17 09:05 liyi193328

Yeah, I have no idea what's going on there. I think the only thing I had to install to get the protobuf stuff to work was this: pip install grpcio grpcio-tools pyhocon, which shouldn't be affecting this step. Can you get the full stack trace with last protobuf:protobufGenerate? Also try installing those python libraries, just to see if that fixes the issue.

This doesn't look like it's a windows issue to me, but even if we figure this out, I think the rest of the code has various places where / is hard-coded instead of using OS-independent paths, so I think you'll have a hard time. We know this runs on linux and macOS, but haven't run it on windows.
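To illustrate the kind of portability problem meant here, the following sketch contrasts a hard-coded `/` with OS-independent path joining. It is written in python purely for illustration (the affected code is scala), and the helper `join_hard_coded` and the file names are made up for this example.

```python
from pathlib import PurePosixPath, PureWindowsPath


def join_hard_coded(base: str, name: str) -> str:
    """Join two path components with a hard-coded forward slash."""
    return base + "/" + name


# With a Windows-style base directory, the hard-coded join mixes separators.
mixed = join_hard_coded("E:\\data", "train.tsv")

# The pure-path classes apply each platform's own separator; pathlib.Path
# picks the right flavor automatically for the OS the code runs on.
windows_path = PureWindowsPath("E:\\data") / "train.tsv"
posix_path = PurePosixPath("/data") / "train.tsv"

print(mixed)
print(windows_path)
print(posix_path)
```

Code that later splits or compares such paths as strings is where mixed separators tend to cause trouble.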

Another thing to consider is that at this point, it probably is less work to translate the ~50 lines of scala code in the dataset reader into python than it is to figure out what's going on here.

matt-gardner avatar May 03 '17 15:05 matt-gardner

@matt-gardner That's very kind of you. I was using python3.5 in my environment, which caused this error. After fixing that, I have another question: is com.mattg.util.FileUtil part of another project, https://github.com/matt-gardner/util? If so, I can't import it without that dependency. Thanks. I also plan to contribute python3 code when I have some time.
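For context on the TypeError reported earlier in the thread: in Python 3, `socket.sendall` accepts only bytes-like objects, so passing a `str` raises exactly this kind of error, whereas Python 2 strings were already byte strings. The contents of the generated scalapbgen script are not shown in the thread, so the sketch below only assumes that it passed text to `sendall`; the helper name `send_text` is made up for this example.

```python
import socket


def send_text(sock: socket.socket, content) -> None:
    """Send content over a socket, encoding str to bytes first (required on Python 3)."""
    if isinstance(content, str):
        content = content.encode("utf-8")
    sock.sendall(content)


if __name__ == "__main__":
    left, right = socket.socketpair()
    try:
        try:
            left.sendall("hello")  # str, not bytes: raises TypeError on Python 3
        except TypeError as err:
            print("sendall rejected str:", err)
        send_text(left, "hello")  # encoded to bytes before sending
        print(right.recv(16))
    finally:
        left.close()
        right.close()
```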

liyi193328 avatar May 04 '17 14:05 liyi193328

The util library is a dependency in the DeepQA Experiments library, and it's grabbed automatically when you run sbt from within that project. If you run sbt console from the root directory of where you cloned DeepQA Experiments, you should be able to import com.mattg.util.FileUtil without a problem.

matt-gardner avatar May 04 '17 15:05 matt-gardner

@matt-gardner I can run it now. Thanks for all the help.

liyi193328 avatar May 05 '17 15:05 liyi193328