NUTCH-2202 Integration of Anthelion (Focused Crawling Module) into Nutch
This is a first pass at integrating the Anthelion plugin into Nutch. @RobertMeusel, can you please scope this out? The initial question I need answered is whether we need to ship Anthelion itself with Nutch. This patch does exactly that. It's important to state that this patch is a monster and I am not proposing it for a commit. Instead, feedback on it would be very much appreciated.
@lewismc why don't we pull this into a branch? Then we can commit and update there.
I agree with @chrismattmann, it would be good to have this patch. I am currently working on the same patch for the 2.x branch, and I will open a pull request for it once I'm satisfied with the result. If you want, I can test your patch for the master branch, @lewismc. If I remember correctly, I had some library version problems in my last attempt.
@jeremie70 the library version problems are why we released the Nutch code in the repo as well: it was one of the only versions that we (or, in particular, Petar) got working without any library conflicts.
There are issues with Any23 which I've been working on. They are blocking issues. Head over to Any23 to find out about them if any of you are interested. Lewis
On Tuesday, April 19, 2016, Robert wrote:
@jeremie70 the library version problems are why we released the Nutch code in the repo as well: it was one of the only versions that we (or, in particular, Petar) got working without any library conflicts.
@lewismc any blocking issues left in Any23?
@RobertMeusel @HansBrende this is ready to be tested. I would also appreciate it if folks were able to VOTE on the current Any23 2.2 release candidate. Finally, I've resolved all conflicts, updated some licensing information, and removed the binary documentation resources, which are now hosted on the Nutch wiki instead.
@lewismc I tried to vote, but I'm not sure if it went through. It's possible that my e-mail address isn't allowed.
You can subscribe as described here: http://any23.apache.org/mail-lists.html. Thank you for your support.
@lewismc I can't get your NUTCH-2202 branch to build.
I'm doing:
git clone https://github.com/lewismc/nutch
cd nutch
git checkout NUTCH-2202
ant
which is giving me:
Buildfile: /Users/hansbrende/nutch/build.xml
Trying to override old definition of task javac
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
ivy-probe-antlib:
ivy-download:
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
. . .
init:
[mkdir] Created dir: /Users/hansbrende/nutch/build/anthelion
[mkdir] Created dir: /Users/hansbrende/nutch/build/anthelion/classes
[mkdir] Created dir: /Users/hansbrende/nutch/build/anthelion/test
[mkdir] Created dir: /Users/hansbrende/nutch/build/anthelion/test/lib
[mkdir] Created dir: /Users/hansbrende/nutch/build/plugins/anthelion
init-plugin:
deps-jar:
init:
init-plugin:
clean-lib:
resolve-default:
[ivy:resolve] :: loading settings :: file = /Users/hansbrende/nutch/ivy/ivysettings.xml
compile:
jar:
clean-lib:
resolve-default:
[ivy:resolve] :: loading settings :: file = /Users/hansbrende/nutch/ivy/ivysettings.xml
compile:
[echo] Compiling plugin: anthelion
[javac] Compiling 34 source files to /Users/hansbrende/nutch/build/anthelion/classes
[javac] /Users/hansbrende/nutch/src/plugin/anthelion/src/java/org/apache/nutch/anthelion/classifier/NutchOnlineClassifier.java:35: error: cannot find symbol
[javac] import moa.core.InstancesHeader;
[javac] ^
[javac] symbol: class InstancesHeader
[javac] location: package moa.core
[javac] /Users/hansbrende/nutch/src/plugin/anthelion/src/java/org/apache/nutch/anthelion/framework/AnthOnlineClassifier.java:33: error: cannot find symbol
[javac] import moa.core.InstancesHeader;
[javac] ^
[javac] symbol: class InstancesHeader
[javac] location: package moa.core
[javac] /Users/hansbrende/nutch/src/plugin/anthelion/src/java/org/apache/nutch/anthelion/mao/DataManipulationFilter.java:19: error: cannot find symbol
[javac] import moa.core.InstancesHeader;
[javac] ^
[javac] symbol: class InstancesHeader
[javac] location: package moa.core
. . .
[javac] 46 errors
BUILD FAILED
/Users/hansbrende/nutch/build.xml:116: The following error occurred while executing this line:
/Users/hansbrende/nutch/src/plugin/build.xml:37: The following error occurred while executing this line:
/Users/hansbrende/nutch/src/plugin/build-plugin.xml:133: Compile failed; see the compiler error output for details.
Am I doing something wrong?
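The `cannot find symbol` errors above mean that javac could not see `moa.core.InstancesHeader` on the compile classpath. One way to narrow this down is to check whether the class is present in the jars that were actually resolved. The following is a minimal sketch of such a check, not anything from the Nutch or Anthelion build; the class name `ClasspathProbe` is hypothetical.

```java
// Hypothetical diagnostic helper; not part of Nutch or Anthelion.
class ClasspathProbe {

    /** Returns true if the named class can be located on the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the class named in the javac errors; other names can be passed as arguments.
        String[] names = args.length > 0 ? args : new String[] {"moa.core.InstancesHeader"};
        for (String name : names) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

Run it with the resolved jars on the classpath, for example `java -cp "build/plugins/anthelion/*:." ClasspathProbe` (the jar location is an assumption based on the directories created in the log). If the class is reported missing, the dependency that provides it is either absent or at a version that no longer ships that package.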
@lewismc I created a pull request for your NUTCH-2202 branch which fixes the build failures. Please let me know if I need to change anything else.
Hi @HansBrende I appreciate it, I was unable to get to this for a while. I've merged and pushed your changes. I think we need further review before we consider merging. Also, I have a feeling that the weka library has a non-compliant license. We have some investigation to do.
Hi,
Is it possible to use Anthelion with Nutch 1.16 version? Are we doing any work on this currently?
Hi @nitin-course5
Is it possible to use Anthelion with Nutch 1.16 version?
It looks like this branch has some conflicts but they will be trivial to fix. If you are able to do that then by all means please do.
Are we doing any work on this currently?
No, no-one is currently.
Hi @lewismc, the first thing I started on is making Anthelion compatible with Any23 2.2 or 2.3, primarily to fix this issue: Caused by: org.openrdf.rio.RDFParseException: org.xml.sax.SAXParseException; lineNumber: 278; columnNumber: 7; The element type "hr" must be terminated by the matching end-tag "</hr>". I must admit I have been stuck for the past day despite my best efforts. Do you have any pointers or suggestions, please?
The first thing I started on is making Anthelion compatible with Any23 2.2 or 2.3 ...
Can you open a pull request? I would highly encourage you to use Any23 2.3.
I must admit I have been stuck for the past day despite my best efforts.
What are you stuck with? Please describe. Thanks
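For context on the SAXParseException quoted above: it is the kind of error a strict XML parser raises when it is given HTML that contains an unclosed void element such as `<hr>`. The sketch below reproduces that class of error using only the JDK's built-in XML parser. It does not go through Any23's own parsing path, and the class name `VoidElementDemo` is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of Nutch, Anthelion, or Any23.
class VoidElementDemo {

    /** Parses the markup as XML; returns null on success, otherwise the parser's error message. */
    static String parseError(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors but does not print them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        } catch (Exception e) {
            throw new IllegalStateException("unexpected parser failure", e);
        }
    }

    public static void main(String[] args) {
        // HTML-style <hr> with no closing tag is not well-formed XML.
        System.out.println("html-style: " + parseError("<html><body><hr></body></html>"));
        // The self-closing form is well-formed, so no error is reported.
        System.out.println("xml-style:  " + parseError("<html><body><hr/></body></html>"));
    }
}
```

If the failing page is ordinary HTML, this suggests the content is reaching an XML-based parser somewhere in the pipeline, which may be a useful detail to include when describing where things get stuck.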
Closing this off to clean up the PR list. We are planning on retiring Any23 soon.