NUTCH-2202 Integration of Anthelion (Focused Crawling Module) into Nutch

Open lewismc opened this issue 9 years ago • 15 comments

This is a first pass at integrating the Anthelion plugin into Nutch. @RobertMeusel, can you please scope this out? The first question I need answered is whether we need to ship Anthelion itself with Nutch. This patch does exactly that. It's important to state that this patch is a monster and I am not proposing it for a commit. Instead, feedback on it would be very much appreciated.

lewismc avatar Mar 09 '16 09:03 lewismc

@lewismc why don't we pull this into a branch? Then we can commit and update there.

chrismattmann avatar Apr 17 '16 22:04 chrismattmann

I agree with @chrismattmann, it would be good to have this patch. I am currently working on the same patch for the 2.x branch, and I will submit it once I am satisfied with the result. If you want, @lewismc, I can also test your patch against the master branch. If I remember correctly, I ran into library version problems on my last attempt.

ghost avatar Apr 19 '16 13:04 ghost

@jeremie70 the library version problems are what made us release the Nutch code in the repo as well, since this was one of the only versions that we (or in particular Petar) got working without any library conflicts.

RobertMeusel avatar Apr 19 '16 14:04 RobertMeusel

There are issues with Any23 which I've been working on. They are blocking issues. Head over to Any23 to find out about them if any of you are interested. Lewis

lewismc avatar Apr 19 '16 14:04 lewismc

@lewismc any blocking issues left in Any23?

HansBrende avatar Jan 26 '18 04:01 HansBrende

@RobertMeusel @HansBrende this is ready to be tested. I would also appreciate it if folks were able to VOTE on the current Any23 2.2 release candidate. Finally, I've resolved all conflicts, updated some licensing information, and removed the binary documentation resources, which are now hosted on the Nutch wiki instead.

lewismc avatar Jan 26 '18 19:01 lewismc

@lewismc I tried to vote, but I'm not sure if it went through. It's possible that my e-mail address isn't allowed.

HansBrende avatar Jan 26 '18 20:01 HansBrende

You can subscribe as described at http://any23.apache.org/mail-lists.html. Thank you for your support.

lewismc avatar Jan 26 '18 23:01 lewismc

@lewismc I can't get your NUTCH-2202 branch to build.

I'm doing:

git clone https://github.com/lewismc/nutch
cd nutch
git checkout NUTCH-2202
ant

which is giving me:

Buildfile: /Users/hansbrende/nutch/build.xml
Trying to override old definition of task javac
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-probe-antlib:

ivy-download:
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

. . .

init:
    [mkdir] Created dir: /Users/hansbrende/nutch/build/anthelion
    [mkdir] Created dir: /Users/hansbrende/nutch/build/anthelion/classes
    [mkdir] Created dir: /Users/hansbrende/nutch/build/anthelion/test
    [mkdir] Created dir: /Users/hansbrende/nutch/build/anthelion/test/lib
    [mkdir] Created dir: /Users/hansbrende/nutch/build/plugins/anthelion

init-plugin:

deps-jar:

init:

init-plugin:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /Users/hansbrende/nutch/ivy/ivysettings.xml

compile:

jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /Users/hansbrende/nutch/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: anthelion
    [javac] Compiling 34 source files to /Users/hansbrende/nutch/build/anthelion/classes
    [javac] /Users/hansbrende/nutch/src/plugin/anthelion/src/java/org/apache/nutch/anthelion/classifier/NutchOnlineClassifier.java:35: error: cannot find symbol
    [javac] import moa.core.InstancesHeader;
    [javac]                ^
    [javac]   symbol:   class InstancesHeader
    [javac]   location: package moa.core
    [javac] /Users/hansbrende/nutch/src/plugin/anthelion/src/java/org/apache/nutch/anthelion/framework/AnthOnlineClassifier.java:33: error: cannot find symbol
    [javac] import moa.core.InstancesHeader;
    [javac]                ^
    [javac]   symbol:   class InstancesHeader
    [javac]   location: package moa.core
    [javac] /Users/hansbrende/nutch/src/plugin/anthelion/src/java/org/apache/nutch/anthelion/mao/DataManipulationFilter.java:19: error: cannot find symbol
    [javac] import moa.core.InstancesHeader;
    [javac]                ^
    [javac]   symbol:   class InstancesHeader
    [javac]   location: package moa.core

. . .

    [javac] 46 errors

BUILD FAILED
/Users/hansbrende/nutch/build.xml:116: The following error occurred while executing this line:
/Users/hansbrende/nutch/src/plugin/build.xml:37: The following error occurred while executing this line:
/Users/hansbrende/nutch/src/plugin/build-plugin.xml:133: Compile failed; see the compiler error output for details.

Am I doing something wrong?
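
A `cannot find symbol` error on an import, like the one for `moa.core.InstancesHeader` above, usually means the class is absent from the jars that were resolved onto the compile classpath. The following is a minimal sketch for checking that directly. It is not part of the patch or of Nutch, and the class name `ClasspathProbe` and method `resolvable` are hypothetical names chosen for illustration. It only reports which of the given fully qualified class names can be loaded from the classpath it is run with.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Nutch or Anthelion): reports which class
// names can be loaded from the current classpath.
public class ClasspathProbe {

    /** Returns the subset of classNames that the class loader can resolve. */
    public static List<String> resolvable(String... classNames) {
        List<String> found = new ArrayList<>();
        ClassLoader loader = ClasspathProbe.class.getClassLoader();
        for (String name : classNames) {
            try {
                // initialize=false: only locate the class, do not run static initializers
                Class.forName(name, false, loader);
                found.add(name);
            } catch (ClassNotFoundException | LinkageError e) {
                // Not loadable from this classpath; leave it out of the result.
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Defaults to the class named in the build log above.
        String[] names = args.length > 0 ? args : new String[] {"moa.core.InstancesHeader"};
        System.out.println("Resolvable: " + resolvable(names));
    }
}
```

Running it with the plugin's resolved jars on the classpath (the exact jar locations depend on the Ivy configuration and are not shown here) tells you whether the class the code imports is present at all.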

HansBrende avatar Jan 27 '18 00:01 HansBrende

@lewismc I created a pull request for your NUTCH-2202 branch which fixes the build failures. Please let me know if I need to change anything else.

HansBrende avatar Feb 02 '18 22:02 HansBrende

Hi @HansBrende I appreciate it, I was unable to get to this for a while. I've merged and pushed your changes. I think we need further review before we consider merging. Also, I have a feeling that the weka library has a non-compliant license. We have some investigation to do.

lewismc avatar Feb 03 '18 02:02 lewismc

Hi,

Is it possible to use Anthelion with Nutch 1.16 version? Are we doing any work on this currently?

nitin-course5 avatar Nov 23 '19 11:11 nitin-course5

Hi @nitin-course5

> Is it possible to use Anthelion with Nutch 1.16 version?

It looks like this branch has some conflicts, but they should be trivial to fix. If you are able to do that, then by all means please do.

> Are we doing any work on this currently?

No, no one is working on this currently.

lewismc avatar Nov 23 '19 17:11 lewismc

Hi @lewismc, the first thing I started on is making Anthelion compatible with Any23 2.2 or 2.3, primarily to fix this issue: Caused by: org.openrdf.rio.RDFParseException: org.xml.sax.SAXParseException; lineNumber: 278; columnNumber: 7; The element type "hr" must be terminated by the matching end-tag "</hr>". I must admit that I have been stuck for the past day despite my best efforts. Do you have any pointers or suggestions, please?
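
The `SAXParseException` message above is what a strict XML parser reports when it meets an HTML-style unclosed `<hr>` element. The sketch below reproduces that class of error with the JDK's built-in parser. It is an illustration only, not Any23 or Anthelion code, and the class name `StrictXmlCheck` and method `isWellFormedXml` are hypothetical. Whether HTML reaching an XML-based parser is the actual cause in this case is an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration (not Any23 or Anthelion code): shows that a strict
// XML parser rejects an unclosed <hr>, which is legal in HTML.
public class StrictXmlCheck {

    /** Returns true if markup parses as well-formed XML, false on a SAX parse error. */
    public static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // HTML-style void element: not well-formed XML.
        System.out.println(isWellFormedXml("<html><body><hr></body></html>"));
        // Self-closed element: well-formed XML.
        System.out.println(isWellFormedXml("<html><body><hr/></body></html>"));
    }
}
```

If the failing page is ordinary HTML, the line and column in the exception (278 and 7 here) should point at the first tag that is valid HTML but not valid XML, which is a useful place to start looking at how the content is being routed to the parser.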

nitin-course5 avatar Nov 24 '19 20:11 nitin-course5

> The first thing I started on is making Anthelion compatible with Any23 2.2 or 2.3 ...

Can you open a pull request? I would highly encourage you to use Any23 2.3.

> I must admit that I have been stuck for the past day despite my best efforts.

What are you stuck on? Please describe. Thanks

lewismc avatar Nov 25 '19 03:11 lewismc

Closing this off to clean up the PR list. We are planning on retiring Any23 soon.

lewismc avatar May 24 '23 16:05 lewismc