elasticsearch-river-web
Error during Elasticsearch startup
[2014-07-23 13:02:51,591][WARN ][org.apache.tika.mime.MimeTypesReader] Invalid media type configuration entry: application/dita+xml;format=map
org.apache.tika.mime.MimeTypeException: Invalid media type name: application/dita+xml;format=map
at org.apache.tika.mime.MimeTypes.forName(MimeTypes.java:367)
at org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:152)
at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:139)
at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:122)
at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:56)
at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:68)
at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:79)
at org.seasar.robot.helper.impl.MimeTypeHelperImpl.
Which version of Elasticsearch are you using? And could you check: $ ls $ES_HOME/plugins/*/tika*
The Elasticsearch version is 1.2.1 and the Tika version is 1.4 (tika-core-1.4).
If you install river-web 1.2, you can find:
plugins/river-web/tika-core-1.5.jar
Which plugin has tika 1.4?
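Listing the Tika jars bundled by each installed plugin should show where the 1.4 jar comes from (the path below assumes the default $ES_HOME layout):

# list every Tika jar shipped by an installed plugin
$ ls -l $ES_HOME/plugins/*/tika*.jar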
I have reinstalled the river-web plugin, and now it starts perfectly. I ran the crawl using the data from issue 18 with incremental set to false. However, I still got "properties" : {} as empty when I ran the following URL:
curl -XGET localhost:9200/compassion_uat/compassion_web/_mapping?pretty
How can I check whether the crawl data was inserted into the Elasticsearch index or not?
Did you try the following example? https://github.com/codelibs/elasticsearch-river-web#register-crawl-data
How can I check whether the crawl data was inserted into the Elasticsearch index or not?
Please use Elasticsearch's Search API.
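For example, a minimal match_all query against your index should return whatever documents the crawler has stored (the index and type names below are taken from your earlier _mapping command):

# return any documents the crawler has indexed so far
curl -XGET 'localhost:9200/compassion_uat/compassion_web/_search?pretty' -d '{
  "query" : { "match_all" : {} }
}'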
I tried it. It's working perfectly...
Thanks a lot...
But when I changed the URL to "https://www.google.com" or any other HTTPS site, crawling does not work.
Please help me resolve this issue.
It's better to check Elasticsearch's log files.
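If you run the tar.gz distribution, the log file is usually under $ES_HOME/logs; the file name below assumes the default cluster name:

# follow the Elasticsearch log while the river job runs
tail -f $ES_HOME/logs/elasticsearch.log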
When I use the HTTP URL, the log files show that the URL has been picked up by the robot client. Please see the logs below:
[2014-07-28 11:25:00,000][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-28 11:25:02,001][WARN ][org.seasar.framework.container.assembler.BindingTypeShouldDef] Skip setting property, because property(requestListener) of org.seasar.robot.client.FaultTolerantClient not found
[2014-07-28 11:25:02,101][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://localhost:8080/SampleRestTemplate/index.html
[2014-07-28 11:25:02,105][INFO ][org.seasar.robot.client.http.HcHttpClient] Checking URL: http://localhost:8080/robots.txt
[2014-07-28 11:25:04,000][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-28 11:25:06,000][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-28 11:25:08,000][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-28 11:25:08,290][INFO ][cluster.metadata ] [Bloodlust] [[_river]] remove_mapping [[my_web]]
But when I configure the HTTPS site, the logs don't show any crawling URL being picked up; the job is simply running:
[2014-07-28 11:25:39,468][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] Creating WebRiver: my_web
[2014-07-28 11:25:39,470][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] Scheduling CrawlJob...
[2014-07-28 11:25:39,494][INFO ][cluster.metadata ] [Bloodlust] [_river] update_mapping my_web
[2014-07-28 11:25:40,001][WARN ][org.seasar.framework.container.assembler.BindingTypeShouldDef] Skip setting property, because property(requestListener) of org.seasar.robot.client.FaultTolerantClient not found
[2014-07-28 11:25:42,001][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-28 11:25:44,000][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-28 11:25:46,001][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-28 11:25:48,000][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-28 11:25:50,001][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-28 11:25:52,000][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-28 11:25:54,000][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-28 11:25:56,000][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-28 11:25:58,000][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-28 11:26:00,001][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-28 11:26:02,006][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-28 11:26:04,000][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-28 11:26:06,000][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-28 11:26:08,000][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
I'm using Elasticsearch 1.0.2 and river-web 1.1.0.
Could you attach the river registration command you ran?
I'm using a REST client to run the commands. Please find below the steps that I followed.
I used some other HTTPS site than www.google.com.
Create an index: http://localhost:9200/webindex
Mapping: http://localhost:9200/webindex/my_web/_mapping

{
  "my_web" : {
    "dynamic_templates" : [
      {
        "url" : {
          "match" : "url",
          "mapping" : { "type" : "string", "store" : "yes", "index" : "not_analyzed" }
        }
      },
      {
        "method" : {
          "match" : "method",
          "mapping" : { "type" : "string", "store" : "yes", "index" : "not_analyzed" }
        }
      },
      {
        "charSet" : {
          "match" : "charSet",
          "mapping" : { "type" : "string", "store" : "yes", "index" : "not_analyzed" }
        }
      },
      {
        "mimeType" : {
          "match" : "mimeType",
          "mapping" : { "type" : "string", "store" : "yes", "index" : "not_analyzed" }
        }
      }
    ]
  }
}
Register crawl data: http://localhost:9200/_river/my_web/_meta

{
  "type" : "web",
  "crawl" : {
    "index" : "webindex",
    "url" : ["https://www.google.com"],
    "includeFilter" : ["https://www.google.com/.*"],
    "maxDepth" : 3,
    "maxAccessCount" : 100,
    "numOfThread" : 5,
    "interval" : 1000,
    "robotsTxt" : false,
    "target" : [
      {
        "pattern" : {
          "url" : "https://www.google.com/.*",
          "mimeType" : "text/html"
        },
        "properties" : {
          "title" : { "text" : "title" },
          "body" : { "text" : "body" },
          "bodyAsHtml" : { "html" : "body" },
          "projects" : { "text" : "ul.nav-list li a", "isArray" : true }
        }
      }
    ]
  },
  "schedule" : {
    "cron" : "*/2 * * * * ?"
  }
}
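For reference, translated to curl the three steps above would look roughly like this (the HTTP methods and the @file names are my guesses at what the REST client sends; mapping.json and river.json would hold the JSON bodies shown above):

# Step 1: create the index
curl -XPUT 'http://localhost:9200/webindex'

# Step 2: put the type mapping (mapping.json holds the dynamic_templates JSON above)
curl -XPUT 'http://localhost:9200/webindex/my_web/_mapping' -d @mapping.json

# Step 3: register the river (river.json holds the crawl settings above)
curl -XPUT 'http://localhost:9200/_river/my_web/_meta' -d @river.json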
Let me know if you need any other details
"includeFilter" : ["https://www.google.com/.*"],
How about the following setting?
"includeFilter" : ["https://www.google.com.*"],
Great... Now it has started crawling, but I got "connection refused". Does a proxy need to be set?
Please see the logs below:
[2014-07-29 16:56:46,067][WARN ][org.seasar.framework.container.assembler.BindingTypeShouldDef] Skip setting property, because property(requestListener) of org.seasar.robot.client.FaultTolerantClient not found
[2014-07-29 16:56:46,309][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: https://www.google.com
[2014-07-29 16:56:46,339][INFO ][org.seasar.robot.client.http.HcHttpClient] Checking URL: https://www.google.com/robots.txt
[2014-07-29 16:56:48,002][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-29 16:56:50,001][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-29 17:01:40,001][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-29 17:01:42,001][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-29 17:01:44,001][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-29 17:01:46,000][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-29 17:01:48,002][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-29 17:01:50,000][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-29 17:01:52,001][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-29 17:01:54,002][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-29 17:01:55,579][INFO ][org.seasar.robot.client.http.HcHttpClient] Could not process https://www.google.com/robots.txt. Connection to https://www.google.com refused
[2014-07-29 17:01:56,000][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-29 17:01:58,002][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
[2014-07-29 17:02:00,000][INFO ][org.codelibs.elasticsearch.web.river.WebRiver] web.my_webJob is running.
But I got "connection refused". Does a proxy need to be set?
I think it depends on your network environment. If Google checks the User-Agent, your crawling request may be refused.
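If the User-Agent turns out to be the problem, I believe the crawl block accepts a userAgent setting; treat the field name and example value below as an assumption from memory of the river-web README rather than a verified config. The rest of the crawl settings stay as you registered them:

"crawl" : {
  "url" : ["https://www.google.com"],
  "includeFilter" : ["https://www.google.com.*"],
  "userAgent" : "Mozilla/5.0 (compatible; river-web-crawler)"
}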
I'm using robotsTxt as false, so it should ignore robots.txt. Am I right?
By "depends on your network environment", do you mean a firewall or a proxy?