gitbucket-fess-plugin
[question] What kind of HTTP request method is used for file crawling?
plugin version
1.3.1
gitbucket version
4.20
what is the matter
Under a proxy environment, I can't get content from files, but I can get issues and wikis. fess-crawler.log is as follows:
# file crawling log
2018-02-13 18:12:32,511 [5DFNjmEBO7Desvq7XhyO-1] INFO Get a content from http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge
2018-02-13 18:12:35,028 [5DFNjmEBO7Desvq7XhyO-1] WARN Failed to access to http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e
org.codelibs.fess.crawler.exception.CrawlingAccessException: Failed to parse http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true
at org.codelibs.fess.helper.DocumentHelper.processRequest(DocumentHelper.java:184) ~[classes/:?]
at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.storeFileContent(GitBucketDataStoreImpl.java:291) ~[classes/:?]
at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.lambda$storeData$4713(GitBucketDataStoreImpl.java:134) ~[classes/:?]
at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.crawlFileContents(GitBucketDataStoreImpl.java:441) [classes/:?]
at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.crawlFileContents(GitBucketDataStoreImpl.java:447) [classes/:?]
at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.crawlFileContents(GitBucketDataStoreImpl.java:447) [classes/:?]
at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.storeData(GitBucketDataStoreImpl.java:124) [classes/:?]
at org.codelibs.fess.ds.impl.AbstractDataStoreImpl.store(AbstractDataStoreImpl.java:106) [classes/:?]
at org.codelibs.fess.helper.DataIndexHelper$DataCrawlingThread.process(DataIndexHelper.java:236) [classes/:?]
at org.codelibs.fess.helper.DataIndexHelper$DataCrawlingThread.run(DataIndexHelper.java:222) [classes/:?]
Caused by: org.codelibs.fess.crawler.exception.MultipleCrawlingAccessException:
Failed to access to http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true;
Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)):
http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true;
Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)):
http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true;
Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)):
http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true;
Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)):
http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true;
Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)):
http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true
at org.codelibs.fess.crawler.client.FaultTolerantClient.execute(FaultTolerantClient.java:95) ~[fess-crawler-2.0.1.jar:?]
at org.codelibs.fess.helper.DocumentHelper.processRequest(DocumentHelper.java:148) ~[classes/:?]
... 9 more
# issue crawl log
2018-02-13 18:43:02,794 [5DFNjmEBO7Desvq7XhyO-1] INFO Get a content from http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/issues/17
On Linux, both requests seem to return the same result.
# file request
curl http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/README.md
{"message":"Requires authentication"}
# issue request
curl http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/issues/21
{"message":"Requires authentication"}
I think it may be a problem with the proxy settings (the proxy discards the file request). I would like to know what HTTP request the file crawl API uses.
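For what it's worth, whether these curl tests went through the proxy at all depends on the shell's proxy environment variables, so a generic check (not specific to this setup) is:
# see whether curl would pick up a proxy from the environment
$ env | grep -i proxy
# -v shows whether the request is sent to a proxy or directly to gitbucket:8080
$ curl -v http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/README.md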
thanks.
I'm not sure that your problem is caused by the proxy but could you try the following command?
$ curl -H "Authorization: token <token>" "http://localhost:8080/gitbucket/api/v3/repos/<user name>/<repository name>/contents/<file name>?ref=<commit hash>&large_file=true"
The value <token> is the one generated by GitBucket here.
The value <commit hash> is b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e in your case.
It can be obtained by:
$ curl -H "Authorization: token <token>" "http://localhost:8080/gitbucket/api/v3/repos/<user name>/<repository name>/git/refs/heads/master"
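If the refs response follows the GitHub-style shape (an object.sha field), the commit hash can be extracted with jq, for example (jq is a generic JSON tool, not part of Fess):
# print only the commit hash from the refs response (assumes an "object.sha" field)
$ curl -s -H "Authorization: token <token>" "http://localhost:8080/gitbucket/api/v3/repos/<user name>/<repository name>/git/refs/heads/master" | jq -r '.object.sha'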
If you want to learn more about how Fess fetches files, see GitBucketDataStoreImpl.java.
Thanks @kw-udon. I got a response when I ran the command you pointed out.
# curl -H "Authorization: token 284530a64e55176f9ed9*********" "http://gitbucket:8080/gitbucket/api/v3/repos/root/name/contents/hoge?ref=efcd9adbec49f73f762b7b2127153593024e4bea&large_file=true"
{"type":"file","name":"hoge","path":"hoge","sha":"efcd9adbec49f73f762b7b2127153593024e4bea","content":"IyBBcHAgYXJ0aWZhY3RzCi9fYnVpbGQKLLmV4cw==","encoding":"base64","download_url":"http://gitbucket:8080/gitbucket/api/v3/repos/root/name/raw/efcd9adbec49f73f762b7b2127153593024e4bea/hoge"}
So the proxy didn't discard the request; the connection was refused instead.
A MultipleCrawlingAccessException occurred in your log file, but I don't know what can raise this exception.
Do you have any idea @marevol?
Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)):
The cause is the line above. It's a network problem. I think the problem is a proxy setting or something similar.
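One way to check whether the proxy is the hop that refuses the connection is to send the same request through it explicitly with curl's standard -x option (the proxy host/port placeholders below are whatever Fess is configured with, not values from this thread):
# force the request through the proxy, i.e. the same path the crawler takes
$ curl -v -x http://<proxy host>:<proxy port> -H "Authorization: token <token>" "http://gitbucket:8080/gitbucket/api/v3/repos/<user name>/<repository name>/contents/<file name>?ref=<commit hash>&large_file=true"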
@marevol @kw-udon There is only one crawler that crawls GitBucket. How can I get a detailed log of the request (like the curl one above) that is executed when crawling starts?
https://github.com/codelibs/fess/issues/1073#issuecomment-304397187
@marevol thanks! I changed the crawl log level from info to debug; fess-crawler.log is as follows.
- fess-crawler.log
2018-02-15 14:15:37,744 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Accessing http://gitbucket:8080/gitbucket/api/v3/repos/user/repo/contents/hoge?ref=37cce0819cdf0a357e0b5e9bc373030dbfa84cd6&large_file=true
2018-02-15 14:15:37,745 [5DFNjmEBO7Desvq7XhyO-1] DEBUG CookieSpec selected: default
2018-02-15 14:15:37,746 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Connection request: [route: {}->http://gitbucket:8080][total kept alive: 0; route allocated: 0 of 20; total allocated: 0 of 200]
2018-02-15 14:15:37,746 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Connection leased: [id: 1][route: {}->http://gitbucket:8080][total kept alive: 0; route allocated: 1 of 20; total allocated: 1 of 200]
2018-02-15 14:15:37,746 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Opening connection {}->http://gitbucket:8080
2018-02-15 14:15:37,746 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Connecting to gitbucket/IP:8080
2018-02-15 14:15:37,747 [5DFNjmEBO7Desvq7XhyO-1] DEBUG http-outgoing-1: Shutdown connection
2018-02-15 14:15:37,747 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Connection discarded
2018-02-15 14:15:37,748 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Connection released: [id: 1][route: {}->http://gitbucket:8080][total kept alive: 0; route allocated: 0 of 20; total allocated: 0 of 200]
2018-02-15 14:15:37,748 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Cancelling request execution
2018-02-15 14:15:37,748 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Failed to access to http://gitbucket:8080/gitbucket/api/v3/repos/user/repo/contents/hoge?ref=37cce0819cdf0a357e0b5e9bc373030dbfa84cd6&large_file=true
org.codelibs.fess.crawler.exception.CrawlingAccessException: Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)): http://gitbucket:8080/gitbucket/api/v3/repos/user/repo/contents/hoge?ref=37cce0819cdf0a357e0b5e9bc373030dbfa84cd6&large_file=true
at org.codelibs.fess.crawler.client.http.HcHttpClient.processHttpMethod(HcHttpClient.java:820) ~[fess-crawler-2.0.1.jar:?]
at org.codelibs.fess.crawler.client.http.HcHttpClient.doHttpMethod(HcHttpClient.java:623) ~[fess-crawler-2.0.1.jar:?]
at org.codelibs.fess.crawler.client.http.HcHttpClient.doGet(HcHttpClient.java:582) ~[fess-crawler-2.0.1.jar:?]
at org.codelibs.fess.crawler.client.AbstractCrawlerClient.execute(AbstractCrawlerClient.java:142) ~[fess-crawler-2.0.1.jar:?]
at org.codelibs.fess.crawler.client.FaultTolerantClient.execute(FaultTolerantClient.java:67) ~[fess-crawler-2.0.1.jar:?]
at org.codelibs.fess.helper.DocumentHelper.processRequest(DocumentHelper.java:148) ~[classes/:?]
at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.storeFileContent(GitBucketDataStoreImpl.java:291) ~[classes/:?]
at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.lambda$storeData$4713(GitBucketDataStoreImpl.java:134) ~[classes/:?]
at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.crawlFileContents(GitBucketDataStoreImpl.java:441) [classes/:?]
at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.crawlFileContents(GitBucketDataStoreImpl.java:447) [classes/:?]
at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.storeData(GitBucketDataStoreImpl.java:124) [classes/:?]
at org.codelibs.fess.ds.impl.AbstractDataStoreImpl.store(AbstractDataStoreImpl.java:106) [classes/:?]
at org.codelibs.fess.helper.DataIndexHelper$DataCrawlingThread.process(DataIndexHelper.java:236) [classes/:?]
at org.codelibs.fess.helper.DataIndexHelper$DataCrawlingThread.run(DataIndexHelper.java:222) [classes/:?]
Caused by: org.apache.http.conn.HttpHostConnectException: Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:159) ~[httpclient-4.5.4.jar:4.5.4]
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:373) ~[httpclient-4.5.4.jar:4.5.4]
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381) ~[httpclient-4.5.4.jar:4.5.4]
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237) ~[httpclient-4.5.4.jar:4.5.4]
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[httpclient-4.5.4.jar:4.5.4]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_161]
at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_161]
at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75) ~[httpclient-4.5.4.jar:4.5.4]
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) ~[httpclient-4.5.4.jar:4.5.4]
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:373) ~[httpclient-4.5.4.jar:4.5.4]
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381) ~[httpclient-4.5.4.jar:4.5.4]
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237) ~[httpclient-4.5.4.jar:4.5.4]
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[httpclient-4.5.4.jar:4.5.4]
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.4.jar:4.5.4]
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111) ~[httpclient-4.5.4.jar:4.5.4]
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.4.jar:4.5.4]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.4.jar:4.5.4]
at org.codelibs.fess.crawler.client.http.HcHttpClient.executeHttpClient(HcHttpClient.java:852) ~[fess-crawler-2.0.1.jar:?]
at org.codelibs.fess.crawler.client.http.HcHttpClient.processHttpMethod(HcHttpClient.java:660) ~[fess-crawler-2.0.1.jar:?]
... 13 more
...
2018-02-15 14:15:42,103 [CoreLib-TimeoutManager] DEBUG Closing expired connections
2018-02-15 14:15:42,105 [CoreLib-TimeoutManager] DEBUG Closing connections idle longer than 60000 MILLISECONDS
From this log, the connection appears to be dropped because of a connection timeout or connection refused. I also changed GitBucket's logback-setting.xml as follows, but no application log was found.
- logback-setting.xml
<configuration debug="true" scan="true" scanPeriod="60 seconds">
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <!-- encoders are by default assigned the type ch.qos.logback.classic.encoder.PatternLayoutEncoder -->
    <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
      <level>INFO</level>
    </filter>
    <encoder>
      <pattern> %date %-4relative [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <appender name="ROLLING" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <!-- encoders are by default assigned the type ch.qos.logback.classic.encoder.PatternLayoutEncoder -->
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
      <!-- roll over daily and compress -->
      <fileNamePattern>/gitbucket/log/gitbucket-%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
      <!-- compressed logs are kept for 30 days and then deleted -->
      <maxHistory>30</maxHistory>
      <timeBasedFileNamingAndTriggeringPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedFNATP">
        <maxFileSize>25MB</maxFileSize>
      </timeBasedFileNamingAndTriggeringPolicy>
    </rollingPolicy>
    <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
      <level>INFO</level>
    </filter>
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} %-4relative [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="DEBUG">
    <appender-ref ref="STDOUT"/>
    <appender-ref ref="ROLLING"/>
  </root>
</configuration>
any ideas?
Did you configure proxy settings? See https://github.com/codelibs/fess/issues/1066
@marevol yes. I configured the proxy settings in fess_config.properties:
http.proxy.host=proxy_IP
http.proxy.port=proxy_port
http.proxy.username=
http.proxy.password=
- my proxy does not authenticate users.
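For reference, whether that proxy can resolve and reach gitbucket:8080 can also be checked from the proxy server itself with generic tools (getent and nc; nothing Fess-specific, and the commands below are only a sketch):
# on the proxy server: does the internal hostname resolve?
$ getent hosts gitbucket
# on the proxy server: does gitbucket:8080 accept TCP connections?
$ nc -vz gitbucket 8080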