Fscrawler not connecting to workplace search over https
Describe the bug
Fscrawler works properly when pointing to Elasticsearch, but when pointing to Workplace Search it does not work. The output shows no errors; it says the folder is being crawled every 15 minutes, but nothing else happens.
Job Settings
name: "job_name"
fs:
url: "/tmp/es"
update_rate: "15m"
excludes:
- "*/~*"
json_support: false
filename_as_id: false
add_filesize: true
remove_deleted: true
add_as_inner_object: false
store_source: false
index_content: true
attributes_support: false
raw_metadata: false
xml_support: false
index_folders: true
lang_detect: false
continue_on_error: false
ocr:
language: "eng"
enabled: true
pdf_strategy: "ocr_and_text"
follow_symlinks: false
elasticsearch:
nodes:
- url: "https://elasticsearch:9200"
bulk_size: 100
flush_interval: "5s"
byte_size: "10mb"
username: "elastic"
password: "PASSWORD"
workplace_search:
access_token: "TOKEN"
key: "KEY"
server: "https://elasticsearch:3002"
Expected behavior
I'd expect the data to appear in workplace search.
Versions: CentOS 7, fscrawler-es7-2.7-SNAPSHOT, Elasticsearch and Workplace Search 7.9.0
If you downloaded the version from the Sonatype snapshots repository, that's expected, as the PR has not been merged yet.
Where did you download it from?
I've downloaded from sonatype: https://oss.sonatype.org/content/repositories/snapshots/fr/pilato/elasticsearch/crawler/fscrawler-es7/2.7-SNAPSHOT/
Sorry, I'm a bit confused. I went to https://fscrawler.readthedocs.io/en/wip-workplace_search/admin/fs/wpsearch.html and thought that the Workplace Search connection was a new feature in fscrawler 2.7. Should I download it from somewhere else? I'm reading that page but I don't see any other way. Thank you for your time!
The branch you are looking at is a WIP. I just generated a version and shared it here: https://www.dropbox.com/s/07msuwno2gw3noq/fscrawler-es7-2.7-SNAPSHOT.zip?dl=0
Thank you so much, now it's working, awesome work! Is there a way to bypass SSL verification, like curl's -k option?
What do you mean? Could you share the config file and the error?
I have a domain configured for the Elasticsearch connection in fscrawler and it works, but when I use the same domain for Workplace Search I'm getting:
15:19:47,128 ERROR [f.p.e.c.f.t.w.WPSearchEngine] javax.ws.rs.ProcessingException: javax.net.ssl.SSLHandshakeException: java.security.cert.CertificateException: No name matching
If I put the IP instead of the domain I'm getting: 15:08:29,954 ERROR [f.p.e.c.f.t.w.WPSearchEngine] javax.ws.rs.ProcessingException: javax.net.ssl.SSLHandshakeException: java.security.cert.CertificateException: No subject alternative names present
name: "job_name"
fs:
url: "/tmp/"
update_rate: "15m"
excludes:
- "*/~*"
json_support: false
filename_as_id: false
add_filesize: true
remove_deleted: true
add_as_inner_object: false
store_source: false
index_content: true
attributes_support: false
raw_metadata: false
xml_support: false
index_folders: true
lang_detect: false
continue_on_error: false
ocr:
language: "eng"
enabled: true
pdf_strategy: "ocr_and_text"
follow_symlinks: false
elasticsearch:
nodes:
- url: "https://elasticsearch:9200"
bulk_size: 100
flush_interval: "5s"
byte_size: "10mb"
username: "elastic"
password: "PASSWORD"
orkplace_search:
access_token: "TOKEN"
key: "KEY"
server: "https://elasticsearch:3002"
If I put the IP instead of the domain in the Elasticsearch connection I'm getting:
15:04:07,918 FATAL [f.p.e.c.f.c.FsCrawlerCli] We can not start Elasticsearch Client. Exiting. java.io.IOException: Host name '192.168.1.140' does not match the certificate subject provided by the peer (CN=elasticsearch)
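All three errors are hostname-verification failures: the certificate presented by the server does not list the hostname or IP the client is connecting to (in the Elasticsearch case it only carries CN=elasticsearch, and the Workplace Search certificate has no subject alternative names at all). If you want to confirm which names a certificate actually carries, here is a minimal, hypothetical Java sketch (keystore path and password are placeholders, not values from this thread) that dumps the subject and SANs from a PKCS12 keystore such as http.p12:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.KeyStore;
import java.security.cert.X509Certificate;
import java.util.Collections;

public class ShowCertNames {
    public static void main(String[] args) throws Exception {
        // Keystore path and password are assumptions - adjust to your setup.
        KeyStore ks = KeyStore.getInstance("PKCS12");
        try (InputStream is = Files.newInputStream(Paths.get("config/http.p12"))) {
            ks.load(is, "changeit".toCharArray());
        }
        for (String alias : Collections.list(ks.aliases())) {
            X509Certificate cert = (X509Certificate) ks.getCertificate(alias);
            if (cert == null) {
                continue;
            }
            System.out.println("Alias:   " + alias);
            System.out.println("Subject: " + cert.getSubjectX500Principal());
            // Returns null when the certificate has no SANs, which is exactly
            // what triggers "No subject alternative names present".
            System.out.println("SANs:    " + cert.getSubjectAlternativeNames());
        }
    }
}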
orkplace_search
this is a typo I guess?
How are you starting Elasticsearch and Workplace search?
Yes, sorry, that was a typo when I pasted it here, but in the file it's correctly written. I downloaded the .tar.gz, untarred it, edited the config file and then ran bin/elasticsearch; Workplace Search is started with /usr/share/enterprise-search/bin/enterprise-search.
elasticsearch.yml
http.port: 9200
network.host: 0.0.0.0
discovery.seed_hosts: []
xpack.security.enabled: true
xpack.security.authc.api_key.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: elastic-certificates.p12
xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: http.p12
xpack:
  security:
    authc:
      realms:
        native:
          native1:
            order: 0
enterprise-search.yml
secret_management.encryption_keys: [XXX]
allow_es_settings_modification: true
elasticsearch.host: https://192.168.1.140:9200
elasticsearch.username: elastic
elasticsearch.password: XXX
elasticsearch.ssl.enabled: true
elasticsearch.ssl.certificate_authority: /home/prueba_rally/elasticsearch-7.8.0/config/elasticsearch-ca.pem
elasticsearch.ssl.verify: false
ent_search.external_url: https://192.168.1.140:3002
ent_search.listen_host: 0.0.0.0
ent_search.listen_port: 3002
ent_search.auth.source: standard
ent_search.ssl.enabled: true
ent_search.ssl.keystore.path: /home/x/elasticsearch-7.8.0/config/elastic-stack-ca.p12
I need to implement this I think: https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/_encrypted_communication.html
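For reference, the linked guide configures the low-level Elasticsearch REST client to trust the CA that signed the node certificate. A minimal sketch along those lines (the CA path and host name are assumptions taken from the configs above, and this only covers the Elasticsearch side, not the Workplace Search client):

import org.apache.http.HttpHost;
import org.apache.http.ssl.SSLContextBuilder;
import org.apache.http.ssl.SSLContexts;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;

import javax.net.ssl.SSLContext;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.security.cert.CertificateFactory;

public class TrustedCaClient {
    public static RestClientBuilder builder() throws Exception {
        // Path to the CA that signed the node certificate (assumption).
        Path caCertificatePath = Paths.get("config/elasticsearch-ca.pem");
        CertificateFactory factory = CertificateFactory.getInstance("X.509");
        Certificate trustedCa;
        try (InputStream is = Files.newInputStream(caCertificatePath)) {
            trustedCa = factory.generateCertificate(is);
        }
        // Build an in-memory trust store containing only that CA.
        KeyStore trustStore = KeyStore.getInstance("pkcs12");
        trustStore.load(null, null);
        trustStore.setCertificateEntry("ca", trustedCa);
        SSLContextBuilder sslContextBuilder =
                SSLContexts.custom().loadTrustMaterial(trustStore, null);
        final SSLContext sslContext = sslContextBuilder.build();
        return RestClient.builder(new HttpHost("elasticsearch", 9200, "https"))
                .setHttpClientConfigCallback(
                        httpClientBuilder -> httpClientBuilder.setSSLContext(sslContext));
    }
}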
Or add the option described in #969 to disable the verification.
It's not yet possible unless you fork the branch and write the modifications yourself...
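If you do fork it, the bypass requested in #969 would roughly amount to giving the Workplace Search client a trust-all SSLContext and a no-op hostname verifier. This is only a hypothetical sketch of such a modification, assuming the client is built through the standard JAX-RS ClientBuilder (which the javax.ws.rs.ProcessingException in the logs suggests); it is not an existing fscrawler option, and it disables certificate validation entirely, so it is suitable for testing only:

import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;
import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;

public class TrustAllClient {
    public static Client build() throws Exception {
        // Trust manager that accepts any certificate chain (testing only).
        TrustManager[] trustAll = new TrustManager[] {
            new X509TrustManager() {
                public void checkClientTrusted(X509Certificate[] chain, String authType) { }
                public void checkServerTrusted(X509Certificate[] chain, String authType) { }
                public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
            }
        };
        SSLContext sslContext = SSLContext.getInstance("TLS");
        sslContext.init(null, trustAll, new SecureRandom());
        return ClientBuilder.newBuilder()
                .sslContext(sslContext)                        // accept any certificate
                .hostnameVerifier((hostname, session) -> true) // skip hostname checks
                .build();
    }
}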
Thank you very much, I'm going to try to do the same as described in #969.
Can the folder URL point to a folder on a remote computer over the internet, e.g. over http or ftp? For example:
fs:
  url: "www.google.com/folder"
No for http. It's not a website crawler.
Thanks. But can it work over ftp or sftp?
Over ssh: https://fscrawler.readthedocs.io/en/latest/admin/fs/ssh.html
Over ftp: see #147 which needs to be implemented.
Thanks. The --debug option returned additional information: https://discuss.elastic.co/t/fscrawler-for-es-clustering/216939/6. Can it be set up the other way around, with a local folder and a remote Elasticsearch instance, with or without a private key in a format like .pem? Here is the working configuration file with a remote folder and a local Elasticsearch instance:
---
name: "index_ssh"
server:
  hostname: "ip-address.eu-west-number.compute.amazonaws.com"
  port: 22
  username: "server_name"
  protocol: "ssh"
  pem_path: "key.pem"
fs:
  url: "/home/ubuntu/PDFRecored"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  username: "username"
  password: "password"