Fscrawler not connecting to workplace search over https
Describe the bug
Fscrawler works properly when pointing to Elasticsearch, but when pointing to Workplace Search it does not work. The output shows no errors; it says the folder is being crawled every 15 minutes, but nothing else happens.
Job Settings
name: "job_name"
fs:
url: "/tmp/es"
update_rate: "15m"
excludes:
- "*/~*"
json_support: false
filename_as_id: false
add_filesize: true
remove_deleted: true
add_as_inner_object: false
store_source: false
index_content: true
attributes_support: false
raw_metadata: false
xml_support: false
index_folders: true
lang_detect: false
continue_on_error: false
ocr:
language: "eng"
enabled: true
pdf_strategy: "ocr_and_text"
follow_symlinks: false
elasticsearch:
nodes:
- url: "https://elasticsearch:9200"
bulk_size: 100
flush_interval: "5s"
byte_size: "10mb"
username: "elastic"
password: "PASSWORD"
workplace_search:
access_token: "TOKEN"
key: "KEY"
server: "https://elasticsearch:3002"
Expected behavior
I'd expect the data to appear in workplace search.
Versions: CentOS 7, fscrawler-es7-2.7-SNAPSHOT, Elasticsearch and Workplace Search 7.9.0
If you downloaded the version from the Sonatype snapshots repository, that's expected, as the PR has not been merged yet.
Where did you download it from?
I've downloaded from sonatype: https://oss.sonatype.org/content/repositories/snapshots/fr/pilato/elasticsearch/crawler/fscrawler-es7/2.7-SNAPSHOT/
Sorry, I'm a bit confused. I went to https://fscrawler.readthedocs.io/en/wip-workplace_search/admin/fs/wpsearch.html and thought that the Workplace Search connection was a new feature in fscrawler 2.7. Should I download it from somewhere else? I'm reading that page but I don't see any other way. Thank you for your time!
The branch you are looking at is a WIP. I just generated a version and shared it here: https://www.dropbox.com/s/07msuwno2gw3noq/fscrawler-es7-2.7-SNAPSHOT.zip?dl=0
Thank you so much, now it's working, awesome work! Is there a way to bypass SSL verification, like curl's -k option?
What do you mean? Could you share the config file and the error?
I have a domain configured for the Elasticsearch connection in fscrawler and it works, but when I use the same domain for Workplace Search I'm getting:
15:19:47,128 ERROR [f.p.e.c.f.t.w.WPSearchEngine] javax.ws.rs.ProcessingException: javax.net.ssl.SSLHandshakeException: java.security.cert.CertificateException: No name matching
If I put the IP instead of the domain I'm getting: 15:08:29,954 ERROR [f.p.e.c.f.t.w.WPSearchEngine] javax.ws.rs.ProcessingException: javax.net.ssl.SSLHandshakeException: java.security.cert.CertificateException: No subject alternative names present
name: "job_name"
fs:
url: "/tmp/"
update_rate: "15m"
excludes:
- "*/~*"
json_support: false
filename_as_id: false
add_filesize: true
remove_deleted: true
add_as_inner_object: false
store_source: false
index_content: true
attributes_support: false
raw_metadata: false
xml_support: false
index_folders: true
lang_detect: false
continue_on_error: false
ocr:
language: "eng"
enabled: true
pdf_strategy: "ocr_and_text"
follow_symlinks: false
elasticsearch:
nodes:
- url: "https://elasticsearch:9200"
bulk_size: 100
flush_interval: "5s"
byte_size: "10mb"
username: "elastic"
password: "PASSWORD"
orkplace_search:
access_token: "TOKEN"
key: "KEY"
server: "https://elasticsearch:3002"
If I put the IP instead of the domain in the Elasticsearch connection I'm getting:
15:04:07,918 FATAL [f.p.e.c.f.c.FsCrawlerCli] We can not start Elasticsearch Client. Exiting. java.io.IOException: Host name '192.168.1.140' does not match the certificate subject provided by the peer (CN=elasticsearch)
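All three errors are hostname-verification failures: the certificate presented by the server does not list the hostname or IP the client is connecting to (in the Elasticsearch case it only carries CN=elasticsearch, and the Workplace Search certificate has no subject alternative names at all). If you want to confirm which names a certificate actually carries, here is a minimal, hypothetical Java sketch (keystore path and password are placeholders, not values from this thread) that dumps the subject and SANs from a PKCS12 keystore such as http.p12:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.KeyStore;
import java.security.cert.X509Certificate;
import java.util.Collections;

public class ShowCertNames {
    public static void main(String[] args) throws Exception {
        // Keystore path and password are assumptions - adjust to your setup.
        KeyStore ks = KeyStore.getInstance("PKCS12");
        try (InputStream is = Files.newInputStream(Paths.get("config/http.p12"))) {
            ks.load(is, "changeit".toCharArray());
        }
        for (String alias : Collections.list(ks.aliases())) {
            X509Certificate cert = (X509Certificate) ks.getCertificate(alias);
            if (cert == null) {
                continue;
            }
            System.out.println("Alias:   " + alias);
            System.out.println("Subject: " + cert.getSubjectX500Principal());
            // Returns null when the certificate has no SANs, which is exactly
            // what triggers "No subject alternative names present".
            System.out.println("SANs:    " + cert.getSubjectAlternativeNames());
        }
    }
}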
orkplace_search
this is a typo I guess?
How are you starting Elasticsearch and Workplace search?
Yes, sorry, that was a typo when I pasted it here, but in the file it's correctly written. I downloaded the .tar.gz, untarred it, edited the config file and then ran bin/elasticsearch; Workplace Search is started with /usr/share/enterprise-search/bin/enterprise-search.
elasticsearch.yml
http.port: 9200
network.host: 0.0.0.0
discovery.seed_hosts: []
xpack.security.enabled: true
xpack.security.authc.api_key.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: elastic-certificates.p12
xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: http.p12
xpack:
  security:
    authc:
      realms:
        native:
          native1:
            order: 0
enterprise-search.yml
secret_management.encryption_keys: [XXX]
allow_es_settings_modification: true
elasticsearch.host: https://192.168.1.140:9200
elasticsearch.username: elastic
elasticsearch.password: XXX
elasticsearch.ssl.enabled: true
elasticsearch.ssl.certificate_authority: /home/prueba_rally/elasticsearch-7.8.0/config/elasticsearch-ca.pem
elasticsearch.ssl.verify: false
ent_search.external_url: https://192.168.1.140:3002
ent_search.listen_host: 0.0.0.0
ent_search.listen_port: 3002
ent_search.auth.source: standard
ent_search.ssl.enabled: true
ent_search.ssl.keystore.path: /home/x/elasticsearch-7.8.0/config/elastic-stack-ca.p12
I need to implement this I think: https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/_encrypted_communication.html
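For reference, the linked guide configures the low-level Elasticsearch REST client to trust the CA that signed the node certificate. A minimal sketch along those lines (the CA path and host name are assumptions taken from the configs above, and this only covers the Elasticsearch side, not the Workplace Search client):

import org.apache.http.HttpHost;
import org.apache.http.ssl.SSLContextBuilder;
import org.apache.http.ssl.SSLContexts;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;

import javax.net.ssl.SSLContext;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.security.cert.CertificateFactory;

public class TrustedCaClient {
    public static RestClientBuilder builder() throws Exception {
        // Path to the CA that signed the node certificate (assumption).
        Path caCertificatePath = Paths.get("config/elasticsearch-ca.pem");
        CertificateFactory factory = CertificateFactory.getInstance("X.509");
        Certificate trustedCa;
        try (InputStream is = Files.newInputStream(caCertificatePath)) {
            trustedCa = factory.generateCertificate(is);
        }
        // Build an in-memory trust store containing only that CA.
        KeyStore trustStore = KeyStore.getInstance("pkcs12");
        trustStore.load(null, null);
        trustStore.setCertificateEntry("ca", trustedCa);
        SSLContextBuilder sslContextBuilder =
                SSLContexts.custom().loadTrustMaterial(trustStore, null);
        final SSLContext sslContext = sslContextBuilder.build();
        return RestClient.builder(new HttpHost("elasticsearch", 9200, "https"))
                .setHttpClientConfigCallback(
                        httpClientBuilder -> httpClientBuilder.setSSLContext(sslContext));
    }
}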
Or add the option described in #969 to disable the verification.
It's not yet possible unless you fork the branch and write the modifications yourself...
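If you do fork it, the bypass requested in #969 would roughly amount to giving the Workplace Search client a trust-all SSLContext and a no-op hostname verifier. This is only a hypothetical sketch of such a modification, assuming the client is built through the standard JAX-RS ClientBuilder (which the javax.ws.rs.ProcessingException in the logs suggests); it is not an existing fscrawler option, and it disables certificate validation entirely, so it is suitable for testing only:

import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;
import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;

public class TrustAllClient {
    public static Client build() throws Exception {
        // Trust manager that accepts any certificate chain (testing only).
        TrustManager[] trustAll = new TrustManager[] {
            new X509TrustManager() {
                public void checkClientTrusted(X509Certificate[] chain, String authType) { }
                public void checkServerTrusted(X509Certificate[] chain, String authType) { }
                public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
            }
        };
        SSLContext sslContext = SSLContext.getInstance("TLS");
        sslContext.init(null, trustAll, new SecureRandom());
        return ClientBuilder.newBuilder()
                .sslContext(sslContext)                        // accept any certificate
                .hostnameVerifier((hostname, session) -> true) // skip hostname checks
                .build();
    }
}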
Thank you very much, I'm going to try to do the same as described in #969.
Can the folder URL point to a folder on a remote computer over the internet, e.g. over http or ftp? For example:
fs:
  url: "www.google.com/folder"
No for http. It's not a website crawler.
Thanks. But can it work over ftp or sftp?
Over ssh: https://fscrawler.readthedocs.io/en/latest/admin/fs/ssh.html
Over ftp: see #147 which needs to be implemented.
Thanks. The --debug option returned additional information: https://discuss.elastic.co/t/fscrawler-for-es-clustering/216939/6. Can it be set up the other way around, with a local folder and a remote Elasticsearch instance, with or without a private key in a format like .pem? Here is the working configuration file with a remote folder and a local Elasticsearch instance:
---
name: "index_ssh"
server:
  hostname: "ip-address.eu-west-number.compute.amazonaws.com"
  port: 22
  username: "server_name"
  protocol: "ssh"
  pem_path: "key.pem"
fs:
  url: "/home/ubuntu/PDFRecored"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  username: "username"
  password: "password"