Pascal Essiembre

74 comments by Pascal Essiembre

@dgomesbr, can you elaborate on what your exporter would look like? There are a few challenges with cloning websites in general. There are dynamic, JavaScript-rendered ones that do not...

Hello Fabian! Technically, the listeners are not duplicated but rather invoked for ALL events fired by the collector that are an instance of your listener's "accept" method argument type. That includes...

You will find what job failed earlier in the log. I am marking it as a feature request to have it also reported as part of a status summary at...

> Is there a way to "tell" Norconex Collector that an unavailable URL is not a Document deletion? I can think of a few ways. **Do not keep crawl history:**...

To "clean" a repo with 2.9.x, you have little choice but to delete the crawl store (or the entire "workdir" folder). That would address your first bullet. Using "IGNORE" or...

Glad you found a way. The crawl cache only keeps traces of documents from their last session so that they can be compared with the next session. It does not...

Can you please share your full config? What is your `<startURLs>` tag like? If you have stayOnXXX flags set to `true`, external sites won't be crawled regardless of your filters.

I tested with your config and version 2.9.1 and the following worked for me:

```xml
https://domain.domain.com/.*
https://domain.domain.com/.*
```

I replaced your document filters with metadata filters so documents that are...

A new snapshot release was just made with a fix that now considers the "effective" top-level domain for a URL instead of just the last two parts of the domain....
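
To illustrate why "the last two parts of the domain" can differ from the "effective" top-level domain approach, here is a minimal Java sketch. It is not the collector's actual implementation: the class name, both method names, and the tiny hard-coded suffix set are hypothetical stand-ins for a real public-suffix list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class EffectiveTldSketch {

    // Hypothetical, tiny stand-in for a real public-suffix list.
    private static final Set<String> PUBLIC_SUFFIXES =
            new HashSet<>(Arrays.asList("com", "org", "uk", "co.uk"));

    // Naive approach: keep only the last two labels of the host name.
    static String lastTwoLabels(String host) {
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
            return host;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    // Suffix-aware approach: find the longest known public suffix,
    // then keep exactly one more label in front of it.
    static String registrableDomain(String host) {
        String[] labels = host.split("\\.");
        // Candidates are tried from longest to shortest.
        for (int i = 1; i < labels.length; i++) {
            String candidate = String.join(".",
                    Arrays.copyOfRange(labels, i, labels.length));
            if (PUBLIC_SUFFIXES.contains(candidate)) {
                return labels[i - 1] + "." + candidate;
            }
        }
        // No known suffix: fall back to the host as given.
        return host;
    }

    public static void main(String[] args) {
        String host = "www.example.co.uk";
        System.out.println(lastTwoLabels(host));
        System.out.println(registrableDomain(host));
    }
}
```

With the naive approach, every site under a multi-part suffix such as `co.uk` would be treated as the same domain, whereas the suffix-aware approach keeps them distinct.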

No, there are currently none. Good idea though. I will mark it as a feature request. In the meantime, if you know your Java, you can implement your own solution by...