connectors-sdk-resources
How to enable the "Purge Items" feature?
Hi, Tsuyoshi from Raytion GmbH here. I read about the Purge Items feature, which is also partially mentioned in your documentation here. I assume this is a feature where items that are stored in the crawl DB but were not fetched in the previous job get cleaned up. Will those documents also be deleted from the Solr collection (they are also registered as Documents)? If yes, does this feature need to be enabled explicitly? We are currently not seeing our unvisited items being deleted. Does the same mechanism also apply to Access Controls?
Thank you in advance.
Hi @tsuyoshihamano. What version of the connectors-sdk and Fusion are you using?
Hi @mwmitchell, it is SDK version 3.0.0 and Fusion 5.3.
@mwmitchell are you familiar with this issue? Christian at Raytion reported it's currently blocking them on the Yammer/MS Teams connector work. We have our update with Raytion tomorrow morning if it'd be helpful for you to join.
@tsuyoshihamano The "Purge Items" feature should work for connectors that do recrawls. Yes, it should delete items from the crawlDb as well as from the content collection if they were not modified in the crawl. This also holds true for AccessControlItem entries in the crawlDb, but it does not delete anything from the AccessControl collection.
Purge items should be enabled by default in 5.3.
Are you emitting a checkpoint, as in an incremental crawl?
Thanks @puneetkhanal, we are not emitting checkpoints. Our documents are emitted directly, without any checkpoints. I guess I need more explanation of how deletions are detected internally within your infrastructure. I expected documents in the crawl DB to be detected as deleted if they were not emitted in the subsequent job. Could you confirm? Also, for Access Controls, we need to emit the deletion explicitly for them to be deleted from the Access Control collection, right?
@tsuyoshihamano I looked further into purging stray items. It only works in a special case: the connector needs to emit a checkpoint and emit candidates with isTransient:false. This was a special scenario where a customer had an incremental connector that emitted a checkpoint but could not figure out which items to delete, so in that case we would look at the crawlDb and delete outdated items.
We would like to understand more about your use case. If you are implementing a recrawl or incremental connector, you need to emit a delete for an item in order to remove it from the Solr collection. Purge items works only in the special case I mentioned above (and since it exists for that special case only, it is better not to rely on it, as it is subject to change).
Ideally, it would be better if the connector could figure out by itself which items it needs to delete and emit a delete for that item. The same case holds for AccessControlItem also.
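For illustration, one way a connector could work out its own deletes is to diff the IDs seen in the previous crawl against the IDs seen in the current one. This is only a sketch in plain Java with no SDK calls: `DeleteDiffSketch` and `idsToDelete` are hypothetical names, and where the previous IDs are kept (checkpoint metadata, an external store) is an assumption left open here.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of the connectors SDK.
final class DeleteDiffSketch {

  // Returns the IDs that were seen in the previous crawl but are missing from the current one.
  static Set<String> idsToDelete(Set<String> previousIds, Set<String> currentIds) {
    Set<String> stale = new TreeSet<>(previousIds); // sorted copy, inputs stay untouched
    stale.removeAll(currentIds);
    return stale;
  }

  public static void main(String[] args) {
    Set<String> previous = Set.of("doc-1", "doc-2", "doc-3");
    Set<String> current = Set.of("doc-1", "doc-3", "doc-4");
    // The connector would emit an explicit delete for each of these IDs.
    System.out.println(idsToDelete(previous, current)); // prints [doc-2]
  }
}
```

Each ID returned would then be emitted as a delete through the fetch context, and the same diff could be applied to access control IDs.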
@puneetkhanal ,
It sounds like our use case is similar to the first one you described. Currently we emit a candidate for each document. Do we need a checkpoint as well, or is one of them sufficient? If the candidate alone is sufficient, do we need to explicitly set isTransient to false (I saw in the documentation that it is false by default for candidates)? Do we need to set further information (e.g. metadata) on the candidate, given that the "Purge Items" feature does not work with our current implementation?
@tsuyoshihamano yeah, isTransient is false by default, so you don't need to do anything. You need to emit a checkpoint at the end of the first crawl (metadata is optional). In the next crawl your connector will only get that checkpoint item, and based on that checkpoint you can emit further candidates.
@puneetkhanal, would the next crawl then also emit a checkpoint at the end of the crawl? Will the deletions then happen based on the diff of candidates between the checkpoints of different crawls?
@tsuyoshihamano yeah, the subsequent crawl will update the checkpoint with any additional information that may be required for the next crawl, and whenever the next crawl ends, it will check the crawlDb to find stray or obsolete items and delete them.
It checks for items that have not been modified in the current crawl and then deletes them from crawlDb and solr collection (content collection).
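To make the behavior described above concrete, here is a toy model in plain Java. It is not Fusion's implementation: `CrawlDbSketch`, the crawl counter, and the in-memory map are all made up for illustration. Each item remembers the crawl in which it was last emitted, and when a crawl that emitted a checkpoint ends, every item not touched in that crawl is removed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of the purge step; all names are hypothetical, not SDK or Fusion classes.
final class CrawlDbSketch {
  private final Map<String, Integer> lastSeenCrawl = new HashMap<>();
  private int currentCrawl = 0;
  private boolean checkpointEmitted = false;

  void startCrawl() {
    currentCrawl++;
    checkpointEmitted = false;
  }

  // Emitting a candidate marks the item as modified in the current crawl.
  void emitCandidate(String id) {
    lastSeenCrawl.put(id, currentCrawl);
  }

  void emitCheckpoint() {
    checkpointEmitted = true;
  }

  // Removes items not modified in the current crawl, but only if a checkpoint was emitted.
  Set<String> endCrawl() {
    Set<String> purged = new TreeSet<>();
    if (!checkpointEmitted) {
      return purged; // assumed special case: no checkpoint, no purge
    }
    lastSeenCrawl.entrySet().removeIf(entry -> {
      boolean stale = entry.getValue() < currentCrawl;
      if (stale) {
        purged.add(entry.getKey());
      }
      return stale;
    });
    return purged;
  }

  Set<String> ids() {
    return new TreeSet<>(lastSeenCrawl.keySet());
  }

  public static void main(String[] args) {
    CrawlDbSketch db = new CrawlDbSketch();

    db.startCrawl();
    db.emitCandidate("a");
    db.emitCandidate("b");
    db.emitCandidate("c");
    db.emitCheckpoint();
    System.out.println(db.endCrawl()); // prints []

    db.startCrawl();
    db.emitCandidate("a");
    db.emitCandidate("c");
    db.emitCheckpoint();
    System.out.println(db.endCrawl()); // prints [b]
    System.out.println(db.ids());      // prints [a, c]
  }
}
```

In this model "b" disappears after the second crawl because it was not emitted again. Without the checkpoint nothing would be purged, which matches the situation reported at the start of this thread.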
Thanks @puneetkhanal ,
I will give it a try now. So, to get rid of Access Controls in the Access Control collection, a manual delete via deleteAccessControl() is the only way, right?
Yeah, that is the correct way.
/**
* Example Usage: {@code fetchContext.newDeleteAccessControlItem(id)
* .withQuery(Collections.singletonMap("name","xyz"), false)
* .emit();
* }
*/