
feature: import only additional documents for datalake

Open · cforce opened this issue 1 year ago · 1 comment

For local files, MD5 checks are conducted using pre-generated .md5 files. However, when working with the data lake, no such checks are implemented, resulting in each file being reprocessed even if it remains unchanged.

The data lake file strategy should skip any file whose MD5 hash already matches the one recorded in content storage/the index. To make that possible, the data lake ingestion (and blob content storage, when the --skipblobs option is not used) should store an MD5 value as blob metadata. Before adding or downloading a file, this can be checked cheaply by retrieving the blob's MD5 metadata or performing an index search. Ideally, all chunks of a file would carry the same MD5 in the index, so a single match is enough to confirm the file is already known. Alternatively, the MD5 can be stored as metadata on the "copied" blob itself; a minimal sketch of that variant follows.
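A rough sketch of what the metadata check could look like, assuming Python and the azure-storage-blob SDK; the `md5` metadata key and the helper names are illustrative and not part of the existing prepdocs code:

```python
import hashlib

from azure.core.exceptions import ResourceNotFoundError
from azure.storage.blob import BlobServiceClient


def file_md5(data: bytes) -> str:
    """Hex MD5 digest of the raw file contents."""
    return hashlib.md5(data).hexdigest()


def needs_reprocessing(blobs: BlobServiceClient, container: str, name: str, new_md5: str) -> bool:
    """Return False when the stored blob already carries the same MD5 metadata."""
    blob = blobs.get_blob_client(container, name)
    try:
        existing = blob.get_blob_properties().metadata.get("md5")
    except ResourceNotFoundError:
        return True  # never ingested before
    return existing != new_md5


def upload_with_md5(blobs: BlobServiceClient, container: str, name: str, data: bytes) -> None:
    """Upload the copied blob and persist its MD5 as metadata for later skip checks."""
    blob = blobs.get_blob_client(container, name)
    blob.upload_blob(data, overwrite=True, metadata={"md5": file_md5(data)})
```

The same `md5` value could also be written into each chunk's index document, so that a single search hit is enough to confirm the file is already known.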

Additionally, an optional prepdocs parameter that only processes files whose source modification (touch) date is newer than a given threshold would be useful, for example for testing. This would allow querying the data lake for only the files changed since the last import or job run, based on that run's persisted timestamp.
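A hedged sketch of how such a filter could work against Azure Data Lake Storage Gen2, assuming the azure-storage-file-datalake SDK; the function name and the idea of passing the last run's timestamp as the threshold are hypothetical:

```python
from datetime import datetime, timezone
from typing import Iterator

from azure.storage.filedatalake import FileSystemClient


def paths_modified_since(filesystem: FileSystemClient, directory: str, threshold: datetime) -> Iterator[str]:
    """Yield only data lake file paths touched after the threshold (e.g. the last job run)."""
    for path in filesystem.get_paths(path=directory, recursive=True):
        if not path.is_directory and path.last_modified >= threshold:
            yield path.name


# Example: only pick up files changed since the last persisted run timestamp.
# last_run = datetime(2024, 10, 1, tzinfo=timezone.utc)
# for name in paths_modified_since(fs_client, "input-docs", last_run):
#     print("would reprocess", name)
```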

cforce · Oct 23 '24 09:10