Crawl dataset improvements - list the S3 subfolders to choose what should be crawled when using the prefix option
Is your idea related to a problem? Please describe.
- to crawl subfolders, the user first has to take a detour and look up the exact S3 subfolder path before toggling the option to crawl an S3 prefix
- the user also gets little feedback on 1) progress and 2) what is being crawled
- afterwards, the user must hit the synchronize button manually
Describe the solution you'd like
- when crawling a dataset whose data was uploaded to a subfolder of the bucket, it would be convenient to list all subfolders and let the user choose one or more of them to crawl
- showing a message with which folders will be crawled, and the result of the crawler, would be much appreciated
- automatically synchronizing tables after the crawler has added something in Glue would also be user-friendly
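A minimal sketch of how the subfolder listing could work in the backend, assuming boto3 is available; the function names here are illustrative, not part of the data.all codebase. Listing with a `/` delimiter makes S3 return the immediate "sub-folders" as `CommonPrefixes`:

```python
def extract_prefixes(pages):
    """Collect sub-folder prefixes from list_objects_v2 response pages."""
    prefixes = []
    for page in pages:
        for cp in page.get("CommonPrefixes", []):
            prefixes.append(cp["Prefix"])
    return prefixes


def list_subfolders(bucket, prefix=""):
    """Return the immediate sub-folders under a prefix, using '/' as delimiter."""
    import boto3  # imported here so the pure helper above stays dependency-free

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/")
    return extract_prefixes(pages)
```

The returned prefixes could then be rendered as the selectable folder list in the UI.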
Having some type of nested tree / file structure display of a bucket's folders would be really neat and a great improvement to the user experience. Not only could this be used for selecting subfolders to crawl, but also as a separate tab in data.all datasets to give users a quick representation of the data in their S3 bucket.
A notification when crawling is complete and a trigger to re-sync tables would also be nice user-experience add-ons - we would need to add some type of waiter logic in the backend to wait until the crawler is done running and then perform the subsequent steps.
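The waiter logic mentioned above could look roughly like the sketch below. The polling loop is kept separate from boto3 so it can be tested in isolation; `wait_for_crawler` and its parameters are hypothetical names, not existing data.all code. Glue's `get_crawler` reports the crawler `State` as `RUNNING`, `STOPPING`, or `READY`:

```python
import time


def wait_for_crawler(get_state, poll_seconds=30, timeout_seconds=1800,
                     sleep=time.sleep):
    """Poll until the crawler state returns to READY.

    get_state: callable returning the current crawler state string.
    Returns True when READY is observed, False on timeout.
    """
    waited = 0
    while waited < timeout_seconds:
        if get_state() == "READY":
            return True
        sleep(poll_seconds)
        waited += poll_seconds
    return False


def glue_crawler_state(crawler_name):
    """Fetch the current state of a Glue crawler via boto3."""
    import boto3

    glue = boto3.client("glue")
    return glue.get_crawler(Name=crawler_name)["Crawler"]["State"]
```

Once `wait_for_crawler(lambda: glue_crawler_state(name))` returns True, the backend could send the completion notification and trigger the table re-sync.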
We will consider the above in our planning and determine how best to prioritize. Alternatively, if you have the bandwidth and want to implement it on your own, we are happy to support and guide you through the implementation and contribution process. Do let us know your thoughts!
This could be implemented alongside #428