Crawl dataset improvements - list the S3 subfolders to choose what should be crawled when using the prefix option
Is your idea related to a problem? Please describe.
- to crawl subfolders, the user first has to take a detour and look up the exact S3 subfolder path before toggling the option to crawl an S3 prefix
- the user also gets little feedback on 1) progress and 2) what is being crawled
- afterwards, the user must hit the synchronize button manually
Describe the solution you'd like
- when crawling a dataset whose data was uploaded to a subfolder of the bucket, it would be convenient to list all subfolders and let the user choose one or more of them to crawl
- showing a message with which folders will be crawled, and the result of the crawler, would be much appreciated
- automatically synchronizing tables after the crawler has added something in Glue would also be user-friendly
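A minimal sketch of how the subfolder listing could work in the backend, assuming boto3 is available; the function names here are illustrative, not part of the data.all codebase. Listing with a `/` delimiter makes S3 return the immediate "sub-folders" as `CommonPrefixes`:

```python
def extract_prefixes(pages):
    """Collect sub-folder prefixes from list_objects_v2 response pages."""
    prefixes = []
    for page in pages:
        for cp in page.get("CommonPrefixes", []):
            prefixes.append(cp["Prefix"])
    return prefixes


def list_subfolders(bucket, prefix=""):
    """Return the immediate sub-folders under a prefix, using '/' as delimiter."""
    import boto3  # imported here so the pure helper above stays dependency-free

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/")
    return extract_prefixes(pages)
```

The returned prefixes could then be rendered as the selectable folder list in the UI.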
Having some type of nested tree / file structure display of a bucket's folders would be really neat and a great improvement to the user experience. Not only could this be used for selecting subfolders to crawl, but also as a separate tab in data.all datasets to give users a quick representation of the data in their S3 bucket.
A notification when crawling is complete and a trigger to re-sync tables would also be nice user-experience add-ons - we would need to add some type of waiter logic in the backend to wait until the crawler is done running and then perform the subsequent steps.
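The waiter logic mentioned above could look roughly like the sketch below. The polling loop is kept separate from boto3 so it can be tested in isolation; `wait_for_crawler` and its parameters are hypothetical names, not existing data.all code. Glue's `get_crawler` reports the crawler `State` as `RUNNING`, `STOPPING`, or `READY`:

```python
import time


def wait_for_crawler(get_state, poll_seconds=30, timeout_seconds=1800,
                     sleep=time.sleep):
    """Poll until the crawler state returns to READY.

    get_state: callable returning the current crawler state string.
    Returns True when READY is observed, False on timeout.
    """
    waited = 0
    while waited < timeout_seconds:
        if get_state() == "READY":
            return True
        sleep(poll_seconds)
        waited += poll_seconds
    return False


def glue_crawler_state(crawler_name):
    """Fetch the current state of a Glue crawler via boto3."""
    import boto3

    glue = boto3.client("glue")
    return glue.get_crawler(Name=crawler_name)["Crawler"]["State"]
```

Once `wait_for_crawler(lambda: glue_crawler_state(name))` returns True, the backend could send the completion notification and trigger the table re-sync.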
We will consider the above in our planning and determine how best to prioritize. Alternatively, if you have the bandwidth and want to implement it on your own, we are happy to support and guide you through the implementation and contribution process. Do let us know your thoughts!
This could be implemented alongside #428