Generation of HTML Documentation

Open rubenssoto opened this issue 4 years ago • 4 comments

Hello :)

There is a great piece of software I use at my company called Great Expectations; it's a tool for checking data quality. It has a feature called Data Docs, which is HTML documentation of the data quality checks. I host all the HTML in an S3 bucket so the whole company can access it.

https://greatexpectations.io/

Whale could have a feature like this: simple HTML with all the table documentation and a few simple fields for searching the data.
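
As a rough illustration of the idea (not something whale provides), a page like this could be generated with a short Python script. The sketch assumes the table documentation already exists as markdown files somewhere under one directory; the function name `build_index`, the `table-doc` CSS class, and the `__ROWS__` placeholder are all hypothetical.

```python
import html
from pathlib import Path

# Page skeleton with a search box; __ROWS__ is a hypothetical placeholder
# that gets swapped for one <details> element per markdown file.
PAGE_TEMPLATE = """<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>Table documentation</title></head>
<body>
<input placeholder="Search tables" oninput="filterDocs(this.value)">
__ROWS__
<script>
function filterDocs(term) {
  var needle = term.toLowerCase();
  document.querySelectorAll("details.table-doc").forEach(function (el) {
    // Hide every entry whose text does not contain the search term.
    el.hidden = !el.textContent.toLowerCase().includes(needle);
  });
}
</script>
</body>
</html>
"""


def build_index(metadata_dir: Path, output_file: Path) -> int:
    """Write one static HTML page listing every *.md file under metadata_dir.

    Returns the number of files included. The markdown is shown as escaped
    plain text; it is not rendered to HTML.
    """
    rows = []
    for md_file in sorted(metadata_dir.rglob("*.md")):
        name = html.escape(md_file.stem)
        body = html.escape(md_file.read_text(encoding="utf-8"))
        rows.append(
            f'<details class="table-doc"><summary>{name}</summary>'
            f"<pre>{body}</pre></details>"
        )
    page = PAGE_TEMPLATE.replace("__ROWS__", "\n".join(rows))
    output_file.write_text(page, encoding="utf-8")
    return len(rows)
```

Calling something like `build_index(Path("metadata"), Path("index.html"))` (with `metadata` standing in for wherever the markdown files live) would produce a single file that could be uploaded to an S3 bucket in the same way as the Data Docs pages.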

thank you

Oct 31 '20 02:10 rubenssoto

https://docs.greatexpectations.io/en/latest/reference/core_concepts/data_docs.html

Oct 31 '20 06:10 rubenssoto

Hm I'll look into how feasible this might be in a low-effort way!

If the goal is just to make a basic interface available to others, I recently discovered gotty, which allows you to serve terminal apps on the web. It basically just lets users access the whale CLI from their browser (and it seems to support concurrent usage). I did some basic tests and it seems to work pretty nicely. If this sort of thing is sufficient, I can write up some quick docs. 😛

I'll look into rendering options as well, but until I/someone can get around to this, here are a few other options (@rubenssoto I think I mentioned these to you, so I'm guessing they're probably not satisfactory, but listing them here in case others are interested 😉 ):

  • If you use GitHub, GitLab, AWS CodeCommit etc., you can push your whale metadata there, and then leverage their markdown rendering + search capabilities (for instructions on how to set this up to function automatically using CI/CD pipelines, see the docs here).
  • Whale's parent company, Dataframe, has a hosted platform in the works (the catalog will be free for our early users) -- you can sign up for the waitlist here, and you'll be able to have a nice GUI with much richer collaborative functionality in the next few months.
  • A final option is Amundsen, which has a GUI, but it'll be quite a lot more work to set up (you'll need to set up a scheduler like Airflow, write and manage the code to run the scraping job yourself, and manage around 6 or 7 microservices). Keeping your data backed up and stable in these sorts of self-hosted platforms will also require a bit of work.

(I'll start learning react in the meantime 😄 )

Oct 31 '20 21:10 rsyi

No problem @rsyi, I will try to use Git for now until the data catalog interface is ready 👍 I like Amundsen, but it is a lot to take care of. My team is only 3 people, and our goal is to keep things simple and automatic.

I think you already registered me on the beta list, [email protected].

I have some suggestions. If they make sense, please tell me and I will create issues for them.

1 - Today all tables stay in the same directory, so I think it would be more organized if there were an option to create one directory per database.
2 - I don't know if other sources have it, but Glue has location information, which is useful, for example, for letting people know where a table is located in the data lake.
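
To make the first suggestion concrete, here is a minimal sketch of the kind of regrouping I mean. It assumes, purely for illustration, that each table file is named `<database>.<rest>.md`; the function name `group_by_database` and that naming scheme are hypothetical and may not match what whale actually writes.

```python
def group_by_database(file_names):
    """Map each flat file name to a path nested under its database name.

    Assumes (hypothetically) names of the form "<database>.<rest>.md".
    Names that do not fit this pattern are left at the top level.
    """
    layout = {}
    for name in file_names:
        if not name.endswith(".md"):
            layout[name] = name
            continue
        stem = name[: -len(".md")]
        # Split on the first dot only: the prefix is treated as the database.
        database, sep, table = stem.partition(".")
        if sep and table:
            layout[name] = f"{database}/{table}.md"
        else:
            layout[name] = name
    return layout
```

With this, `sales.orders.md` would end up at `sales/orders.md`, while a file such as `notes.md` with no database prefix would stay where it is.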

Oct 31 '20 23:10 rubenssoto

Ah, didn't know Glue had additional info! Yeah, both of those suggestions sound feasible. Open some issues and I'll take a look :)

Nov 01 '20 21:11 rsyi