Reorg all the EDA files
We often do a similar exercise for every data set we on-board (like https://github.com/causify-ai/csfy/issues/5938)
We want to have a library to perform a standard set of transformations (see the sketch below):
- For each column, compute stats
- Infer the correct type of each column
- Compute how many values / what percentage are non-null and non-zero
- Compute the set of values
- If it's a timestamp, compute the distribution over time
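A minimal sketch of the per-column stats, assuming pandas; the function name and the `max_uniques` threshold are placeholders, and the real version should reuse the helpers in hpandas / hdataframe:

```python
import pandas as pd


def compute_column_stats(df: pd.DataFrame, max_uniques: int = 20) -> pd.DataFrame:
    """
    Compute per-column EDA stats for a dataframe.

    Placeholder sketch: the real implementation should reuse the helpers
    in hpandas.py / hdataframe.py.
    """
    rows = []
    for col in df.columns:
        srs = df[col]
        stats = {
            "column": col,
            "inferred_dtype": pd.api.types.infer_dtype(srs),
            "pct_non_null": 100.0 * srs.notna().mean(),
            "pct_non_zero": 100.0 * (srs != 0).mean(),
            "num_unique": srs.nunique(),
        }
        # Report the full set of values only for low-cardinality columns.
        if stats["num_unique"] <= max_uniques:
            stats["values"] = srs.dropna().unique().tolist()
        rows.append(stats)
    return pd.DataFrame(rows).set_index("column")
```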
The goal is to automate the EDA part a bit and see how far we can push the current generation of LLMs. We want to use an LLM to understand what the columns mean and generate a template for the analysis.
The output is a notebook and EDA gdocs like the ones that we have generated for Enel.
We have pretty much everything already built.
- The type conversion is already in hpandas
- There are tons of functions in:
  - ./helpers_root/helpers/hpandas.py and hdataframe.py
  - Also in //cmamp there are ./core/explore.py and ./core/test/test_explore.py
The steps are:
- [ ] create a catalogue of the classes of all the EDA functions we have
- [ ] reorg them properly
- [ ] then unit test them
- [ ] finally, build an LLM flow that, given a dataframe and a library of functions, generates a script to perform the analysis (sketch below)
This is like an AutoEDA agent.
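As a rough illustration of the last step, here is a sketch of how the prompt for such an agent could be assembled; `build_autoeda_prompt` and the prompt wording are hypothetical, not an existing API:

```python
import pandas as pd


def build_autoeda_prompt(df: pd.DataFrame, catalogue_md: str) -> str:
    """
    Build a prompt asking an LLM to generate an EDA script for `df`
    using only the functions listed in the catalogue.

    `catalogue_md` is the markdown catalogue of EDA functions
    (function type, path, name, docstring).
    """
    schema = "\n".join(f"- {col}: {dtype}" for col, dtype in df.dtypes.items())
    return (
        "You are an EDA assistant.\n"
        "Infer what each column means and generate a Python script that\n"
        "analyzes the dataframe using ONLY the functions listed below.\n\n"
        f"Dataframe schema:\n{schema}\n\n"
        f"Available functions:\n{catalogue_md}\n"
    )
```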
> We often do a similar exercise for every data set we on-board (like https://github.com/causify-ai/csfy/issues/5938)
>
> Also in //cmamp there are ./core/explore.py, ./core/test/test_explore.py
@gpsaggese I do not have access to the cmamp repo, could you clarify the issue and the scripts a bit more?
@aangelo9 we can't add you to cmamp for compliance reasons.
Let's pass this to @Shaunak01, and he can create tasks for @aangelo9 that are OK given his access privileges.
The first step is to create a gdoc with a list of all the "classes" of EDA functions we have, e.g.,
- data conversion
- compute EDA statistics
- transform
- visualization
What are the files that contain interesting stuff? See the list of files above.
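To make the gdoc easy to turn into code later, each entry could be captured as a small structured record; a sketch, where the field names are just a suggestion:

```python
import dataclasses
from typing import List


@dataclasses.dataclass
class EdaFunctionEntry:
    # One of: "data conversion", "compute EDA statistics", "transform",
    # "visualization".
    function_type: str
    script_path: str
    function_name: str
    docstring: str = ""


def to_markdown_table(entries: List[EdaFunctionEntry]) -> str:
    """Render the catalogue as a markdown table for the gdoc."""
    lines = [
        "| Function Type | Script Path | Function Name | Docstring |",
        "| --- | --- | --- | --- |",
    ]
    for entry in entries:
        lines.append(
            f"| {entry.function_type} | {entry.script_path} |"
            f" {entry.function_name} | {entry.docstring} |"
        )
    return "\n".join(lines)
```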
FYI @sonniki @gitpaulsmith
I'll compile the list for the tutorials and helpers repos.
Sounds good. You can work with @Shaunak01 on this, since there may be even more code in cmamp under core, so you guys need to jump through some hoops across repos. We can do things incrementally and start from helpers.
What happens is that people write code in a local utils, because they don't know where all the functions are. So we can:
- either clean up after them after the reorg
- or move the general code to //helpers
- [ ] It might make sense to even do a grep of all the functions that use matplotlib in the code base, the ones that process a df, etc. (rough sketch below)
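A rough sketch of that grep in Python; the regex heuristic is a placeholder, just to build a first candidate list:

```python
import os
import re


def grep_matplotlib_functions(root_dir: str) -> None:
    """
    Print all functions defined in .py files that mention matplotlib.

    Purely heuristic (regex-based, no real parsing): top-level `def`s in
    any file whose text contains "matplotlib".
    """
    func_re = re.compile(r"^def\s+(\w+)\s*\(", re.MULTILINE)
    for dirpath, _, filenames in os.walk(root_dir):
        for filename in filenames:
            if not filename.endswith(".py"):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8") as f:
                text = f.read()
            if "matplotlib" not in text:
                continue
            for match in func_re.finditer(text):
                print(f"{path}: {match.group(1)}")
```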
Gdoc with list of EDA functions: https://docs.google.com/document/d/1ChZ8jvoHC0pUvQfAvs7XI-BjykIz1g8ARp3LMr4UvQA/edit.
We want to make a script that scrapes EDA functions, then gets the lines and docstring for each function.
Specs
Create generate_context.py to:
- Scrape entire .py files
- Extract function metadata: Function Type, Script Path, Function Name, Lines, and Docstring

Main arguments:
- --in-file: input markdown file containing partial metadata: Function Type, Script Path, Function Name
- --out-file: output markdown file, including: Function Type, Script Path, Function Name, Lines, Docstring
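The extraction part can likely be done with the stdlib ast module; a sketch of the core, where extract_function_metadata is a hypothetical name and only the Lines / Docstring fields follow the spec above:

```python
import ast
from typing import Dict, Optional


def extract_function_metadata(
    py_path: str, function_name: str
) -> Optional[Dict[str, str]]:
    """
    Return Lines and Docstring for `function_name` in `py_path`,
    or None if the function is not found.
    """
    with open(py_path, encoding="utf-8") as f:
        tree = ast.parse(f.read())
    for node in ast.walk(tree):
        if (
            isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
            and node.name == function_name
        ):
            return {
                # end_lineno is available in Python 3.8+.
                "Lines": f"{node.lineno}-{node.end_lineno}",
                "Docstring": ast.get_docstring(node) or "",
            }
    return None
```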
FYI @gpsaggese @PranavShashidhara
Not sure about the OpenAI part. The types of the files are probably assigned by the user. In practice, the user creates a file with the file name, file type, and function name, and the script completes the rest of the information by parsing the code.
Then the file is fed to an LLM to direct it.
Makes sense?
Understood, so no need to automate scraping functions. Just grab the lines and docstring. @PranavShashidhara
Moving this task into https://github.com/causify-ai/tutorials/issues/629.