Reorg all the EDA files

Open gpsaggese opened this issue 6 months ago • 10 comments

We often do a similar exercise for every data set we on-board (like https://github.com/causify-ai/csfy/issues/5938)

We want to have a library to perform a standard set of transformations:

  • For each column, compute stats:
    • Infer correct type of columns
    • How many values / what percentage are non-null, non-zero
    • The set of values
    • If it's a timestamp, distribution over time
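
A minimal sketch of the per-column stats listed above, assuming plain pandas. The function names `summarize_columns` and `timestamp_distribution` and the `max_unique` cutoff are hypothetical and are not the existing hpandas API:

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame, max_unique: int = 10) -> pd.DataFrame:
    """Return one row of basic EDA statistics per column of `df`."""
    rows = []
    for name in df.columns:
        col = df[name]
        n_total = len(col)
        n_non_null = int(col.notna().sum())
        # Count non-zero values only for numeric columns; nulls are not counted.
        if pd.api.types.is_numeric_dtype(col):
            n_non_zero = int((col.fillna(0) != 0).sum())
        else:
            n_non_zero = None
        uniques = col.dropna().unique()
        rows.append(
            {
                "column": name,
                "inferred_type": pd.api.types.infer_dtype(col, skipna=True),
                "n_non_null": n_non_null,
                "pct_non_null": 100.0 * n_non_null / n_total if n_total else 0.0,
                "n_non_zero": n_non_zero,
                "n_unique": len(uniques),
                # Keep the set of values only when it is small enough to read.
                "values": sorted(map(str, uniques))
                if len(uniques) <= max_unique
                else None,
            }
        )
    return pd.DataFrame(rows).set_index("column")


def timestamp_distribution(col: pd.Series, freq: str = "D") -> pd.Series:
    """Count non-null timestamps per `freq` bucket (daily by default)."""
    return col.dropna().dt.floor(freq).value_counts().sort_index()
```

`summarize_columns(df)` returns a dataframe indexed by column name, and `timestamp_distribution(df["t"])` can be called on any column whose inferred type is a datetime.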

The goal is to automate part of the EDA and see how far we can push the current generation of LLMs. We want to use an LLM to understand what the columns mean and generate a template for the analysis.

The output is a notebook and EDA gdocs like the ones that we have generated for Enel.

gpsaggese avatar Jun 16 '25 20:06 gpsaggese

We have pretty much everything already built.

  1. The type conversion is already in hpandas

  2. There are tons of functions in:

  • ./helpers_root/helpers/hpandas.py and hdataframe.py
  • Also in //cmamp there are ./core/explore.py, ./core/test/test_explore.py

The steps are:

  • [ ] create a catalogue of the classes of "all the functions for EDA" we have
  • [ ] reorg properly
  • [ ] then unit test them
  • [ ] finally create LLMs that, given a dataframe and a library of functions, generate a script to perform the analysis

This is like an AutoEDA agent.

gpsaggese avatar Jun 17 '25 20:06 gpsaggese

We often do a similar exercise for every data set we on-board (like https://github.com/causify-ai/csfy/issues/5938)

  • Also in //cmamp there are ./core/explore.py, ./core/test/test_explore.py

@gpsaggese I do not have access to the cmamp repo. Could you clarify the issue and the scripts a bit more?

aangelo9 avatar Jun 23 '25 22:06 aangelo9

@aangelo9 we can't add you to cmamp for compliance reasons. Let's pass this to @Shaunak01, and he can create tasks for @aangelo9 that are OK given his access privileges.

The first step is to create a gdoc with a list of all the "classes" of EDA functions we have, e.g.,

  • data conversion
  • compute EDA statistics
  • transform
  • visualization

Which files contain interesting stuff? See the list of files above.

FYI @sonniki @gitpaulsmith

gpsaggese avatar Jun 26 '25 11:06 gpsaggese

I'll compile the list for the tutorials and helpers repo.

aangelo9 avatar Jun 26 '25 20:06 aangelo9

Sounds good. You can work with @Shaunak01 on this, since there can be even more code in cmamp in core, so you guys need to jump through some hoops across repos. We can do things incrementally and start from helpers.

What happens is that people write code in a local utils because they don't know where all the functions are. So we can:

  • either clean up after them after the reorg

  • or move the general code to //helpers

  • [ ] It might make sense to even do a grep of all the functions that use matplotlib in the code base, the ones that process a df, etc.
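
A rough sketch of such a scan, using Python's `ast` module instead of a literal grep so that whole functions are reported. The function name `find_candidate_functions` and the "has a parameter named `df`" heuristic are assumptions for illustration, not existing code:

```python
import ast


def find_candidate_functions(source: str) -> list[tuple[str, int, str]]:
    """Return (function name, line number, reason) for likely EDA functions."""
    tree = ast.parse(source)
    # Collect local names bound to matplotlib by imports, e.g. `plt`.
    plot_aliases = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] == "matplotlib":
                    plot_aliases.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] == "matplotlib":
                for alias in node.names:
                    plot_aliases.add(alias.asname or alias.name)
    found = []
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        # Names referenced anywhere inside the function body.
        used = {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}
        args = node.args
        params = {a.arg for a in args.posonlyargs + args.args + args.kwonlyargs}
        if used & plot_aliases:
            found.append((node.name, node.lineno, "uses matplotlib"))
        elif "df" in params:
            # Heuristic (assumption): a parameter named `df` is a dataframe.
            found.append((node.name, node.lineno, "takes a df"))
    return sorted(found, key=lambda item: item[1])
```

Running this over the text of every `.py` file in a repo would give a first-pass list to review by hand; the parameter-name heuristic will miss functions that name their dataframe argument differently.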

gpsaggese avatar Jun 26 '25 23:06 gpsaggese

Gdoc with list of EDA functions: https://docs.google.com/document/d/1ChZ8jvoHC0pUvQfAvs7XI-BjykIz1g8ARp3LMr4UvQA/edit.

aangelo9 avatar Jun 30 '25 23:06 aangelo9

We want to make a script that scrapes EDA functions, then gets the lines and docstring for each function.

Specs

Create generate_context.py to:

  • Scrape entire .py files
  • Extract function metadata: Function Type, Script Path, Function Name, Lines, and Docstring

Main Arguments

  • --in-file: Input markdown file containing partial metadata: Function Type, Script Path, Function Name.
  • --out-file: Output markdown file. Includes: Function Type, Script Path, Function Name, Lines, Docstring.
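
One possible shape for the core of generate_context.py, as a sketch only. It assumes the standard-library `ast` and `argparse` modules; the helper names `extract_function_info` and `build_parser` are hypothetical, and reading and writing the markdown files is left out because their exact layout is not specified here:

```python
import argparse
import ast


def extract_function_info(source: str, function_name: str) -> dict:
    """Return the line range and docstring of `function_name` in `source`."""
    for node in ast.walk(ast.parse(source)):
        is_func = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_func and node.name == function_name:
            return {
                "Function Name": node.name,
                # 1-based, inclusive: from the `def` line to the end of the body.
                "Lines": f"{node.lineno}-{node.end_lineno}",
                "Docstring": ast.get_docstring(node) or "",
            }
    raise ValueError(f"function {function_name!r} not found")


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI with the two main arguments from the spec."""
    parser = argparse.ArgumentParser(
        description="Complete EDA function metadata."
    )
    parser.add_argument(
        "--in-file", required=True, help="Markdown file with partial metadata."
    )
    parser.add_argument(
        "--out-file", required=True, help="Markdown file with completed metadata."
    )
    return parser
```

For each (Script Path, Function Name) pair taken from the input file, the script would read the file at Script Path and call `extract_function_info` on its text to fill in Lines and Docstring.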

FYI @gpsaggese @PranavShashidhara

aangelo9 avatar Aug 01 '25 15:08 aangelo9

Not sure about the OpenAI part. The type of each file is probably assigned by the user. In practice, the user creates a file with the file name, file type, and function name, and the script fills in the rest of the information by parsing the code.

Then the file is fed to an LLM to direct it.

Makes sense?

gpsaggese avatar Aug 01 '25 22:08 gpsaggese

Understood, so no need to automate scraping functions. Just grab the lines and docstring. @PranavShashidhara

aangelo9 avatar Aug 01 '25 22:08 aangelo9

Moving this task into https://github.com/causify-ai/tutorials/issues/629.

aangelo9 avatar Aug 20 '25 22:08 aangelo9