Create an index for md files
Goal: Generate an index file listing all markdown files in the repository.
Steps:
- List all markdown files and organize them (hierarchically)
- Include metadata:
  - File name
  - Relative path
also handle #210
This is lower priority, since it's going to be difficult to automate efficiently. Let's start with ensuring that all links are valid and that all files are referred to in the README.
Review, merge
- [ ] @gpsaggese
- [ ] @tkpratardan Decision is to apply changes and merge, even if it's not complete
@aangelo9 the task here is to take the PR https://github.com/causify-ai/helpers/pull/237, which has already been reviewed, work in the associated branch, and (1) address the existing PR review, then (2) lead it to merge and complete the task.
FYI @gpsaggese @samarth9008
Yes, just reiterating to avoid confusion / frustration.
- Apply the changes requested in the PR by the previous reviewer. This way you can get familiar with the code
- Then we can decide where to go from here
- We could summarize each doc with ChatGPT (we have a pipeline for doing that)
On the other hand, the project of ChatGPT-ifying the documentation might make this issue superfluous. In any case, let's see where we are and then we'll decide.
@aangelo9 the way of doing things is to always think hard, make a proposal, and after the plan is clear and agreed upon, execute
Thanks for the clarification.
I'll go ahead and make the changes, and once that's done, I will reassess and make a proposal.
I have resolved most of the requested changes. The remaining requests need the code to run.
I have added a unit test that has not been run yet because it calls OpenAI. I just want to confirm whether I am allowed to run it, since it could cost money, or whether I should run a mock test that omits all OpenAI calls.
I also want to propose a change in the code:
- Instead of taking the git repo root and processing everything in it, we could add an argument to target a specific folder and process things per folder instead of per repo. This makes testing easier as well.
> I have resolved most of the requested changes. The remaining requests need the code to run.
When you think it's ready, go ahead and convert the PR from draft to "ready for review" and re-request review through the GH interface.
> I have added a unit test that has not been run yet because it calls OpenAI. I just want to confirm whether I am allowed to run it, since it could cost money, or whether I should run a mock test that omits all OpenAI calls.
There is a conversation about it in the PR. See my proposal in https://github.com/causify-ai/helpers/pull/237#issuecomment-2607306977.
> Instead of taking the git repo root and processing everything in it, we could add an argument to target a specific folder and process things per folder instead of per repo. This makes testing easier as well.
Sounds good. Maybe the repo root can be the default value for the argument.
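A sketch of what that argument could look like, with the repo root as the default value as suggested above (`parse_args` and the flag name `--dir` are assumptions, not the actual interface):

```python
import argparse
import subprocess


def _get_git_root() -> str:
    # Ask git for the repo root; used only when --dir is not given.
    return subprocess.check_output(
        ["git", "rev-parse", "--show-toplevel"], text=True
    ).strip()


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Refresh the README markdown index."
    )
    parser.add_argument(
        "--dir",
        default=None,
        help="Directory to process (default: the git repo root)",
    )
    args = parser.parse_args(argv)
    if args.dir is None:
        args.dir = _get_git_root()
    return args
```

Passing an explicit `--dir` makes it easy to point the script at a small test folder.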
I decided to remove --generate_summary and --update_summary args. So the code now just refreshes the README index for all markdown files.
Also added a --use_placeholder_summary arg to bypass OpenAI usage and run mock tests.
This is because when testing --generate_summary, adding and removing file index entries requires too much line indexing, which makes the code unnecessarily long. I also believe it's better practice to refresh the README index wholesale to keep the README up to date, since it's a once-in-a-while script for documentation.
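A minimal sketch of how such a placeholder mode might bypass the OpenAI call (`summarize_file` is a hypothetical helper, not the actual code in the PR):

```python
import os


def summarize_file(path: str, use_placeholder_summary: bool = False) -> str:
    """Return a one-line summary for `path`.

    When `use_placeholder_summary` is True, no OpenAI call is made, so
    tests run offline and at no cost.
    """
    if use_placeholder_summary:
        return f"Placeholder summary for {os.path.basename(path)}"
    # The real mode would call the OpenAI API here; omitted in this sketch.
    raise NotImplementedError("OpenAI summarization is not shown in this sketch")
```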
Commented here. In short, IMO it's useful to have a mode to only add new files while running the script. Could you please clarify what line indexing you're referring to? It should be a matter of (1) getting all the md files, (2) getting all the md files mentioned in the readme, (3) identifying the difference between 1 and 2, (4) appending summaries of the new files to the end of the readme file.
Now that I wrote this, it would also be useful to be able to remove outdated md references from the readme, for situations when we delete or rename a doc. These files should also pop up in step (3).
# Repository README
## Markdown Index
This section lists all Markdown files in the repository.
### tmp.scratch
- **File Name**: welcome.md
  **Relative Path**: [welcome.md](welcome.md)
  **Summary**: Placeholder summary for welcome.md
### docs
- **File Name**: intro.md
  **Relative Path**: [docs/intro.md](docs/intro.md)
  **Summary**: Placeholder summary for intro.md
### docs/guide
- **File Name**: setup.md
  **Relative Path**: [docs/guide/setup.md](docs/guide/setup.md)
  **Summary**: Placeholder summary for setup.md
- **File Name**: usage.md
  **Relative Path**: [docs/guide/usage.md](docs/guide/usage.md)
  **Summary**: Placeholder summary for usage.md
The initial readme generation creates subheaders for each folder in the repo and places each file accordingly. To make it neat, the respective documents are placed under their subheaders. This is where line indexing happens when adding or removing.
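To illustrate the per-folder subheader layout described above, here is a rough sketch (the function name and exact formatting are my own; summaries omitted for brevity):

```python
from collections import defaultdict
from pathlib import Path


def build_index(rel_paths: list) -> str:
    """Render the index with one `###` subheader per folder, mirroring
    the README layout shown above."""
    by_dir = defaultdict(list)
    for rel in sorted(rel_paths):
        # Group each file under its parent directory.
        by_dir[Path(rel).parent.as_posix()].append(rel)
    lines = []
    for folder in sorted(by_dir):
        lines.append(f"### {folder}")
        for rel in by_dir[folder]:
            lines.append(f"- **File Name**: {Path(rel).name}")
            lines.append(f"  **Relative Path**: [{rel}]({rel})")
    return "\n".join(lines)
```

This grouping is what forces the line indexing: adding or removing a file means locating the right subheader block and splicing lines inside it.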
Changes to logic:
- I will implement addition and removal of files, and join them together as refresh.
- Change `--generate_summary` to create the README when it does not exist.
- Add an additional argument for the ChatGPT model type.
> The initial readme generation creates subheaders for each folder in the repo and places each file accordingly. To make it neat, the respective documents are placed under their subheaders. This is where line indexing happens when adding or removing.
Oh, I see. IMO we can drop the dir subheaders to simplify. We can sort the file paths to make sure files from the same subdirs are located close to each other. There will still be a little bit of complexity when we need to find the correct place to insert a new file but it's the matter of finding where it would fit in an alphabetical order and then splitting the existing contents of the doc where it will be added, adding it and then joining the pieces back together. WDYT?
I agree. We can drop the subheaders so that when refresh is called, we extract the existing README, check for files, add/remove index entries, and write the compiled index back to the README file.
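The sorted-insert approach described above could be sketched as (`insert_sorted` is a hypothetical helper; in the real script the entries would be multi-line blocks rather than bare paths):

```python
import bisect


def insert_sorted(index_paths: list, new_path: str) -> list:
    """Insert `new_path` into the flat, alphabetically sorted index.

    With no per-directory subheaders, finding the insertion point is a
    single bisect: split the existing entries at that point, add the new
    one, and join the pieces back together.
    """
    pos = bisect.bisect_left(index_paths, new_path)
    return index_paths[:pos] + [new_path] + index_paths[pos:]
```

Because the paths are sorted, files from the same subdirectory still end up next to each other.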
Ok with me, but I wouldn't try to make it too perfect, since the LLM search might make human-oriented documentation less important.
@sonniki PR ready to be reviewed.
> @sonniki PR ready to be reviewed.
Okay. As per the docs, please "re-request review" through the GH interface when the PR is ready for a new round of review.
My apologies, I won't make the same mistake again.
@sonniki Could I get assigned a new issue while waiting for the review?
All done