Create an index for md files
Goal: Generate an index file listing all markdown files in the repository.
Steps:
- List all markdown files and organize them (hierarchically)
- Include metadata:
  - File name
  - Relative path
also handle #210
This is lower priority, since it's going to be difficult to automate efficiently. Let's start with ensuring that all links are valid and that all files are referred to in the README.
Review, merge
- [ ] @gpsaggese
- [ ] @tkpratardan Decision is to apply changes and merge, even if it's not complete
@aangelo9 the task here is to take the PR https://github.com/causify-ai/helpers/pull/237, which has already been reviewed, work in the associated branch, and (1) address the existing PR review, then (2) lead it to merge and complete the task.
FYI @gpsaggese @samarth9008
Yes, just reiterating to avoid confusion / frustration.
- Apply the changes requested in the PR by the previous reviewer. This way you can get familiar with the code
- Then we can decide where to go from here
- We could summarize each doc with ChatGPT (we have a pipeline for doing that)
On the other hand, the project of ChatGPT-ifying the documentation might make this issue superfluous. In any case, let's see where we are and then we'll decide.
@aangelo9 the way of doing things is to always think hard, make a proposal, and after the plan is clear and agreed upon, execute
Thanks for the clarification.
I'll go ahead and make the changes, and once that's done, I will reassess and make a proposal.
I have resolved most of the requested changes. The remaining requests need the code to run.
I have added a unit test that has not been run yet because it calls OpenAI. I just want to confirm whether I am allowed to run it, since it could cost money, or whether I should run a mock test that omits all OpenAI calls.
I also want to propose a change in the code:
- Instead of taking the git repo root and processing everything in it, we could add an argument to target a specific folder and process things per folder instead of per repo. This makes testing easier as well.
> I have resolved most of the requested changes. The remaining requests need the code to run.
When you think it's ready, go ahead and convert the PR from draft to "ready for review" and re-request review through the GH interface.
> I have added a unit test that has not been run yet because it calls OpenAI. I just want to confirm whether I am allowed to run it, since it could cost money, or whether I should run a mock test that omits all OpenAI calls.
There is a conversation about it in the PR. See my proposal in https://github.com/causify-ai/helpers/pull/237#issuecomment-2607306977.
> Instead of taking the git repo root and processing everything in it, we could add an argument to target a specific folder and process things per folder instead of per repo. This makes testing easier as well.
Sounds good. Maybe the repo root can be the default value for the argument.
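A sketch of what that argument could look like, with the repo root as the default value as suggested above (`parse_args` and the flag name `--dir` are assumptions, not the actual interface):

```python
import argparse
import subprocess


def _get_git_root() -> str:
    # Ask git for the repo root; used only when --dir is not given.
    return subprocess.check_output(
        ["git", "rev-parse", "--show-toplevel"], text=True
    ).strip()


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Refresh the README markdown index."
    )
    parser.add_argument(
        "--dir",
        default=None,
        help="Directory to process (default: the git repo root)",
    )
    args = parser.parse_args(argv)
    if args.dir is None:
        args.dir = _get_git_root()
    return args
```

Passing an explicit `--dir` makes it easy to point the script at a small test folder.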
I decided to remove --generate_summary and --update_summary args. So the code now just refreshes the README index for all markdown files.
Also added a --use_placeholder_summary arg to bypass OpenAI usage and run mock tests.
This is because when testing --generate_summary, adding and removing file index entries requires too much line indexing, which makes the code unnecessarily long. I also believe it's better practice to refresh the README index wholesale to keep the README up to date, since it's a once-in-a-while script for documentation.
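A minimal sketch of how such a placeholder mode might bypass the OpenAI call (`summarize_file` is a hypothetical helper, not the actual code in the PR):

```python
import os


def summarize_file(path: str, use_placeholder_summary: bool = False) -> str:
    """Return a one-line summary for `path`.

    When `use_placeholder_summary` is True, no OpenAI call is made, so
    tests run offline and at no cost.
    """
    if use_placeholder_summary:
        return f"Placeholder summary for {os.path.basename(path)}"
    # The real mode would call the OpenAI API here; omitted in this sketch.
    raise NotImplementedError("OpenAI summarization is not shown in this sketch")
```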
Commented here. In short, IMO it's useful to have a mode to only add new files while running the script. Could you please clarify what line indexing you're referring to? It should be a matter of (1) getting all the md files, (2) getting all the md files mentioned in the readme, (3) identifying the difference between 1 and 2, (4) appending summaries of the new files to the end of the readme file.
Now that I wrote this, it would also be useful to be able to remove outdated md references from the readme, for situations when we delete or rename a doc. These files should also pop up in step (3).
# Repository README
## Markdown Index
This section lists all Markdown files in the repository.
### tmp.scratch
- **File Name**: welcome.md
  **Relative Path**: [welcome.md](welcome.md)
  **Summary**: Placeholder summary for welcome.md
### docs
- **File Name**: intro.md
  **Relative Path**: [docs/intro.md](docs/intro.md)
  **Summary**: Placeholder summary for intro.md
### docs/guide
- **File Name**: setup.md
  **Relative Path**: [docs/guide/setup.md](docs/guide/setup.md)
  **Summary**: Placeholder summary for setup.md
- **File Name**: usage.md
  **Relative Path**: [docs/guide/usage.md](docs/guide/usage.md)
  **Summary**: Placeholder summary for usage.md
The initial readme generation creates subheaders for each folder in the repo and places each file accordingly. To make it neat, the respective documents are placed under their subheaders. This is where line indexing happens when adding or removing.
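To illustrate the per-folder subheader layout described above, here is a rough sketch (the function name and exact formatting are my own; summaries omitted for brevity):

```python
from collections import defaultdict
from pathlib import Path


def build_index(rel_paths: list) -> str:
    """Render the index with one `###` subheader per folder, mirroring
    the README layout shown above."""
    by_dir = defaultdict(list)
    for rel in sorted(rel_paths):
        # Group each file under its parent directory.
        by_dir[Path(rel).parent.as_posix()].append(rel)
    lines = []
    for folder in sorted(by_dir):
        lines.append(f"### {folder}")
        for rel in by_dir[folder]:
            lines.append(f"- **File Name**: {Path(rel).name}")
            lines.append(f"  **Relative Path**: [{rel}]({rel})")
    return "\n".join(lines)
```

This grouping is what forces the line indexing: adding or removing a file means locating the right subheader block and splicing lines inside it.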
Changes to logic:
- I will implement addition and removal of files, and join them together as refresh.
- Change `--generate_summary` to create the README when it does not exist.
- Add an additional argument for the ChatGPT model type.
> The initial readme generation creates subheaders for each folder in the repo and places each file accordingly. To make it neat, the respective documents are placed under their subheaders. This is where line indexing happens when adding or removing.
Oh, I see. IMO we can drop the dir subheaders to simplify. We can sort the file paths to make sure files from the same subdirs are located close to each other. There will still be a little bit of complexity when we need to find the correct place to insert a new file but it's the matter of finding where it would fit in an alphabetical order and then splitting the existing contents of the doc where it will be added, adding it and then joining the pieces back together. WDYT?
I agree. We can drop the subheaders so that when refresh is called, we extract the existing README, check for files, add/remove index entries, and write the compiled index back to the README file.
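The sorted-insert approach described above could be sketched as (`insert_sorted` is a hypothetical helper; in the real script the entries would be multi-line blocks rather than bare paths):

```python
import bisect


def insert_sorted(index_paths: list, new_path: str) -> list:
    """Insert `new_path` into the flat, alphabetically sorted index.

    With no per-directory subheaders, finding the insertion point is a
    single bisect: split the existing entries at that point, add the new
    one, and join the pieces back together.
    """
    pos = bisect.bisect_left(index_paths, new_path)
    return index_paths[:pos] + [new_path] + index_paths[pos:]
```

Because the paths are sorted, files from the same subdirectory still end up next to each other.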
Ok with me, but I wouldn't try to make it too perfect, since the LLM search might make human-oriented documentation less important.
@sonniki PR ready to be reviewed.
> @sonniki PR ready to be reviewed.
Okay. As per the docs, please "re-request review" through the GH interface when the PR is ready for a new round of review.
My apologies, I won't make the same mistake again.
@sonniki Could I get assigned a new issue while waiting for the review?
All done