[R-263] Roadmap - v0.2
- [ ] #1010
- [ ] Reference free
- [ ] Generation
- [ ] Summarisation
- [ ] Code summary
- [ ] Textual summary
- [ ] Summarisation
- [ ] Generation
- [ ] With Reference
- [ ] #1220
- [ ] Text
- [ ] answer correctness
- [ ] Code
- [ ] SQL
- [ ] Text
- [ ] #1220
- [ ] Reference free
- [ ] #1011
- [ ] make ragas metrics deployable as a server
- [ ] make testset generation interactive with an API
- [ ] #1018
- [ ] #1012
- [ ] #1015
- [ ] #1016
- [ ] for RAG
- [ ] structured data
- [ ] unstructured data
- [ ] Agents simulations
- [ ] Based on predefined task & conditions
- [ ] State to persist knowledge graphs and results in test generation
- [ ] for RAG
- [ ] #1237
We lack chunk quality metrics as of today. It would be good to see some chunk quality evaluation metrics added.
hey @rajib76, thanks for chipping in 🙂
could you explain a bit more about how you're measuring quality here? maybe an example too if possible?
One of the hard problems in RAG today is determining the right chunk size. If a chunk talks about multiple concepts, it is very difficult to find the most relevant chunk for the question. I was looking for a metric that tells whether a chunk is atomic, i.e. talks about only one concept. The semantic chunking approach did not work because the embedding model itself has semantic dissonance.
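One rough way to probe the "atomicity" rajib describes is to embed the sentences of a chunk and check how tightly they cluster: a chunk about a single concept should have high average pairwise similarity. A minimal sketch, assuming sentence-transformers is available; the model name, the naive sentence split, and the scoring heuristic are all illustrative, not part of ragas:

```python
# Sketch: score chunk "atomicity" as the mean pairwise cosine similarity
# between its sentence embeddings (illustrative heuristic, not a ragas metric).
import itertools
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

def atomicity_score(chunk: str) -> float:
    sentences = [s.strip() for s in chunk.split(".") if s.strip()]
    if len(sentences) < 2:
        return 1.0  # a single sentence is trivially "one concept"
    emb = model.encode(sentences, normalize_embeddings=True)
    sims = [float(np.dot(emb[i], emb[j]))
            for i, j in itertools.combinations(range(len(sentences)), 2)]
    return float(np.mean(sims))  # closer to 1.0 -> more likely a single concept

print(atomicity_score("Ragas evaluates RAG pipelines. It ships several metrics."))
```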
@jjmachan @rajib76 metrics like chunk_attribution and chunk_utilization (as referenced here: https://docs.rungalileo.io/galileo/gen-ai-studio-products/guardrail-store/chunk-attribution) could help quantify chunk quality. We already have relevance scores (from vector DBs or keyword search engines) to measure chunk relevance with respect to the query, but metrics that quantify how much of the chunk was actually used would be helpful. I can take this up if you find them useful; I found it interesting, and it could help decide how many chunks to retrieve.
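For what it's worth, a very rough way to approximate chunk utilization without any external service is lexical overlap between the retrieved chunk and the generated answer. The function below is only an illustrative sketch; it is neither the Galileo metric nor part of ragas:

```python
# Sketch: approximate chunk utilization as the fraction of chunk tokens
# that also appear in the generated answer (purely lexical, illustrative only).
import re

def chunk_utilization(chunk: str, answer: str) -> float:
    tokenize = lambda text: set(re.findall(r"[a-z0-9]+", text.lower()))
    chunk_tokens = tokenize(chunk)
    if not chunk_tokens:
        return 0.0
    return len(chunk_tokens & tokenize(answer)) / len(chunk_tokens)

# A low score suggests the chunk was retrieved but barely used, which could
# inform how many chunks to retrieve.
print(chunk_utilization("Ragas is a library for evaluating RAG pipelines.",
                        "Ragas evaluates RAG pipelines."))
```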
This will be useful, Aakash. It will help in chunk tuning.
@jjmachan hey, thanks for the superb library and effort!
I want to use RAGAS to evaluate my open-source RAG application, which has its own custom chunker and retriever. Would you consider it feasible to add support for custom chunks in the synthetic data generator?
Right now I can't really use ragas fully, because I need to rely on the chunks generated by ragas instead of my own chunker.
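To make the request concrete: the workflow I'm hoping for is to wrap my pre-made chunks as documents and hand them straight to the generator, roughly like the sketch below. The generate_with_langchain_docs call in the comment is an assumption about the API, not the documented signature:

```python
# Sketch of the desired workflow: hand ragas pre-chunked text instead of
# letting it re-chunk the source documents.
from langchain_core.documents import Document

my_chunks = [
    "chunk one, produced by my own custom chunker ...",
    "chunk two ...",
]
docs = [Document(page_content=c, metadata={"source": "my_chunker"})
        for c in my_chunks]

# Then pass `docs` to the testset generator instead of raw files, e.g. something like
# generator.generate_with_langchain_docs(docs, testset_size=10)  # signature assumed
print(f"prepared {len(docs)} pre-chunked documents")
```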
hey @Twist333d - thanks for the kind words ❤️ we just revamped the testset generation piece for v0.2 - we just released a beta version, but the main version should be out next week
do you want to give that a go?
yep @jjmachan shoot it of course!
Btw, I've just set up RAGAS to be used with Weave, and another feature request came up - it would be great if you supported much easier integration with tracing & eval suites such as Weave by W&B.
Several more feature requests:
- Allow random or 'smart' sampling of documents used to generate questions. For example, I want to generate a test dataset for a set of documents; depending on the volume (1 page vs 10,000 pages), I want to be able to control how / where the questions come from (see the rough sampling sketch after this list)
- Async / parallel generation of embeddings
- Control over how many embeddings are generated for the input file to the dataset generator - as I understand it, right now it converts all nodes to embeddings, which might be too costly or unnecessary. For example, if I set test_size == 1, why does it convert all nodes to embeddings?
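As a stop-gap for the sampling point above, one can subsample the chunk list before handing it to the generator; a trivial workaround sketch, not a ragas feature:

```python
# Sketch: randomly subsample documents before testset generation so the
# question sources (and embedding cost) stay bounded. Workaround only,
# applied before calling the generator.
import random

def sample_documents(docs, max_docs: int, seed: int = 42):
    rng = random.Random(seed)  # fixed seed so the test set is reproducible
    if len(docs) <= max_docs:
        return list(docs)
    return rng.sample(docs, max_docs)

# e.g. cap a 10,000-page corpus to 200 chunks before generation:
# sampled = sample_documents(all_chunks, max_docs=200)
```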