GSoC
Create Chat Bot Interface Trained On Documentation Site
Background:
- cBioPortal: cBioPortal is an open-source platform for cancer genomics data analysis and visualization. It provides a centralized resource for exploring and analyzing large-scale cancer genomic data sets, including genomic alterations, gene expression, and clinical information. The platform integrates data from multiple sources, including The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), and makes it available through a web interface for researchers, clinicians, and the general public. Please refer to the cBioPortal home page for an overview.
- cBioPortal has extensive documentation (https://docs.cbioportal.org/) on how to (1) install and configure cBioPortal locally, (2) use cBioPortal as a user, and (3) use the API programmatically. Searching through the documentation is not always straightforward, and we often get questions on the user group (https://groups.google.com/g/cbioportal) that we mainly answer by pointing to a link in the docs. A chat interface might be a good way to give users quicker feedback on what they are searching for.
Goal:
- Build a chat bot interface for cBioPortal's Documentation
Approach:
- Train the model on our documentation site (here is an example blog)
- Also train the chatbot based on the google group conversations we had
- Evaluate different models for this purpose
- Integrate a chat interface into the main website (350h project)
Needed skills: Familiarity with the command line and the use of APIs
Possible mentors: @inodb @walleXD
Hey @inodb, I think my skills are a good match for this project. Can you assign it to me?
I know I am new to this, but I am currently building a personal virtual assistant in Python for my minor project 2 in college, and I have a good command of Java too. I am an AIML student, and that knowledge will help me train the model for your chatbot. I have a good command of Python, Java, machine learning, NLP, and AI algorithms. Since my minor project 2 is still in progress, I am attaching my documentation and code so far for reference: MINOR.docx, synopsis presentation short.pptx, Software Requirements Specification.docx
Hi! I’m Kamran Ayesh, a final-year CSE student at Indian Institute of Information Technology Guwahati, India. I have written a detailed proposal for the chatbot interface trained on the documentation site, and I am hoping for feedback or questions from you soon. I am well suited to contribute to this project because I built a virtual assistant with a robust UI during my internship. As a developer, I expect this project to enhance my skills and give me better exposure to open source.
Looking forward to contributing!
Thanks, Kamran Ayesh
I'm interested in helping to build the chatbot. I am Nisarg Patel, a 2nd-year CSE university student, and I would like to contribute to building it. I am new at this, but I am ready to learn and help with the cause, and this will help me improve.
Looking forward to your response!
Thanks, Nisarg Patel
Hello, I'm interested in helping to develop this chatbot. How can I apply as a GSoC contributor? Please let me know.
Hey.. is this thing done or not? I have good experience making chatbots and I am also good with the MERN stack, so I can even integrate it with your website.
On Sat, 18 Nov 2023, 6:50 am j4m3s 4l4r1c, @.***> wrote:
Euh... Sorry but what are you talking about?
On Fri, Nov 17, 2023, 15:43 Vidit Jain @.***> wrote:
Hey.. is this thing done? If not I can still make it.
Hi, I am working on a similar use case for my pipeline, where I am building a chatbot that searches through the documents in my pipeline. I would really like to work on the above issue.
Hi @inodb , I'm a CS grad at NYU with a solid grasp of ML, PyTorch, and CLI. I've worked on LLMs for zero-shot classification on food ingredient data. I believe using LLMs for retrieval augmented generation is highly applicable to your use-case. How would you advise me to get started on this?
Hello @inodb, I feel we can use the Retrieval-Augmented Generation (RAG) technique instead of fine-tuning or training. Since the documentation or knowledge base gets updated now and then, repeatedly fine-tuning the LLM could be costly. Moreover, RAG is more reliable here because it answers from a retrieval index that can be rebuilt whenever the docs change, so its knowledge stays up to date. I'm a CS grad at UCM with a huge interest in LLMs and Generative AI. I would like to work on this issue. Could you give me some leads?
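To make the RAG idea above concrete, here is a minimal sketch of the retrieve-then-prompt step. It is only an illustration under stated assumptions: the chunk texts and `docs.example.org` URLs are placeholders rather than real documentation pages, word overlap stands in for an embedding-based retriever, and the resulting prompt would still have to be sent to whichever LLM ends up being chosen.

```python
import re
from collections import Counter

# Hypothetical documentation chunks; texts and URLs are placeholders only.
CHUNKS = [
    {"url": "https://docs.example.org/install",
     "text": "Install cBioPortal locally with Docker and configure the database."},
    {"url": "https://docs.example.org/api",
     "text": "Use the web API to query studies and samples programmatically."},
    {"url": "https://docs.example.org/user-guide",
     "text": "Explore studies in the user interface and view mutation plots."},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(question, chunks, k=2):
    """Return up to k chunks that share at least one word with the question.

    Word overlap is a stand-in for a real embedding similarity search.
    """
    q = Counter(tokenize(question))
    scored = []
    for chunk in chunks:
        c = Counter(tokenize(chunk["text"]))
        overlap = sum((q & c).values())  # size of the multiset intersection
        scored.append((overlap, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored[:k] if score > 0]


def build_prompt(question, chunks):
    """Assemble a prompt that grounds the model in the retrieved chunks."""
    context = "\n\n".join(
        f"[{i}] {c['text']} (source: {c['url']})"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below "
        "and cite sources by number.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


question = "How do I install cBioPortal locally?"
top_chunks = retrieve(question, CHUNKS)
prompt = build_prompt(question, top_chunks)
```

When the documentation changes, only `CHUNKS` (the index) needs to be rebuilt; the model itself stays untouched, which is the cost argument made above.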
Hey all! I am Ilan, a Data Science grad from the Technion. I would love to contribute to this project.
@inodb As a first step, I wanted to ask if you have already thought about how you are going to structure the documentation as data for training. If so, I would love to get an example. If not, I think that could be a good step to begin with. I would also like to know if it's possible to share the documentation in some easy-to-work-with format that you might have on the backend. If not, I can just scrape it straight from the webpage.
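As one possible answer to the structuring question, here is a rough sketch that turns a documentation page into one record per section. It assumes the documentation source can be obtained as Markdown text, which has not been confirmed in this thread; the `split_markdown` helper, the record fields, and the sample file name are all hypothetical.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown(source, text):
    """Split Markdown text into records, one per heading-delimited section.

    Caveat: lines starting with '#' inside fenced code blocks would also be
    treated as headings; a real pipeline would need to skip code fences.
    """
    records = []
    title, lines = "(untitled)", []

    def flush():
        body = "\n".join(lines).strip()
        if body:  # skip sections that contain no text
            records.append({"source": source, "title": title, "text": body})

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return records


# Hypothetical sample page, not taken from the real documentation.
SAMPLE = """# Deploying
Intro text.

## With Docker
Run the container.

## Empty section
"""

records = split_markdown("deploy.md", SAMPLE)
titles = [r["title"] for r in records]
```

Each record keeps its `source`, so whatever is built on top (training data, a search index, or a RAG retriever) can point back to the page it came from.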
Anyway, would love to get some suggestions on what should be the first steps to start getting familiar with the project.
Thanks Ilan Meissonnier
BTW, the Medium link given as the example blog is member-only :(. The following blog seems pretty similar (hard to tell, as I couldn't read the original LOL). Hope this is helpful.
Ilan
Hey all! I have been thinking about this project a bit and I have some interesting thoughts I'd like to share...
If I were using a chatbot to help me navigate documentation, I would prefer that it provide a link to the documentation page where it learned the info from. That way I can fact-check it and/or read further into the problem I'm having. As we know, LLMs are not always accurate and can sometimes be quite confident even when wrong. While it may be possible to train the chatbot to return a link as well as answer a question (by structuring the training data accordingly), this task might be solved more simply with traditional information retrieval techniques, i.e., retrieving the page that best matches a user query from a search bar (I have noticed that the search bar on the documentation webpage is not functional atm). This of course gets more complicated if you want to include answers from the Google Group conversations, but this approach should definitely be considered. Another option might be to combine both approaches in some way, although we would need to decide exactly how to do that.
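To illustrate the traditional retrieval option, here is a small sketch that ranks pages by TF-IDF cosine similarity and returns the link of the best match. The page texts and `docs.example.org` URLs are made up for the example, and the smoothed IDF formula is just one common choice, not something decided for this project.

```python
import math
import re
from collections import Counter

# Hypothetical pages; URLs and texts are placeholders, not real doc pages.
PAGES = {
    "https://docs.example.org/deployment":
        "deploy and install the portal locally using docker",
    "https://docs.example.org/web-api":
        "query studies through the rest api with a client",
    "https://docs.example.org/user-guide":
        "user guide for exploring studies and plots in the portal",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Compute smoothed IDF weights and a TF-IDF vector for every page."""
    docs = {url: Counter(tokenize(text)) for url, text in pages.items()}
    n = len(docs)
    df = Counter(term for counts in docs.values() for term in counts)
    idf = {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}
    vectors = {
        url: {term: tf * idf[term] for term, tf in counts.items()}
        for url, counts in docs.items()
    }
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def best_page(query, idf, vectors):
    """Return (url, score) of the best-matching page, or (None, 0.0)."""
    counts = Counter(tokenize(query))
    q_vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scores = {url: cosine(q_vec, vec) for url, vec in vectors.items()}
    url = max(scores, key=scores.get)
    return (url, scores[url]) if scores[url] > 0 else (None, 0.0)


idf, vectors = build_index(PAGES)
url, score = best_page("how do I use the rest api", idf, vectors)
```

Because the result is a page link rather than generated text, the user can always verify the answer at the source. Returning `None` for queries with no matching terms leaves room to fall back to another approach.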
Would love to hear what everyone thinks about this, or if there might be something I'm missing. Would specifically love to hear your insights on this @inodb.
Sorry for the long post, Ilan Meissonnier
Hey!
I am Khavin. I am an Artificial Intelligence (AI) student currently pursuing a dual Bachelor of Science in data science at the Indian Institute of Technology Madras (IIT Madras) and Sathyabama Institute of Science and Technology. I have a strong interest in machine learning, artificial intelligence, and neuromorphic computing.
I have extensive experience working with PyTorch and have successfully built chatbots using Google AI Studio. This has given me some experience in how to train and build chatbots. I think this experience is useful for this application, and the project would give me further experience with real-world applications of AI.
Looking forward to contributing to open source!
Regards, Khavin S
Hey all, I have made a prototype for a chatbot using RAG. I think RAG could be a pretty good approach for this project. I'm sharing the prototype as a link to a Kaggle notebook. If you are interested, please leave any feedback you may have.
https://www.kaggle.com/code/ilanmeissonnier/rag-for-cbioportal-documentation-chatbot
Ilan Meissonnier
I have also come across a research paper that came out a few days ago suggesting a method called Retrieval Augmented Fine Tuning (RAFT). I am still not done reading through it, but it already seems like it could be a really good approach for this.
Link to the paper 😅