Project: Cyber News Summarizer and Chatbot

Open FRAMEEE17 opened this issue 1 year ago • 4 comments

Project Name

LLM-based cyber security news summarizer and chatbot

Description

This app is a cybersecurity-focused news summarizer and chatbot designed for Security Operations Centers (SOCs). It draws on several cybersecurity data sources, including:

  • Cybersecurity news from platforms like The Hacker News, Dark Reading, and Security Affairs
  • Splunk manuals, including documentation for Splunk Enterprise Security and Splunk SOAR
  • CrowdStrike Global Threat Report 2024, which provides comprehensive analysis of global cyber threat trends

The app implements a retrieval-augmented generation (RAG) approach using several AI models: Solar-1-mini-chat for conversations, the Solar-embedding models for query and document embedding, and Solar-1-mini-groundedness-check for groundedness checking. It also uses GPT-4 for news summarization. The app benefits SOC teams by providing them with:

  • Access to the latest cybersecurity news
  • Interaction with a specialized cybersecurity chatbot
  • The ability to train the chatbot with specific cybersecurity knowledge

Data Ingestion:

  1. Web Scraping and Preprocessing:
     • Scrape cybersecurity news articles from specified websites
     • Clean HTML tags and extract relevant text
     • Split long articles into smaller chunks for better retrieval
     • Generate metadata (e.g., publication date, source, category)
  2. PDF Document Processing:
     • Extract text from Splunk manuals and CrowdStrike reports (PDF format)
     • Segment long documents into smaller, coherent sections
  3. Embedding Generation:
     • Use Solar-embedding-1-large-passage to generate embeddings for each document chunk
     • Store embeddings in a vector database for efficient similarity search
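
The preprocessing steps above (strip HTML, split into chunks, attach metadata) can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the names `html_to_text`, `split_into_chunks`, `build_chunks`, and `Chunk` are hypothetical, the word-window sizes are arbitrary, and splitting on whitespace is an assumption (the issue does not say which splitter is used). Each resulting chunk would then be passed to the embedding model.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the text content of an HTML fragment as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def split_into_chunks(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word windows of at most max_words that share `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def build_chunks(html: str, source: str, published: str, category: str,
                 max_words: int = 200, overlap: int = 40) -> list[Chunk]:
    """Clean an article, chunk it, and attach the metadata used at retrieval time."""
    pieces = split_into_chunks(html_to_text(html), max_words, overlap)
    return [
        Chunk(piece, {"source": source, "published": published,
                      "category": category, "chunk_index": i})
        for i, piece in enumerate(pieces)
    ]
```

Overlapping windows keep a sentence that straddles a boundary retrievable from either neighbouring chunk, at the cost of storing some text twice.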

Prompting Flow:

  1. Query Understanding:
     • Analyze user input to identify key cybersecurity concepts and intents
  2. Contextual Retrieval:
     • Use Solar-embedding-1-large-query to embed the user's query
     • Perform hybrid search (a combination of semantic and keyword search) to retrieve relevant passages from our vector database
     • Use agentic RAG with LangGraph to retrieve relevant passages from our vector database
  3. Dynamic Prompt Construction: Construct a prompt that includes:
     • The user's question
     • Retrieved relevant context
     • Instructions for the model to focus on cybersecurity-specific information
     • Any relevant metadata (e.g., "Answer based on the latest threat intelligence from 2024")
  4. Groundedness Check: After generating a response, use Solar-1-mini-groundedness-check to verify the answer against the retrieved context. If inconsistencies are found, reformulate the response or add feedback.
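
The flow above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the issue does not say how the semantic and keyword results are merged, so reciprocal rank fusion is swapped in here as one common choice. `generate` and `is_grounded` are hypothetical callables standing in for calls to Solar-1-mini-chat and Solar-1-mini-groundedness-check, and all function names and the prompt wording are made up for the example.

```python
from typing import Callable, Optional


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def build_prompt(question: str, passages: list[str], note: Optional[str] = None) -> str:
    """Assemble instructions, optional metadata note, context, and the question."""
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    lines = [
        "You are a cybersecurity assistant for SOC analysts.",
        "Answer using only the context below and focus on "
        "cybersecurity-specific information.",
    ]
    if note:
        lines.append(note)
    lines += ["", "Context:", context, "", f"Question: {question}"]
    return "\n".join(lines)


def answer_with_groundedness_check(
    question: str,
    passages: list[str],
    generate: Callable[[str], str],           # hypothetical chat-model call
    is_grounded: Callable[[str, str], bool],  # hypothetical (context, answer) check
    max_attempts: int = 2,
) -> tuple[str, bool]:
    """Generate an answer and retry with feedback until the check passes."""
    prompt = build_prompt(question, passages)
    context = "\n\n".join(passages)
    answer = ""
    for _ in range(max_attempts):
        answer = generate(prompt)
        if is_grounded(context, answer):
            return answer, True
        # Add feedback so the next attempt stays closer to the context.
        prompt += ("\n\nThe previous answer was not supported by the context. "
                   "Answer again using only the context.")
    return answer, False  # caller can flag the answer as unverified
```

Passing the model calls in as plain callables keeps the control flow testable without network access; in the real app they would wrap the corresponding API clients.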

Technology & Languages

  • [ ] JavaScript
  • [ ] Java
  • [ ] .NET
  • [X] Python
  • [ ] AI Studio
  • [ ] AI Search
  • [ ] PostgreSQL
  • [ ] Cosmos DB
  • [ ] Azure SQL

Project Repository URL

https://github.com/FRAMEEE17/MICROSOFT-RAG-HACK-WOLFARE

Deployed Endpoint URL

No response

Project Video

https://drive.google.com/drive/folders/1Bt2xAKujAjaXirt0-G5aMCsVrSPyE5qu

Team Members

FRAMEEE17, Mhonns

FRAMEEE17, Sep 13 '24 19:09

Hello @frameee17, thank you for participating in RAG Hack!

The team is working hard to distribute badges. Please have each team member fill out this form: aka.ms/raghack/badge-dist

Thank you!

multispark, Oct 23 '24 01:10

@multispark, when will we receive the badges? We filled out the form a week ago.

FRAMEEE17, Nov 21 '24 14:11

Hello, we are working on distributing badges, but there have been some technical difficulties. Thank you for your patience.

multispark, Nov 25 '24 21:11

We are still waiting for the badge, and I need it because I am applying for internships now. Could you speed up the process for me? Thank you.

FRAMEEE17, Dec 05 '24 09:12