
Process unstructured data sources

Open pombredanne opened this issue 5 years ago • 7 comments

These contain valuable data nuggets among an ocean of junk and we need to be able to find the good things there.

Some sources are:

  • mailing lists such as:
    • https://github.com/nexB/vulnerablecode/issues/100
    • https://github.com/nexB/vulnerablecode/issues/104
    • https://github.com/nexB/vulnerablecode/issues/108
  • changelogs https://github.com/nexB/vulnerablecode/issues/233
  • reflogs of commits (see also the commits from vulncodedb and the SAP/Eclipse Steady KB)
  • bug and issue trackers (such as Django, etc)
  • actual description of a CVE or the text body of advisories. See https://github.com/nexB/vulnerablecode/issues/551

We can either automate it all, which is going to be very difficult, or instead start by building a curation queue and parsing as much as we can so that curation by humans is easy:

  • https://github.com/nexB/vulnerablecode/issues/218

... and progressively also improve some mini AI and classification to help further automate the work.
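To make the curation queue idea concrete, here is a minimal sketch, not an actual VulnerableCode design. The `score` heuristic, the keyword list, and the `Candidate` and `CurationQueue` names are all assumptions made up for illustration. Text that shows no signal is dropped, and the rest is handed to a human curator most-promising-first.

```python
import heapq
import re
from dataclasses import dataclass, field

# Hypothetical signals; a real queue would tune these against curated data.
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)
KEYWORDS = ("security", "vulnerability", "overflow", "injection", "xss")


def score(text):
    """Crude relevance score: CVE ids weigh most, each keyword adds one."""
    lowered = text.lower()
    return 5 * len(CVE_RE.findall(text)) + sum(kw in lowered for kw in KEYWORDS)


@dataclass(order=True)
class Candidate:
    priority: int  # negated score, because heapq pops the smallest item first
    source_url: str = field(compare=False)
    text: str = field(compare=False)


class CurationQueue:
    """Holds candidate snippets for human review, best score first."""

    def __init__(self):
        self._heap = []

    def add(self, source_url, text):
        """Queue the snippet if it shows any signal; return whether it was kept."""
        value = score(text)
        if value == 0:
            return False
        heapq.heappush(self._heap, Candidate(-value, source_url, text))
        return True

    def pop(self):
        """Return the most promising candidate still in the queue."""
        return heapq.heappop(self._heap)

    def __len__(self):
        return len(self._heap)
```

The classifier mentioned above could later replace `score` without changing the queue itself.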

pombredanne avatar Sep 10 '20 09:09 pombredanne

A reference: https://hal.science/hal-03430826/document

AyanSinhaMahapatra avatar Feb 09 '23 17:02 AyanSinhaMahapatra

Interested in the Project Idea...

ThePhilosopher4097 avatar Mar 27 '23 12:03 ThePhilosopher4097

Interested in the project idea. I think the processing of changelogs, reflogs of commits, and mailing list data can be automated.

ThePhilosopher4097 avatar Mar 27 '23 12:03 ThePhilosopher4097

Please also check: https://github.com/cve-search/git-vuln-finder

TG1999 avatar Jan 23 '24 16:01 TG1999

https://github.com/pyupio/changelogs

TG1999 avatar Jan 23 '24 16:01 TG1999

I think processing the change logs of the Apache mailing list can be automated using OpenAI's API or other open source LLMs: we scrape the data using Selenium, feed it into the LLM, get the output in JSON format, and then update the database accordingly. What is your view on that, @pombredanne? #218 can also be implemented.
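One step of that pipeline that can be sketched without picking a scraper or an LLM provider is checking the model's JSON output before anything touches the database. The field names (`package`, `affected_versions`, `summary`) and the `parse_llm_output` helper are hypothetical, chosen only for illustration. They are not a VulnerableCode schema.

```python
import json

# Hypothetical schema for one extracted record: field name -> expected type.
REQUIRED_FIELDS = {"package": str, "affected_versions": list, "summary": str}


def parse_llm_output(raw):
    """Return a validated record, or None if the model output is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected_type):
            return None
    if not all(isinstance(v, str) for v in data["affected_versions"]):
        return None
    # Keep only the known fields so unexpected keys never reach the database.
    return {name: data[name] for name in REQUIRED_FIELDS}
```

Records that come back as `None` could be sent to the curation queue from #218 instead of being discarded.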

ykodwani01 avatar Mar 15 '24 18:03 ykodwani01

Automating the extraction of valuable information from Apache mailing list changelogs using OpenAI's API and other tools is a great initiative. For the unstructured data, I think we can focus primarily on building a dataset for feature engineering, with the items classified into diverse groups.

Model Training: Fine-tune the selected model on a prepared dataset of CVEs in code. This will help the model learn to identify vulnerabilities in the unstructured data. We can also use LoRA to train the model.

Vulnerability Detection: Use the trained model to parse through the unstructured data and identify potential vulnerabilities. This could involve using NLP techniques to understand the vulnerability descriptions and infer the vulnerable package name and versions.

The most important task to get right is Text Classification: this involves categorizing text into predefined groups. In vulnerability detection, this could be used to classify descriptions as either indicating a vulnerability or not.
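As a baseline for that binary classification, here is a small sketch using a multinomial naive Bayes classifier with add-one smoothing, written in plain Python. This is a stand-in technique, not the fine-tuned model described above. The `NaiveBayes` class and the four training sentences are made up for illustration and are far too small for real use.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, samples):
        for text, label in samples:
            self.doc_counts[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocab.add(token)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            log_prob = math.log(n_docs / total_docs)  # class prior
            for token in tokenize(text):
                count = self.word_counts[label][token]
                log_prob += math.log((count + 1) / (total_words + len(self.vocab)))
            if log_prob > best_score:
                best_label, best_score = label, log_prob
        return best_label


# Toy, hand-written training data (an assumption, not a real dataset).
TOY_SAMPLES = [
    ("fix buffer overflow in parser", "vulnerability"),
    ("patch sql injection in login form", "vulnerability"),
    ("update documentation and readme", "other"),
    ("bump version and update changelog", "other"),
]

classifier = NaiveBayes()
classifier.fit(TOY_SAMPLES)
print(classifier.predict("heap overflow in image parser"))
```

A baseline like this gives a number to beat before spending effort on fine-tuning.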

Information Extraction: This is the process of automatically extracting structured information from unstructured text data.
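The simplest form of this can be sketched with regular expressions that pull CVE identifiers and dotted version numbers out of free text. The `extract` helper and both patterns are assumptions for illustration. Real advisories would need package-name detection and much more careful handling of version ranges.

```python
import re

# CVE ids look like CVE-<year>-<sequence of four or more digits>.
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b")
# Naive dotted version pattern such as 1.2 or 1.2.3 (assumption: no suffixes).
VERSION_RE = re.compile(r"\b\d+(?:\.\d+)+\b")


def extract(text):
    """Return the unique CVE ids and version strings found in the text."""
    return {
        "cve_ids": sorted(set(CVE_RE.findall(text))),
        "versions": sorted(set(VERSION_RE.findall(text))),
    }
```

The output is a plain dict, so it can be serialized to JSON and reviewed by a human before it is stored.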

@pombredanne @AyanSinhaMahapatra

Suraj209211 avatar Mar 15 '24 19:03 Suraj209211