lgtm_hack_scripts
Collect a repo's network dependencies
I've been researching potential queries that target misconfigurations of libraries. One feature GitHub has is the "Network Dependents" graph, which lists the repos that use a given library. This is spectacular when we have a query targeting a particular use of a library. If we can figure out how to write a GitHub API call that collects these network dependent repos, we would get more positive results.
It should be noted I have done zero research into whether GitHub offers this as an API.
GitHub doesn't have a dedicated API for this; however, I have found several tools that allow us to query for repo dependents. I'll investigate these tools to see what I can extract out of them.
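For context, tools in this space generally work by scraping the repo's dependents page, since there's no official API. Below is a minimal sketch of that approach; the `/network/dependents` URL pattern is real, but the regex and the `parse_dependents`/`fetch_dependents` helpers are my own illustration and will break if GitHub changes the page markup:

```python
import re
import urllib.request

# Real URL pattern for a repo's dependents page.
DEPENDENTS_URL = "https://github.com/{owner}/{repo}/network/dependents"

def parse_dependents(html: str) -> list:
    # Each dependent repo appears as a link roughly like
    # <a data-hovercard-type="repository" href="/owner/repo">.
    # This regex is an illustration, not a robust parser.
    return re.findall(r'data-hovercard-type="repository" href="/([^"]+)"', html)

def fetch_dependents(owner: str, repo: str) -> list:
    url = DEPENDENTS_URL.format(owner=owner, repo=repo)
    with urllib.request.urlopen(url) as resp:
        return parse_dependents(resp.read().decode("utf-8"))
```

The page is paginated, so a real tool (like ghtopdep) also has to follow the "Next" links; the sketch only handles a single page.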
https://github.com/github-tooling/ghtopdep
It may also be worth pointing out that there is an NPM package dedicated to gathering NPM dependents. Although this ticket is meant for GitHub dependents, I thought it worth mentioning.
I have created a working script (branched off of #14) using ghtopdep, and it works pretty well. We can grab the repositories that use a particular GitHub library. This is EXTREMELY helpful when you want to find potential CVEs for misconfigured libraries. For example...
# Template
# python3 follow_network_dependency_repos.py <GITHUB_LIBRARY_REPO_URL> <CUSTOM_LIST_NAME>
# This will find all repositories (with a minimum of 5 stars) that use the Electron Remote library.
# We then cache the results so we can later move the repositories to the `remote-cache` LGTM custom list.
python3 follow_network_dependency_repos.py https://github.com/electron/remote remote-cache
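For context, the script expects ghtopdep's `--json` output to be a JSON array of objects with a `url` field; that key is the only one the script actually reads, so any other fields in this sample (like `stars`) are assumptions on my part:

```python
import json

# Illustrative sample of the JSON shape the script parses. Only the
# "url" key is read downstream; "stars" is assumed for illustration.
sample_output = '[{"url": "https://github.com/someorg/somerepo", "stars": 42}]'

repos = json.loads(sample_output)
# Same extraction the script performs to get "owner/repo" from the URL.
repo_name = repos[0]["url"].split("https://github.com/")[1]
print(repo_name)
```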
Also, we can filter these dependent repositories based on a search term or on the number of stars a repo has. For now I've decided to filter only on star count.
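The star filter currently happens inside ghtopdep itself (`--minstar=5`), but a search-term filter could be layered on top of the parsed results. A rough sketch; the `filter_repos` helper and the `stars` field name are my own assumptions about ghtopdep's JSON output:

```python
from typing import List

def filter_repos(repos: List[dict], min_stars: int = 5, term: str = "") -> List[dict]:
    # Keep repos at or above the star threshold whose URL contains the
    # search term. "stars" as a field name is an assumption about the
    # ghtopdep JSON; "url" is the key the main script already uses.
    return [
        r for r in repos
        if r.get("stars", 0) >= min_stars and term in r.get("url", "")
    ]
```

An empty `term` matches every URL, so calling `filter_repos(repos)` reproduces a pure star-count filter.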
Finally, as a sneak peek I've attached the python script I wrote. If all of this sounds good to you, I'll submit it as a PR once #14 is merged in. It's reliant on the code in #14.
Any thoughts are appreciated here.
Python script
from typing import List
from lgtm import LGTMSite, LGTMDataFilters
import utils.cacher
import utils.github_api
import sys
import time
import subprocess
import json
def save_project_to_lgtm(site: 'LGTMSite', repo_name: str) -> dict:
    print("About to save: " + repo_name)

    # Another throttle. Considering we are sending requests to GitHub-owned
    # properties twice in a small time frame, I would prefer for this
    # to be here.
    time.sleep(1)

    repo_url: str = 'https://github.com/' + repo_name
    project = site.follow_repository(repo_url)

    print("Saved the project: " + repo_name)
    return project
def run_command(command: str) -> str:
    result = subprocess.check_output(command, shell=True)
    # Decode the raw bytes rather than calling str() on them, which
    # would produce a "b'...'" repr string.
    return result.decode("utf-8")

def format_ghtopdep_output(output: str) -> str:
    # ghtopdep prints a human-readable summary line ending in
    # "repositories" before the JSON payload; keep only the JSON portion.
    return output.split("repositories")[-1].strip()

def get_network_dependency_graph_repos(repo_url: str) -> List[dict]:
    ghtopdep_command = f"ghtopdep {repo_url} --json --minstar=5 --rows=10000"
    raw_output = run_command(ghtopdep_command)
    formatted_output = format_ghtopdep_output(raw_output)
    return json.loads(formatted_output)
def find_and_save_projects_to_lgtm(repo_library_url: str) -> List[str]:
    repos = get_network_dependency_graph_repos(repo_library_url)
    saved_project_data: List[str] = []

    site = LGTMSite.create_from_file()
    github = utils.github_api.create()

    for repo in repos:
        repo_name = repo['url'].split("https://github.com/")[1]

        time.sleep(2)
        github_repo = github.get_repo(repo_name)

        # Skip archived repos and forks.
        if github_repo.archived or github_repo.fork:
            continue

        saved_project = save_project_to_lgtm(site, github_repo.full_name)

        time.sleep(2)
        simple_project = LGTMDataFilters.build_simple_project(saved_project)

        if not simple_project.is_valid_project:
            continue

        saved_data = f'{simple_project.display_name},{simple_project.key},{simple_project.project_type}'
        saved_project_data.append(saved_data)

    return saved_project_data
try:
    ghtopdep_help_output = run_command("ghtopdep --help")
except subprocess.CalledProcessError:
    ghtopdep_help_output = ""

if "Usage: ghtopdep [OPTIONS] URL" not in ghtopdep_help_output:
    print("ghtopdep is required to run this script. Please see the ghtopdep repo for installation instructions: https://github.com/github-tooling/ghtopdep")
    sys.exit(1)

repo_library_url = sys.argv[1]
saved_project_data = find_and_save_projects_to_lgtm(repo_library_url)

# If the user provided a second arg then they want to create a custom list.
if len(sys.argv) > 2:
    custom_list_name = sys.argv[2]
    utils.cacher.write_project_data_to_file(saved_project_data, custom_list_name)