
Collect a repo's network dependencies

Open mrthankyou opened this issue 4 years ago • 2 comments

I've been researching potential queries that target misconfigurations of libraries. One feature GitHub has is the "Network Dependencies" view, which lists the repos that use a given library. This is spectacular when we have a query targeting a particular use of a library. If we can figure out how to write a GitHub API call that collects these dependent repos, we would get more positive results.

It should be noted that I have done zero research into whether GitHub actually offers this API.

mrthankyou avatar Feb 04 '21 20:02 mrthankyou

GitHub doesn't have a dedicated API for this; however, I have found several tools that allow us to query for a repo's dependents. I'll investigate these tools to see what I can extract from them.

https://github.com/github-tooling/ghtopdep

It may also be worth pointing out that there is an npm package dedicated to gathering npm dependents. Although this ticket is about GitHub dependents, I thought it was worth mentioning.
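As a rough sketch of how ghtopdep's output could be consumed: its `--json` mode prints a JSON array of dependent repos. The exact framing (a summary line ahead of the array) and the sample payload below are assumptions for illustration; verify against your ghtopdep version.

```python
import json

def parse_ghtopdep_output(output: str) -> list:
    # ghtopdep appears to print a summary line before the JSON payload,
    # so slice from the first '[' to the last ']' before parsing.
    # (Output shape is an assumption; check your ghtopdep version.)
    start = output.find("[")
    end = output.rfind("]") + 1
    return json.loads(output[start:end])

# Hypothetical captured output from `ghtopdep <url> --json`
sample = 'found 1 repositories\n[{"url": "https://github.com/example/app", "stars": 42}]'
repos = parse_ghtopdep_output(sample)
print(repos[0]["url"])  # → https://github.com/example/app
```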

mrthankyou avatar Feb 22 '21 21:02 mrthankyou

I have created a working script (branched off of #14) using ghtopdep, and it works pretty well. We can grab the repositories that use a particular GitHub library. This is EXTREMELY helpful when you want to find potential CVEs for misconfigured libraries. For example...

# Template
# python3 follow_network_dependency_repos.py <GITHUB_LIBRARY_REPO_URL> <CUSTOM_LIST_NAME>

# This will find all repositories (with a minimum of 5 stars) that use the Electron Remote library. 
# We then cache the results so we can later move the repositories to the `remote-cache` LGTM custom list. 
python3 follow_network_dependency_repos.py https://github.com/electron/remote remote-cache

Also, we can filter the repositories that depend on the library, either by a search term or by the repo's star count. For now I've decided to filter only on the number of stars a repository has.
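The filtering described above could look something like this. The repo data and threshold are made up for illustration; only the `url`/`stars` keys mirror the shape the script below consumes.

```python
# Hypothetical dependents list in the shape ghtopdep's JSON output
# provides (each entry has at least 'url' and 'stars').
dependents = [
    {"url": "https://github.com/a/popular-app", "stars": 120},
    {"url": "https://github.com/b/tiny-demo", "stars": 2},
    {"url": "https://github.com/c/electron-tool", "stars": 45},
]

def filter_dependents(repos, min_stars=5, term=None):
    """Keep repos with enough stars and, optionally, a term in the URL."""
    kept = [r for r in repos if r["stars"] >= min_stars]
    if term:
        kept = [r for r in kept if term in r["url"]]
    return kept

# Star filter alone drops the 2-star demo repo.
print([r["url"] for r in filter_dependents(dependents)])
# Adding a search term narrows it further.
print([r["url"] for r in filter_dependents(dependents, term="electron")])
```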

Finally, as a sneak peek, I've attached the Python script I wrote. If all of this sounds good to you, I'll submit it as a PR once #14 is merged, since it relies on the code in #14.

Any thoughts are appreciated here.

Python script
from typing import List
from lgtm import LGTMSite, LGTMDataFilters

import utils.cacher
import utils.github_api

import sys
import time
import subprocess
import json

def save_project_to_lgtm(site: 'LGTMSite', repo_name: str) -> dict:
    print("About to save: " + repo_name)
    # Another throttle. Considering we are sending a request to Github
    # owned properties twice in a small time-frame, I would prefer for
    # this to be here.
    time.sleep(1)

    repo_url: str = 'https://github.com/' + repo_name
    project = site.follow_repository(repo_url)
    print("Saved the project: " + repo_name)
    return project

def run_command(command: str) -> str:
    # check_output returns bytes; decode so callers work with plain text.
    return subprocess.check_output(command, shell=True).decode("utf-8")

def format_ghtopdep_output(output: str) -> str:
    # ghtopdep prints a summary line before its JSON payload, so slice
    # from the first '[' to the last ']' before parsing.
    start = output.find("[")
    end = output.rfind("]") + 1
    return output[start:end]

def get_network_dependency_graph_repos(repo_url: str) -> List[dict]:
    ghtopdep_command = f"ghtopdep {repo_url} --json --minstar=5 --rows=10000"
    raw_output = run_command(ghtopdep_command)
    formatted_output = format_ghtopdep_output(raw_output)
    repos = json.loads(formatted_output)
    return repos

def find_and_save_projects_to_lgtm(repo_library_url: str) -> List[str]:
    repos = get_network_dependency_graph_repos(repo_library_url)
    saved_project_data: List[str] = []
    site = LGTMSite.create_from_file()

    github = utils.github_api.create()

    for repo in repos:
        repo_name = repo['url'].split("https://github.com/")[1]
        time.sleep(2)
        github_repo = github.get_repo(repo_name)

        if github_repo.archived or github_repo.fork:
            continue

        saved_project = save_project_to_lgtm(site, github_repo.full_name)
        time.sleep(2)

        simple_project = LGTMDataFilters.build_simple_project(saved_project)

        if not simple_project.is_valid_project:
            continue

        saved_data = f'{simple_project.display_name},{simple_project.key},{simple_project.project_type}'
        saved_project_data.append(saved_data)

    return saved_project_data


try:
    ghtopdep_help_output = run_command("ghtopdep --help")
except subprocess.CalledProcessError:
    ghtopdep_help_output = ""

if "Usage: ghtopdep [OPTIONS] URL" not in ghtopdep_help_output:
    print("ghtopdep is required to run this script. See its README for installation instructions: https://github.com/github-tooling/ghtopdep")
    sys.exit(1)

repo_library_url = sys.argv[1]
saved_project_data = find_and_save_projects_to_lgtm(repo_library_url)

# A second argument means the user wants the results saved to a custom list.
if len(sys.argv) > 2:
    custom_list_name = sys.argv[2]
    utils.cacher.write_project_data_to_file(saved_project_data, custom_list_name)

mrthankyou avatar Mar 05 '21 02:03 mrthankyou