wakaru icon indicating copy to clipboard operation
wakaru copied to clipboard

explore 'AST fingerprinting' for module/function identification (eg. to assist smart / stable renames, etc)

Open 0xdevalias opened this issue 6 months ago • 6 comments

Another area I started looking into (but haven't deeply explored yet) for both figuring out how to map variable names to sections of code in a 'smart' way, and potentially also for module identification (see #41); is in the space of 'structural AST fingerprinting' or 'code similarity' algorithms and similar. (I realise that this is a rather deep/esoteric angle to be looking at this from, and that there are likely going to be far simpler/easier ways to implement the variable mapping/module identification in a 'good enough' way without going to this level of depth; but I'm curious to explore it regardless, to see if any good ideas come out of it)

I haven't gotten too far in my reading yet (got distracted on other things), but the high level of my idea was that maybe we could generate an 'AST fingerprint' that isn't impacted by the variable/function/etc names ('symbols') changing during minification; and then use that as the basis for the 'key' in the 'mappings file'; as that fingerprint could theoretically still identify a 'scope' (which might be a literal JS scope, or might be a higher level abstraction that we decide makes sense; the most abstract being probably at the bundled module level) even if the bundler decides to move some functions around to a different module/etc. Then obviously if we were able to generate those 'resilient fingerprints' to identify code even when it's been minified, that would make perfect sense to apply to module detection/etc (see #41) as well.

Some of the high level ideas / search terms that I was using to start my research in that area was things like:

  • AST fingerprinting
  • Source code similarity fingerprinting
  • Control flow graphs
  • Call flow graphs
  • Program dependence graph
  • etc

Here is a link dump of a bunch of the tabs I have open but haven't got around to reviewing in depth yet, RE: 'AST fingerprinting' / Code Similarity / etc:

Unsorted/Unreviewed Initial Link Dump RE: 'AST fingerprinting' / Code Similarity
  • https://en.wikipedia.org/wiki/Program_dependence_graph
    • Program Dependence Graph - Wikipedia

    • In computer science, a Program Dependence Graph (PDG) is a representation of a program's control and data dependencies. It's a directed graph where nodes represent program statements, and edges represent dependencies between these statements. PDGs are useful in various program analysis tasks, including optimizations, debugging, and understanding program behavior.

  • https://en.wikipedia.org/wiki/Control-flow_graph
    • Control-Flow Graph - Wikipedia

    • In computer science, a control-flow graph (CFG) is a representation, using graph notation, of all paths that might be traversed through a program during its execution.

    • In a control-flow graph each node in the graph represents a basic block, i.e. a straight-line piece of code without any jumps or jump targets; jump targets start a block, and jumps end a block. Directed edges are used to represent jumps in the control flow. There are, in most presentations, two specially designated blocks: the entry block, through which control enters into the flow graph, and the exit block, through which all control flow leaves.

  • https://stackoverflow.com/questions/7283702/assembly-level-function-fingerprint
    • Stack Overflow: Assembly-level function fingerprint (2011)

  • https://stackoverflow.com/questions/15087195/data-flow-graph-construction
    • Stack Overflow: Data Flow Graph Construction (2013)

  • https://codereview.stackexchange.com/questions/276387/call-flow-graph-from-python-abstract-syntax-tree
    • Code Review Stack Exchange: Call-flow graph from Python abstract syntax tree (2022)

  • https://codeql.github.com/docs/writing-codeql-queries/about-data-flow-analysis/
    • CodeQL Documentation: About data flow analysis

    • Data flow analysis is used to compute the possible values that a variable can hold at various points in a program, determining how those values propagate through the program and where they are used.

  • https://clang.llvm.org/docs/DataFlowAnalysisIntro.html
    • Clang Documentation: Data flow analysis: an informal introduction

    • This document introduces data flow analysis in an informal way. The goal is to give the reader an intuitive understanding of how it works, and show how it applies to a range of refactoring and bug finding problems.

    • Data flow analysis is a static analysis technique that proves facts about a program or its fragment. It can make conclusions about all paths through the program, while taking control flow into account and scaling to large programs. The basic idea is propagating facts about the program through the edges of the control flow graph (CFG) until a fixpoint is reached.

  • https://openreview.net/forum?id=BJxWx0NYPr
    • Adaptive Structural Fingerprints for Graph Attention Networks (2019)

    • Graph attention network (GAT) is a promising framework to perform convolution and massage passing on graphs. Yet, how to fully exploit rich structural information in the attention mechanism remains a challenge. In the current version, GAT calculates attention scores mainly using node features and among one-hop neighbors, while increasing the attention range to higher-order neighbors can negatively affect its performance, reflecting the over-smoothing risk of GAT (or graph neural networks in general), and the ineffectiveness in exploiting graph structural details. In this paper, we propose an "adaptive structural fingerprint" (ADSF) model to fully exploit graph topological details in graph attention network. The key idea is to contextualize each node with a weighted, learnable receptive field encoding rich and diverse local graph structures. By doing this, structural interactions between the nodes can be inferred accurately, thus significantly improving subsequent attention layer as well as the convergence of learning. Furthermore, our model provides a useful platform for different subspaces of node features and various scales of graph structures to 'cross-talk' with each other through the learning of multi-head attention, being particularly useful in handling complex real-world data. Empirical results demonstrate the power of our approach in exploiting rich structural information in GAT and in alleviating the intrinsic oversmoothing problem in graph neural networks.

  • https://dl.acm.org/doi/10.1145/3486860
    • A Survey of Binary Code Fingerprinting Approaches: Taxonomy, Methodologies, and Features (2022)

    • Binary code fingerprinting is crucial in many security applications. Examples include malware detection, software infringement, vulnerability analysis, and digital forensics. It is also useful for security researchers and reverse engineers since it enables high fidelity reasoning about the binary code such as revealing the functionality, authorship, libraries used, and vulnerabilities. Numerous studies have investigated binary code with the goal of extracting fingerprints that can illuminate the semantics of a target application. However, extracting fingerprints is a challenging task since a substantial amount of significant information will be lost during compilation, notably, variable and function naming, the original data and control flow structures, comments, semantic information, and the code layout. This article provides the first systematic review of existing binary code fingerprinting approaches and the contexts in which they are used. In addition, it discusses the applications that rely on binary code fingerprints, the information that can be captured during the fingerprinting process, and the approaches used and their implementations. It also addresses limitations and open questions related to the fingerprinting process and proposes future directions.

  • https://inria.hal.science/hal-01648996/document
    • BinSign: Fingerprinting Binary Functions to Support Automated Analysis of Code Executables (2017)

    • Binary code fingerprinting is a challenging problem that requires an in-depth analysis of binary components for deriving identifiable signatures. Fingerprints are useful in automating reverse engineering tasks including clone detection, library identification, authorship attribution, cyber forensics, patch analysis, malware clustering, binary auditing, etc. In this paper, we present BinSign, a binary function fingerprinting framework. The main objective of BinSign is providing an accurate and scalable solution to binary code fingerprinting by computing and matching structural and syntactic code profiles for disassemblies. We describe our methodology and evaluate its performance in several use cases, including function reuse, malware analysis, and indexing scalability. Additionally, we emphasize the scalability aspect of BinSign. We perform experiments on a database of 6 million functions. The indexing process requires an average time of 0.0072 seconds per function. We find that BinSign achieves higher accuracy compared to existing tools.

  • https://hal.science/hal-00627811/document
    • Syntax tree fingerprinting: a foundation for source code similarity detection (2011)

    • Plagiarism detection and clone refactoring in software depend on one common concern: finding similar source chunks across large repositories. However, since code duplication in software is often the result of copy-paste behaviors, only minor modifications are expected between shared codes. On the contrary, in a plagiarism detection context, edits are more extensive and exact matching strategies show their limits. Among the three main representations used by source code similarity detection tools, namely the linear token sequences, the Abstract Syntax Tree (AST) and the Program Dependency Graph (PDG), we believe that the AST could efficiently support the program analysis and transformations required for the advanced similarity detection process. In this paper we present a simple and scalable architecture based on syntax tree fingerprinting. Thanks to a study of several hashing strategies reducing false-positive collisions, we propose a framework that efficiently indexes AST representations in a database, that quickly detects exact (w.r.t source code abstraction) clone clusters and that easily retrieves their corresponding ASTs. Our aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modification patterns seen in the intra-project copy-pastes and in the plagiarism cases.

  • https://ieeexplore.ieee.org/document/5090050
    • Syntax tree fingerprinting for source code similarity detection (2009)

    • Numerous approaches based on metrics, token sequence pattern-matching, abstract syntax tree (AST) or program dependency graph (PDG) analysis have already been proposed to highlight similarities in source code: in this paper we present a simple and scalable architecture based on AST fingerprinting. Thanks to a study of several hashing strategies reducing false-positive collisions, we propose a framework that efficiently indexes AST representations in a database, that quickly detects exact (w.r.t source code abstraction) clone clusters and that easily retrieves their corresponding ASTs. Our aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modification patterns seen in the intra-project copy-pastes and in the plagiarism cases.

    • https://igm.univ-mlv.fr/~chilowi/research/syntax_tree_fingerprinting/syntax_tree_fingerprinting_ICPC09.pdf
  • https://ieeexplore.ieee.org/document/9960266
    • Source Code Plagiarism Detection Based on Abstract Syntax Tree Fingerprintings (2022)

    • Syntax Tree (AST) is an abstract logical structure of source code represented as a tree. This research utilizes information of fingerprinting with AST to locate the similarities between source codes. The proposed method can detect plagiarism in source codes using the number of duplicated logical structures. The structural information of program is stored in the fingerprints format. Then, the fingerprints of source codes are compared to identify number of similar nodes. The final output is calculated from number of similar nodes known as similarities scores. The result shows that the proposed method accurately captures the common modification techniques from basic to advance.

  • https://digitalcommons.calpoly.edu/theses/2040/
    • Cloneless: Code Clone Detection via Program Dependence Graphs with Relaxed Constraints (2019)

    • Code clones are pieces of code that have the same functionality. While some clones may structurally match one another, others may look drastically different. The inclusion of code clones clutters a code base, leading to increased costs through maintenance. Duplicate code is introduced through a variety of means, such as copy-pasting, code generated by tools, or developers unintentionally writing similar pieces of code. While manual clone identification may be more accurate than automated detection, it is infeasible due to the extensive size of many code bases. Software code clone detection methods have differing degree of success based on the analysis performed. This thesis outlines a method of detecting clones using a program dependence graph and subgraph isomorphism to identify similar subgraphs, ultimately illuminating clones. The project imposes few constraints when comparing code segments to potentially reveal more clones.

    • https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?article=3437&context=theses
  • https://dl.acm.org/doi/10.1145/1286821.1286826
    • Dynamic graph-based software fingerprinting (2007)

    • Fingerprinting embeds a secret message into a cover message. In media fingerprinting, the secret is usually a copyright notice and the cover a digital image. Fingerprinting an object discourages intellectual property theft, or when such theft has occurred, allows us to prove ownership.

      The Software Fingerprinting problem can be described as follows. Embed a structure W into a program P such that: W can be reliably located and extracted from P even after P has been subjected to code transformations such as translation, optimization and obfuscation; W is stealthy; W has a high data rate; embedding W into P does not adversely affect the performance of P; and W has a mathematical property that allows us to argue that its presence in P is the result of deliberate actions.

      In this article, we describe a software fingerprinting technique in which a dynamic graph fingerprint is stored in the execution state of a program. Because of the hardness of pointer alias analysis such fingerprints are difficult to attack automatically.

    • https://dl.acm.org/doi/pdf/10.1145/1286821.1286826
  • https://patents.google.com/patent/US9459861B1/en
    • Systems and methods for detecting copied computer code using fingerprints (2016)

    • Systems and methods of detecting copying of computer code or portions of computer code involve generating unique fingerprints from compiled computer binaries. The unique fingerprints are simplified representations of functions in the compiled computer binaries and are compared with each other to identify similarities between functions in the respective compiled computer binaries. Copying can be detected when there are sufficient similarities between fingerprints of two functions.

  • https://www.unomaha.edu/college-of-information-science-and-technology/research-labs/_files/software-nsf.pdf
    • Software Fingerprinting in LLVM (2021)

    • Executable steganography, the hiding of software machine code inside of a larger program, is a potential approach to introduce new software protection constructs such as watermarks or fingerprints. Software fingerprinting is, therefore, a process similar to steganography, hiding data within other data. The goal of fingerprinting is to hide a unique secret message, such as a serial number, into copies of an executable program in order to provide proof of ownership of that program. Fingerprints are a special case of watermarks, with the difference being that each fingerprint is unique to each copy of a program. Traditionally, researchers describe four aims that a software fingerprint should achieve. These include the fingerprint should be difficult to remove, it should not be obvious, it should have a low false positive rate, and it should have negligible impact on performance. In this research, we propose to extend these objectives and introduce a fifth aim: that software fingerprints should be machine independent. As a result, the same fingerprinting method can be used regardless of the architecture used to execute the program. Hence, this paper presents an approach towardsthe realization of machine-independent fingerprinting of executable programs. We make use of Low-Level Virtual Machine (LLVM) intermediate representation during the software compilation process to demonstrate both a simple static fingerprinting method as well as a dynamic method, which displays our aim of hardware independent fingerprinting. The research contribution includes a realization of the approach using the LLVM infrastructure and provides a proof of concept for both simple static and dynamic watermarks that are architecture neutral.

  • https://www.computer.org/csdl/journal/ts/2023/08/10125077/1Nc4Vd4vb7W
    • Graph-of-Code: Semantic Clone Detection Using Graph Fingerprints (2023)

    • The code clone detection issue has been researched using a number of explicit factors based on the tokens and contents and found effective results. However, exposing code contents may be an impractical option because of privacy and security factors. Moreover, the lack of scalability of past methods is an important challenge. The code flow states can be inferred by code structure and implicitly represented using empirical graphs. The assumption is that modelling of the code clone detection problem can be achieved without the content of the codes being revealed. Here, a Graph-of-Code concept for the code clone detection problem is introduced, which represents codes into graphs. While Graph-of-Code provides structural properties and quantification of its characteristics, it can exclude code contents or tokens to identify the clone type. The aim is to evaluate the impact of graph-of-code structural properties on the performance of code clone detection. This work employs a feature extraction-based approach for unlabelled graphs. The approach generates a “Graph Fingerprint” which represents different topological feature levels. The results of code clone detection indicate that code structure has a significant role in detecting clone types. We found different GoC-models outperform others. The models achieve between 96% to 99% in detecting code clones based on recall, precision, and F1-Score. The GoC approach is capable in detecting code clones with scalable dataset and with preserving codes privacy.

  • https://www.cs.columbia.edu/~suman/secure_sw_devel/Basic_Program_Analysis_CF.pdf
    • Slides: Basic Program Analysis - Suman Jana

    • ChatGPT Summary / Abstract:
      • Title: Basic Program Analysis

        Author: Suman Jana

        Institution: Columbia University

        Abstract: This document delves into the foundational concepts and techniques involved in program analysis, particularly focusing on control flow and data flow analysis essential for identifying security bugs in source code. The objective is to equip readers with the understanding and tools needed to effectively analyze programs without building systems from scratch, utilizing existing frameworks such as LLVM for customization and enhancement of analysis processes.

        The core discussion includes an overview of compiler design with specific emphasis on the Abstract Syntax Tree (AST), Control Flow Graph (CFG), and Data Flow Analysis. These elements are critical in understanding the structure of source code and its execution flow. The document highlights the conversion of source code into AST and subsequently into CFG, where data flow analysis can be applied to optimize code and identify potential security vulnerabilities.

        Additionally, the paper explores more complex topics like identifying basic blocks within CFG, constructing CFG from basic blocks, and advanced concepts such as loop identification and the concept of dominators in control flow. It also addresses the challenges and solutions related to handling irreducible Control Flow Graphs (CFGs), which are crucial for the analysis of less structured code.

        Keywords: Program Analysis, Compiler Design, Abstract Syntax Tree (AST), Control Flow Graph (CFG), Data Flow Analysis, LLVM, Security Bugs.

  • https://www.researchgate.net/publication/370980383_A_graph-based_code_representation_method_to_improve_code_readability_classification
    • A graph-based code representation method to improve code readability classification (2023)

    • Context Code readability is crucial for developers since it is closely related to code maintenance and affects developers’ work efficiency. Code readability classification refers to the source code being classified as pre-defined certain levels according to its readability. So far, many code readability classification models have been proposed in existing studies, including deep learning networks that have achieved relatively high accuracy and good performance. Objective However, in terms of representation, these methods lack effective preservation of the syntactic and semantic structure of the source code. To extract these features, we propose a graph-based code representation method. Method Firstly, the source code is parsed into a graph containing its abstract syntax tree (AST) combined with control and data flow edges to reserve the semantic structural information and then we convert the graph nodes’ source code and type information into vectors. Finally, we train our graph neural networks model composing Graph Convolutional Network (GCN), DMoNPooling, and K-dimensional Graph Neural Networks (k-GNNs) layers to extract these features from the program graph. Result We evaluate our approach to the task of code readability classification using a Java dataset provided by Scalabrino et al. (2016). The results show that our method achieves 72.5% and 88% in three-class and two-class classification accuracy, respectively. Conclusion We are the first to introduce graph-based representation into code readability classification. Our method outperforms state-of-the-art readability models, which suggests that the graph-based code representation method is effective in extracting syntactic and semantic information from source code, and ultimately improves code readability classification.

Edit: Started a new gist to keep my notes/references altogether in one place in a better way + added the above linkdump to it:

  • https://gist.github.com/0xdevalias/31c6574891db3e36f15069b859065267#fingerprinting-minified-javascript-libraries--ast-fingerprinting--source-code-similarity--etc

Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/34#issuecomment-1843850057

See Also

  • https://github.com/pionxzh/wakaru/issues/41
    • https://github.com/pionxzh/wakaru/issues/41#issuecomment-1810097408
    • https://github.com/pionxzh/wakaru/issues/41#issuecomment-1836835324
    • https://github.com/pionxzh/wakaru/issues/41#issuecomment-1890296652
  • https://github.com/pionxzh/wakaru/issues/34
    • https://github.com/pionxzh/wakaru/issues/34#issuecomment-1837858619
    • https://github.com/pionxzh/wakaru/issues/34#issuecomment-1843850057
    • https://github.com/pionxzh/wakaru/issues/34#issuecomment-1845916355
  • https://github.com/pionxzh/wakaru/issues/73
  • https://github.com/pionxzh/wakaru/issues/121

0xdevalias avatar Dec 12 '23 23:12 0xdevalias