Feat request: multiple programming languages

Open euberdeveloper opened this issue 1 year ago • 7 comments

As of now, it seems that JPlag supports multiple programming languages, but only in a homogeneous way.

This means that I can compare two submissions that are both in Java or both in Python, but not one in Java with one in Python.

At first this may not seem to make sense, but translating a program from one language to another can actually be a form of obfuscation.

Java and Python may not be the perfect example, but the issue becomes more relevant for languages such as Java, Kotlin, and Scala, which all run on the JVM.

euberdeveloper avatar Feb 07 '24 12:02 euberdeveloper

Good point, this relates to cross-language plagiarism detection. While there has been some research in that area, there are (to my knowledge) no usable tools for it. In the future, we may want to introduce this by creating a shared set of token types for concepts that are common between languages. Language modules could then reuse these token types, which would allow cross-language support. On a similar note, we may consider polyglot support, meaning parsing multi-language submissions by delegating the different files to different language modules.

tsaglam avatar Feb 07 '24 15:02 tsaglam

Hello, this has been done in this fork: https://github.com/euberdeveloper/JPlag/tree/feature/multilanguage-plagiarism-detection

A pull request will follow.

euberdeveloper avatar Jul 13 '24 11:07 euberdeveloper

We have our own ideas for that, but we are happy to look at yours. Keep in mind that these might be major changes that need to take other upcoming changes and API considerations into account, and that must not break existing features (e.g. token sequence normalization or match merging).

tsaglam avatar Jul 15 '24 11:07 tsaglam

I think what I've done is more like a proof of concept. The benefits so far are:

  • The code shows examples of the changes needed to accept a set of languages as input instead of a single one
  • Each language interface gets an additional method, "supportCrossPlagiarism", which specifies whether that language supports the feature
  • Each language that supports the feature has an additional parser that produces general tokens
  • The code shows that no major changes are needed on the report side
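To illustrate the second point, here is a minimal sketch of how such an opt-in method could be used to check the set of languages requested for a run. Only the method name "supportCrossPlagiarism" comes from my comment above; `CrossLanguage`, `SimpleLanguage` and `LanguageSetValidator` are hypothetical names made up for this example and are not JPlag's actual API or the exact code in the fork.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for a language module interface; not JPlag's real Language interface.
interface CrossLanguage {
    String name();

    // Modules opt in to cross-language comparison by overriding this method.
    default boolean supportCrossPlagiarism() {
        return false;
    }
}

// Trivial implementation, used only to make the sketch runnable.
final class SimpleLanguage implements CrossLanguage {
    private final String name;
    private final boolean crossLanguage;

    SimpleLanguage(String name, boolean crossLanguage) {
        this.name = name;
        this.crossLanguage = crossLanguage;
    }

    @Override
    public String name() {
        return name;
    }

    @Override
    public boolean supportCrossPlagiarism() {
        return crossLanguage;
    }
}

final class LanguageSetValidator {
    // Returns the names of the requested languages that cannot take part in a cross-language run.
    static List<String> unsupported(List<CrossLanguage> requested) {
        return requested.stream()
            .filter(language -> !language.supportCrossPlagiarism())
            .map(CrossLanguage::name)
            .collect(Collectors.toList());
    }
}
```

With this shape, a run that requests several languages could be rejected early if `unsupported(...)` returns a non-empty list.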

To speed up the process, I made the single-language front ends first produce their default language-specific tokens, and then I wrote a converter that maps those tokens to general ones. I would not recommend this approach: the results are not good, and many issues could be fixed by obtaining language-agnostic tokens directly, by parsing the source code from scratch. I will implement this improvement soon.
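For reference, the converter shortcut looks roughly like the sketch below. All token names, the `GeneralTokenType` enum and the `GeneralTokenConverter` class are hypothetical and chosen only for illustration; they are not JPlag's real token types or the fork's actual code.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical general (language-agnostic) token types.
enum GeneralTokenType { CLASS_DECL, METHOD_DECL, LOOP, RETURN }

final class GeneralTokenConverter {
    // Hypothetical language-specific token names mapped to general ones.
    static final Map<String, GeneralTokenType> JAVA = Map.of(
        "J_CLASS_BEGIN", GeneralTokenType.CLASS_DECL,
        "J_METHOD_BEGIN", GeneralTokenType.METHOD_DECL,
        "J_WHILE_BEGIN", GeneralTokenType.LOOP,
        "J_FOR_BEGIN", GeneralTokenType.LOOP,
        "J_RETURN", GeneralTokenType.RETURN);

    static final Map<String, GeneralTokenType> KOTLIN = Map.of(
        "K_CLASS_DECLARATION", GeneralTokenType.CLASS_DECL,
        "K_FUNCTION", GeneralTokenType.METHOD_DECL,
        "K_WHILE", GeneralTokenType.LOOP,
        "K_FOR", GeneralTokenType.LOOP,
        "K_RETURN", GeneralTokenType.RETURN);

    // Converts an already extracted language-specific token sequence to general tokens.
    // Tokens without an entry in the mapping are silently dropped.
    static List<GeneralTokenType> convert(List<String> tokens, Map<String, GeneralTokenType> mapping) {
        return tokens.stream()
            .filter(mapping::containsKey)
            .map(mapping::get)
            .collect(Collectors.toList());
    }
}
```

In this sketch, a Java sequence and its Kotlin translation end up as the same general sequence, which is the goal. The converter, however, only sees whatever the language-specific tokenizer already decided to emit, and anything without a counterpart is lost.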

euberdeveloper avatar Jul 16 '24 14:07 euberdeveloper

Another improvement I want to make is to have the language-agnostic tokens be dynamic. Each language will override/implement some methods such as "supportsClasses" or "has variable declarations". For example, C would return false for the first method and true for the second one. Python would return true for the first one and false for the second one. Java would return true for both.

Then, the language-agnostic tokenizer for each language would receive the full set of languages for this run as an additional parameter. Based on what those languages support, it will change its behaviour. For example, if Java, Python and C are provided, the Java tokenizer will discard class tokens. If only Java and Python are provided as possible languages for this run, the Java tokenizer will emit class tokens.
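A minimal sketch of this idea, assuming the capabilities are modelled as a set of features per language instead of one method per feature. `Feature`, `GeneralToken` and `CapabilityFilter` are hypothetical names, the capability table only restates the C/Python/Java example above, and the filtering is applied to an already tokenized sequence to keep the example short.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical feature flags, standing in for methods like "supportsClasses".
enum Feature { CLASSES, VARIABLE_DECLARATIONS }

// Hypothetical language-agnostic token types.
enum GeneralToken { CLASS, VAR_DECL, LOOP, CALL }

final class CapabilityFilter {
    // Capabilities as described in the example above (assumed, not taken from JPlag).
    static final Map<String, Set<Feature>> CAPABILITIES = Map.of(
        "java", EnumSet.of(Feature.CLASSES, Feature.VARIABLE_DECLARATIONS),
        "python", EnumSet.of(Feature.CLASSES),
        "c", EnumSet.of(Feature.VARIABLE_DECLARATIONS));

    // Feature that a token depends on; tokens not listed here are always emitted.
    static final Map<GeneralToken, Feature> REQUIRED = Map.of(
        GeneralToken.CLASS, Feature.CLASSES,
        GeneralToken.VAR_DECL, Feature.VARIABLE_DECLARATIONS);

    // Features supported by every language of the run (only languages in the table are handled).
    static Set<Feature> commonFeatures(List<String> runLanguages) {
        Set<Feature> common = EnumSet.allOf(Feature.class);
        for (String language : runLanguages) {
            common.retainAll(CAPABILITIES.get(language));
        }
        return common;
    }

    // Keeps only the tokens whose required feature is shared by all languages of the run.
    static List<GeneralToken> filter(List<GeneralToken> tokens, List<String> runLanguages) {
        Set<Feature> common = commonFeatures(runLanguages);
        return tokens.stream()
            .filter(token -> !REQUIRED.containsKey(token) || common.contains(REQUIRED.get(token)))
            .collect(Collectors.toList());
    }
}
```

With Java, Python and C in the run, `filter` drops both `CLASS` and `VAR_DECL` from a Java token sequence. With only Java and Python, `CLASS` is kept and `VAR_DECL` is still dropped, since Python does not report variable declarations.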

euberdeveloper avatar Jul 16 '24 14:07 euberdeveloper

I have some work in progress with this

euberdeveloper avatar Jul 16 '24 14:07 euberdeveloper