JPlag
Core algorithm refactoring
This PR aims to refactor the core algorithm to make it more readable and maintainable. The following refactorings were performed:
- move state `Token.marked` into the algorithm
- remove the optimisation of reducing RAM usage by only generating reversed hash tables when necessary
- remove `TokenHashMap`, which only differed from a normal map by applying a modulo to the keys, which had no benefit for the algorithm
- encapsulate `Token.hash`, `TokenList.tokenHashes`, and `TokenList.hashLength` into a separate class `SubsequenceHashLookupTable` with improved naming and documentation (a rough sketch of such a class follows below this list)
- remove `TokenList` and replace it with `List<Token>`
- move state `Token.isBaseCode` into the algorithm
  - refactored base code tagging to occur outside of the core algorithm
  - only tag base code for the compared submission, not for the base code submission
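To give a rough picture of the encapsulation mentioned above, the following is a minimal sketch of what such a lookup table could look like. Only the class name is taken from the PR; the method and field names, and the use of plain integer token types instead of `List<Token>`, are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch of a subsequence hash lookup table: it maps the hash of
 * each token subsequence of a fixed length to the start indices of all
 * subsequences with that hash. The real class works on List<Token>; plain
 * integer token types are used here to keep the sketch self-contained.
 */
class SubsequenceHashLookupTableSketch {
    private final int windowLength;
    private final Map<Integer, List<Integer>> startIndexesPerHash = new HashMap<>();

    SubsequenceHashLookupTableSketch(int windowLength, List<Integer> tokenTypes) {
        this.windowLength = windowLength;
        for (int start = 0; start + windowLength <= tokenTypes.size(); start++) {
            int hash = hashOfWindow(tokenTypes, start);
            startIndexesPerHash.computeIfAbsent(hash, key -> new ArrayList<>()).add(start);
        }
    }

    /** Returns the start indices of all subsequences sharing the given hash. */
    List<Integer> startIndexesOfHash(int hash) {
        return startIndexesPerHash.getOrDefault(hash, List.of());
    }

    private int hashOfWindow(List<Integer> tokenTypes, int start) {
        // Plain polynomial hash over the window; the actual implementation uses
        // a rolling hash so all windows can be hashed in linear time.
        int hash = 0;
        for (int i = start; i < start + windowLength; i++) {
            hash = 31 * hash + tokenTypes.get(i);
        }
        return hash;
    }
}
```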
Discussion
TokenList removal
With the removal of `TokenList` come two minor changes.
First, the string optimisation of pointing to the same String object when multiple tokens share the same file name was removed. I don't think this optimisation is still necessary on modern-day computers.
Second, the token list adjusted tokens to have monotonically non-decreasing line numbers. This is only relevant for the visualisation, as the matching does not use line number information. With some minor testing, the adjustment behaviour seemed to trigger only for FILE_END tokens, which was fixed by assigning a NO_VALUE line to any FILE_END token. Since FILE_END tokens cannot appear in any match anyway, this should not make any difference for the visualisation of results.
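As an illustration of that change, the sketch below creates FILE_END tokens with a NO_VALUE line number up front, so no monotonicity adjustment is needed afterwards. All names here (the record, the constants, the helper) are placeholders, not the actual JPlag types.

```java
import java.util.List;

/** Minimal sketch, assuming placeholder types: FILE_END tokens are created with
 *  a NO_VALUE line number so the old monotonic line-number adjustment becomes
 *  unnecessary. They never appear in matches, so visualisation is unaffected. */
class FileEndTokenSketch {
    static final int NO_VALUE = -1;
    static final int FILE_END = 0; // placeholder token type constant

    record SketchToken(int type, String file, int line) {
    }

    static SketchToken fileEndToken(String file) {
        return new SketchToken(FILE_END, file, NO_VALUE);
    }

    public static void main(String[] args) {
        List<SketchToken> tokens = List.of(
                new SketchToken(1, "A.java", 3),
                new SketchToken(2, "A.java", 7),
                fileEndToken("A.java"));
        tokens.forEach(System.out::println);
    }
}
```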
Choice of actual Set / Map class
The refactorings introduce various variables of type `Set` or `Map`. To tune performance, we could test different implementations of these interfaces. An initial discussion was already started here.
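If we do benchmark alternative implementations, one option is to centralise construction in a small factory so the concrete classes can be swapped in one place. The helper below is purely illustrative and not part of the PR.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper to centralise the choice of concrete Set/Map classes,
 *  so alternative implementations can be benchmarked without touching the
 *  algorithm itself. Not part of the PR. */
final class CollectionFactory {
    private CollectionFactory() {
    }

    static <T> Set<T> newSet(int expectedSize) {
        // Pre-sizing the backing table avoids rehashing for known sizes.
        return new HashSet<>(Math.max(16, (int) (expectedSize / 0.75f) + 1));
    }

    static <K, V> Map<K, V> newMap(int expectedSize) {
        return new HashMap<>(Math.max(16, (int) (expectedSize / 0.75f) + 1));
    }
}
```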
Testing
As the test coverage of the core is not too high, we should validate the refactorings on additional data sets (and maybe try to integrate some of those in the form of plain token sequences into the core). The refactorings should not change functionality, thus the similarity values should remain identical for any submission. Furthermore, as the algorithm is highly optimised for performance, we should compare the performance of the refactored implementation to the previous one on a large data set. Unfortunately, I don't have a comparison set at hand, so it would be beneficial if someone from @jplag/maintainer could help out here.
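One possible shape for such a validation is a regression-style check that pins previously recorded similarity values and compares them against freshly computed ones. How the values are obtained from JPlag is deliberately left out here, since the exact API call depends on the version in use.

```java
import java.util.Map;

/** Sketch of a regression check: similarity values for a fixed data set must be
 *  identical before and after the refactoring. Obtaining the values is left
 *  abstract; plug in the actual JPlag invocation for the data set. */
final class SimilarityRegressionCheck {
    static void assertUnchanged(Map<String, Double> expected, Map<String, Double> actual) {
        if (!expected.keySet().equals(actual.keySet())) {
            throw new AssertionError("Comparison pairs differ: " + actual.keySet());
        }
        expected.forEach((pair, similarity) -> {
            double recomputed = actual.get(pair);
            // The refactoring must not change results, so values must match exactly.
            if (Double.compare(similarity, recomputed) != 0) {
                throw new AssertionError(pair + ": expected " + similarity + " but was " + recomputed);
            }
        });
    }
}
```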
Outlook
Tokens
I refactored all state-related information of tokens into the algorithm. In a future PR, we should refactor tokens to be records instead of classes. However, for this to be possible, we need to avoid subclassing tokens for every language. This will probably no longer be needed once token constants are refactored to enums.
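To sketch where this could lead, a state-free token as a record over an enum-based type might look roughly like this. Both names and the enum constants are illustrative, since the enum refactoring has not happened yet.

```java
/** Hypothetical sketch of a state-free token once token constants are enums
 *  and language-specific subclasses are no longer needed. Names are illustrative. */
record ImmutableToken(TokenType type, String file, int line, int column, int length) {
}

/** Placeholder for a future token type enum; each language frontend would
 *  contribute its own constants. */
enum TokenType {
    FILE_END,
    CLASS_DECLARATION,
    METHOD_DECLARATION
}
```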
Naming
We use `firstSubmission` and `secondSubmission` throughout the application. From my perspective, these names suggest some kind of ordering which does not exist. Thus, I renamed the submissions in the core algorithm to `leftSubmission` and `rightSubmission`. Another option would have been `patternSubmission` and `textSubmission`, which would be closer to the original paper. I chose the former naming because the clear separation between pattern (baseline file) and text (file to detect matches in) is not given in our application context. If the left / right naming is accepted, we should adjust the naming throughout the application in a follow-up PR.
Parallel Comparison Strategy
Since tokens are now state-free, it should be possible to entirely remove the locking in the parallel comparison strategy.
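Conceptually, once a comparison only reads immutable token data, every submission pair can be processed independently, for example via a parallel stream with no synchronisation at all. The types below are placeholders for illustration, not the actual strategy classes.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Sketch of a lock-free parallel comparison once tokens carry no mutable state.
 *  SubmissionPair and ComparisonResult are placeholders for illustration. */
final class LockFreeComparison {
    record SubmissionPair(String left, String right) {
    }

    record ComparisonResult(SubmissionPair pair, double similarity) {
    }

    static List<ComparisonResult> compareAll(List<SubmissionPair> pairs) {
        return pairs.parallelStream()
                // Each comparison only reads immutable token data, so no locking is needed.
                .map(pair -> new ComparisonResult(pair, compare(pair)))
                .collect(Collectors.toList());
    }

    private static double compare(SubmissionPair pair) {
        return 0.0; // placeholder for the actual greedy string tiling comparison
    }
}
```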
https://github.com/jplag/JPlag/blob/0780fc49c5d37c70f172f9d8289ddc158b6de934/jplag.frontend-utils/src/main/java/de/jplag/TokenList.java#L34-L45
I would like to remove the `TokenList` data structure entirely, for which we have to get rid of this custom add functionality.
The first part (ll37-39) with the file name can probably be removed.
For the second part (ll40-42) I am not sure whether we require a token sequence to have monotonically non-decreasing lines. I would guess if anywhere, then in the report viewer. @nestabentum do you by any chance know more about that?
@JanWittler all I know is that the `TokenList` is currently used to determine the matches of a comparison: https://github.com/jplag/JPlag/blob/0780fc49c5d37c70f172f9d8289ddc158b6de934/jplag/src/main/java/de/jplag/reporting/reportobject/ReportObjectFactory.java#L156-L171
The information on matches is then further used to highlight them in the comparison view.
Alright, thanks for the pointer. I will have a look at that.
The aim of this PR is to make the core algorithm understandable by looking at the code, without reading the paper. From my perspective this is now the case, thus the PR is ready for review. If you spot any parts of the algorithm that you still don't understand or would like to see more documentation for, please comment.
> For the second part (ll40-42) I am not sure whether we require a token sequence to have monotonically non-decreasing lines. I would guess if anywhere, then in the report viewer.
@JanWittler I think these lines ensure that EOF tokens have a proper line number. As EOF tokens are specified without a line index, this might be the place to set the line index to that of the second-to-last token of the file. EOF tokens cannot easily be given a line number in the frontend, as the line indexing is done by the parser/scanner/compiler. Also, this might be an optimization for tokens that do not use lines and columns but only a single index; however, this is less likely.
@jplag/maintainer I think there is no other PR blocking this one anymore, so it is ready to merge from my side. The end2end tests all run successfully, so I would assume that the changes do not change any results (as intended). However, if you would like to validate this on a larger submission set, feel free to do so.