git-filter-repo

Analysis reports files existing in tags as deleted

Open fdevibe opened this issue 5 years ago • 3 comments

After running git filter-repo --analyze, I am, as expected, left with a subdirectory filter-repo/analysis containing files like path-deleted-sizes.txt. On closer examination, this file lists as deleted some files that still exist in tags. I cannot find this explicitly mentioned in the manual, but I would assume that a file reported as deleted has either never existed in any tag or branch, or has been removed from every branch and tag in which it existed. Is this assumption wrong, or is this an actual issue?

fdevibe avatar Nov 17 '20 14:11 fdevibe

In the README file in the analysis directory, the following is said about deletions:

Whether a file is deleted is not a binary quality, since it can be deleted on some branches but still exist in others. Also, it might exist in an old tag, but have been deleted in versions newer than that. More thorough tracking could be done, including looking at merge commits where one side of history deleted and the other modified, in order to give a more holistic picture of deletions. However, that algorithm would not only be more complex to implement, it'd also be quite difficult to present and interpret by users. Since --analyze is just about getting a high-level rough picture of history, it instead implements the simplistic rule that is good enough for 98% of cases: A file is marked as deleted if the last commit in the fast-export stream that mentions the file lists it as deleted. This makes it dependent on topological ordering, but generally gives the "right" answer.
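To make the quoted rule concrete, here is a minimal sketch of "last mention wins". It does not reproduce filter-repo's actual code; the stream representation (a list of commits, each with a list of `(op, path)` changes, where `M` means modified and `D` means deleted) and the function name are assumptions made purely for illustration.

```python
def deleted_by_last_mention(stream):
    """Return the paths whose last mention in the stream is a deletion.

    `stream` is a hypothetical, simplified stand-in for a fast-export
    stream: an ordered list of (commit_id, changes) pairs, where
    `changes` is a list of (op, path) tuples and op is "M" or "D".
    """
    last_op = {}
    for _commit_id, changes in stream:
        for op, path in changes:
            # Later commits in the stream overwrite earlier mentions.
            last_op[path] = op
    return {path for path, op in last_op.items() if op == "D"}


# Commit A (tagged v1, say) adds two files; commit B on the main
# branch later deletes one of them.
stream = [
    ("A", [("M", "a.txt"), ("M", "b.txt")]),
    ("B", [("D", "a.txt")]),
]
deleted = deleted_by_last_mention(stream)
```

Under this rule a.txt ends up in the deleted set even though it is still reachable through the tag on commit A, which matches the behaviour described in this issue.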

I cannot tell from this whether one should also expect seemingly arbitrary files to be marked as deleted while still being present in various branches or tags. If so, I would argue that the claim that the results are good enough for 98% of cases is fairly bold: the user will assume it is safe to remove these files, and get fairly unexpected results.

My initial interpretation of this text, however, was that you could get "false positives" in the sense that files could be marked as present even when they are not. It now seems that this might not be a correct assumption.

So, is this an actual issue, or is it expected behaviour? If it is indeed expected, I think it would be very nice to have a fairly visible warning in the man page and in the headings of the resulting text files, stating that the files may be deleted, or that they are deleted in the fast-export stream, to avoid misinterpretation.

fdevibe avatar Nov 18 '20 16:11 fdevibe

I got around this for now by first caching all the files in the leaf nodes: I run git ls-tree --name-only -r on all the branches and tags and store the results in a dictionary, then go through path-all-sizes.txt. If a file exists in the dictionary, I mark it as present; otherwise, I mark it as deleted.
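A rough sketch of that workaround, for anyone who wants to try the same thing. It is not the exact script I used: the function names are made up, a set stands in for the dictionary, and it assumes each data line of path-all-sizes.txt has the path as its fourth whitespace-separated field after two numeric size columns. Check that against your own report before relying on it; paths that git quotes because of special characters are not handled.

```python
import subprocess


def collect_live_paths(repo="."):
    """Return every path present at the tip of any branch or tag."""
    refs = subprocess.run(
        ["git", "-C", repo, "for-each-ref", "--format=%(refname)",
         "refs/heads", "refs/tags"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    live = set()
    for ref in refs:
        # -z gives NUL-separated, unquoted path names.
        out = subprocess.run(
            ["git", "-C", repo, "ls-tree", "--name-only", "-r", "-z", ref],
            capture_output=True, text=True, check=True,
        ).stdout
        live.update(path for path in out.split("\0") if path)
    return live


def classify_paths(report_lines, live_paths):
    """Map each path in the report to "present" or "deleted".

    Assumed line layout: <unpacked size> <packed size> <date> <path>.
    Header lines are skipped because they do not start with two numbers.
    """
    result = {}
    for line in report_lines:
        fields = line.rstrip("\n").split(None, 3)
        if len(fields) < 4 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        path = fields[3]
        result[path] = "present" if path in live_paths else "deleted"
    return result
```

To use it, read the lines of path-all-sizes.txt from the analysis directory and pass them to classify_paths together with the result of collect_live_paths(). A file only counts as deleted here if no branch or tag tip contains it, which is the stricter definition I assumed in my first comment.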

fdevibe avatar Nov 19 '20 21:11 fdevibe