bfg-repo-cleaner
Add --blob-exec to run system commands for each blob
This is a rebase of Paul Draper's implementation of blob-exec: https://github.com/rtyley/bfg-repo-cleaner/pull/83
Is there any chance of merging this? I'm currently using it to replace tabs with spaces, exactly as Paul described in his use cases.
I'm interested in whether there are some metrics that show how much faster this is than running the equivalent filter-branch, if anyone had those numbers.
It doesn't work for me.
What am I missing?
bfg2="java -jar /home/ubuntu/bfg-repo-cleaner/bfg/target/bfg-1.12.4-SNAPSHOT-paul-blob-exec-234ba67.jar"
Exception in thread "main" java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: java.io.IOException: Stream closed
at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2348)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2320)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197)
at com.google.common.cache.LocalCache.get(LocalCache.java:3937)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941)
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824)
at com.madgag.git.bfg.MemoUtil$$anonfun$concurrentCleanerMemo$1$$anon$1.apply(memo.scala:60)
at com.madgag.git.bfg.GitUtil$$anon$1.apply(GitUtil.scala:69)
at com.madgag.git.bfg.CleaningMapper$class.replacement(GitUtil.scala:44)
at com.madgag.git.bfg.GitUtil$$anon$1.replacement(GitUtil.scala:68)
at com.madgag.git.bfg.cleaner.protection.ProtectedObjectDirtReport$$anonfun$reportsFor$1$$anonfun$2.apply(ProtectedObjectDirtReport.scala:44)
at com.madgag.git.bfg.cleaner.protection.ProtectedObjectDirtReport$$anonfun$reportsFor$1$$anonfun$2.apply(ProtectedObjectDirtReport.scala:44)
at scala.util.Either.fold(Either.scala:99)
at com.madgag.git.bfg.cleaner.protection.ProtectedObjectDirtReport$$anonfun$reportsFor$1.apply(ProtectedObjectDirtReport.scala:44)
at com.madgag.git.bfg.cleaner.protection.ProtectedObjectDirtReport$$anonfun$reportsFor$1.apply(ProtectedObjectDirtReport.scala:42)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.Iterator$class.foreach(Iterator.scala:750)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)
at scala.collection.MapLike$DefaultKeySet.foreach(MapLike.scala:174)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.AbstractSet.scala$collection$SetLike$$super$map(Set.scala:47)
at scala.collection.SetLike$class.map(SetLike.scala:92)
at scala.collection.AbstractSet.map(Set.scala:47)
at com.madgag.git.bfg.cleaner.protection.ProtectedObjectDirtReport$.reportsFor(ProtectedObjectDirtReport.scala:42)
at com.madgag.git.bfg.cleaner.CLIReporter.reportProtectedCommitsAndTheirDirt(Reporter.scala:113)
at com.madgag.git.bfg.cleaner.CLIReporter.reportObjectProtection(Reporter.scala:86)
at com.madgag.git.bfg.cleaner.RepoRewriter$.rewrite(RepoRewriter.scala:94)
at com.madgag.git.bfg.cli.Main$$anonfun$1.apply(Main.scala:59)
at com.madgag.git.bfg.cli.Main$$anonfun$1.apply(Main.scala:34)
at scala.Option.map(Option.scala:146)
at com.madgag.git.bfg.cli.Main$.delayedEndpoint$com$madgag$git$bfg$cli$Main$1(Main.scala:33)
at com.madgag.git.bfg.cli.Main$delayedInit$body.apply(Main.scala:27)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at com.madgag.git.bfg.cli.Main$.main(Main.scala:27)
at com.madgag.git.bfg.cli.Main.main(Main.scala)
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Stream closed
at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2348)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2320)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197)
at com.google.common.cache.LocalCache.get(LocalCache.java:3937)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941)
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824)
at com.madgag.git.bfg.MemoUtil$$anonfun$concurrentCleanerMemo$1$$anon$1.apply(memo.scala:60)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.immutable.List.map(List.scala:285)
at com.madgag.git.bfg.cleaner.TreeBlobModifier$class.apply(TreeBlobModifier.scala:38)
at com.madgag.git.bfg.cli.CLIConfig$$anonfun$blobExecModifier$1$$anon$3.apply(CLIConfig.scala:181)
at com.madgag.git.bfg.cli.CLIConfig$$anonfun$blobExecModifier$1$$anon$3.apply(CLIConfig.scala:181)
at scala.Function$$anonfun$chain$1$$anonfun$apply$1.apply(Function.scala:24)
at scala.Function$$anonfun$chain$1$$anonfun$apply$1.apply(Function.scala:24)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at scala.collection.TraversableOnce$class.$div$colon(TraversableOnce.scala:136)
at scala.collection.AbstractTraversable.$div$colon(Traversable.scala:104)
at scala.Function$$anonfun$chain$1.apply(Function.scala:24)
at com.madgag.git.bfg.cleaner.ObjectIdCleaner$$anonfun$4.apply(ObjectIdCleaner.scala:124)
at com.madgag.git.bfg.cleaner.ObjectIdCleaner$$anonfun$4.apply(ObjectIdCleaner.scala:118)
at com.madgag.git.bfg.MemoUtil$$anon$3.load(memo.scala:74)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319)
... 41 more
Caused by: java.io.IOException: Stream closed
at java.lang.ProcessBuilder$NullOutputStream.write(ProcessBuilder.java:434)
at java.io.OutputStream.write(OutputStream.java:116)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at com.madgag.git.bfg.cleaner.BlobExecModifier$class.fix(BlobExecModifier.scala:25)
at com.madgag.git.bfg.cli.CLIConfig$$anonfun$blobExecModifier$1$$anon$3.fix(CLIConfig.scala:181)
at com.madgag.git.bfg.cleaner.TreeBlobModifier$$anonfun$1.apply(TreeBlobModifier.scala:32)
at com.madgag.git.bfg.cleaner.TreeBlobModifier$$anonfun$1.apply(TreeBlobModifier.scala:31)
at com.madgag.git.bfg.MemoUtil$$anon$3.load(memo.scala:74)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319)
... 67 more
I'm interested in whether there are some metrics that show how much faster this is than running the equivalent filter-branch, if anyone had those numbers.
Assuming that all files are replaced with new versions, this is my estimate of the cost of each.
Given my experience with this branch (running on Linux) and with filter-branch, my view is that the savings are a function of:
- the number of commits, C
- the number of unique files in the project, UF
- the number of files replaced in a given commit
- the number of files that exist in the file system at any given time
- the speed of the file system
- the time it takes to process each individual file
With BFG, the cost of the replacement is a function of (k1 * C + k2 * UF) / (k3 * SpeedFileSystem), where k1, k2 and k3 are constants.
With filter-branch, this is what happens:
For every commit:
- check out the commit, replacing every file in the file system whose contents differ from the commit's version
- run the filter on each file (duplicates are not detected)
- create the new commit
For example, say a filter replaces the contents of every file in every revision.
For revision 1, the contents are checked out. Let's say we have n1 files. All n1 files are replaced with new versions, and these changes are committed. Now comes the truly expensive operation: checking out the next commit. This means replacing every single file in the tree (even those that were not part of the actual changes) with its version according to that commit. We then have to process every single file in the repo again (they are all dirty). Only after they have all been processed will git realize that just the files changed in commit 2 actually differ (with respect to the processed version of commit 1).
This process means that the cost of filter-branch includes:
- the cost of recreating every file in every revision (checking out and writing the files: avg files in repo * # of commits)
- the cost of processing every file in every revision (avg files in repo * # of commits)
So the cost of BFG is proportional to the average number of files changed per commit multiplied by the number of commits (basically, the number of unique blobs found in the repo), while the cost of filter-branch is proportional to the average number of files IN the repo multiplied by the number of commits.
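To make the difference concrete, here is a minimal sketch of the cost model described above. It is a toy simulation, not how either tool is implemented: a commit is modeled as a plain path-to-content mapping, and the names build_history, per_commit_cost and unique_blob_cost are hypothetical. It only counts how many times the filter would run under each assumption.

```python
from typing import Dict, List

# Toy model: a commit is a full snapshot mapping path -> blob content.
Commit = Dict[str, str]


def build_history(num_commits: int, files_in_repo: int,
                  changed_per_commit: int) -> List[Commit]:
    """Build a linear history where each commit rewrites a few files."""
    tree = {f"file{i}": f"file{i}@0" for i in range(files_in_repo)}
    history = [dict(tree)]
    for c in range(1, num_commits):
        for k in range(changed_per_commit):
            idx = (c * changed_per_commit + k) % files_in_repo
            # New content that has never been seen before in this history.
            tree[f"file{idx}"] = f"file{idx}@{c}"
        history.append(dict(tree))
    return history


def per_commit_cost(history: List[Commit]) -> int:
    """Filter runs on every file of every commit (the filter-branch model above)."""
    return sum(len(snapshot) for snapshot in history)


def unique_blob_cost(history: List[Commit]) -> int:
    """Filter runs once per distinct blob content (the BFG model above)."""
    seen = set()
    for snapshot in history:
        seen.update(snapshot.values())
    return len(seen)


history = build_history(num_commits=100, files_in_repo=50, changed_per_commit=2)
print("per-commit model:", per_commit_cost(history))
print("unique-blob model:", unique_blob_cost(history))
```

In this toy history the per-commit model runs the filter 100 * 50 times, while the unique-blob model runs it once for each of the 50 initial files plus once for each of the 2 files changed in the 99 later commits.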
Let's say we can do 100 file-processing operations per second, and we have 1M commits, with an average of 10k files in the repo and 10 files modified per commit (I am thinking of Linux here).
Let us assume that the cost of checking out the files (filter-branch) and processing the commits (BFG and filter-branch) is negligible (it is not, but bear with me). If my numbers are right, with filter-branch it would take:
(1M * 10k) / 100 seconds to process this repo => 1157 days => about 3 years
With BFG it would take:
(1M * 10) / 100 seconds => about 28 hours.
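The arithmetic can be checked with a few lines of Python. This is only a restatement of the estimate above under the same assumptions (100 operations per second, checkout and commit costs ignored); the function name estimate_seconds is made up for the illustration.

```python
SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400

# Figures taken from the estimate above.
commits = 1_000_000
files_in_repo = 10_000          # average files present in the repo
files_changed_per_commit = 10   # average files modified per commit
ops_per_second = 100            # file-processing operations per second


def estimate_seconds(commits: int, files_per_commit: int, ops_per_second: int) -> float:
    """Total time if the filter runs files_per_commit times for each commit."""
    return commits * files_per_commit / ops_per_second


filter_branch_seconds = estimate_seconds(commits, files_in_repo, ops_per_second)
bfg_seconds = estimate_seconds(commits, files_changed_per_commit, ops_per_second)

print(f"filter-branch: {filter_branch_seconds / SECONDS_PER_DAY:.0f} days")
print(f"BFG: {bfg_seconds / SECONDS_PER_HOUR:.1f} hours")
```

The ratio between the two estimates is simply files_in_repo / files_changed_per_commit, which is 1000 for these numbers.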
So, in conclusion: BFG processing time is O(#commits * #avgNumberOfFilesPerCommit), while filter-branch processing time is O(#commits * #avgNumberOfFilesInRepo).
May I be allowed to use and learn