eclipse.platform
Improve workspace metadata file access
The folder .metadata\.plugins\org.eclipse.core.resources\.projects contains multiple very small files per project.
Saving and loading small files on Windows is inefficient. It costs needless performance (and has overhead on the filesystem due to block size).
For example, the ~1000 projects in the platform workspace need multiple seconds to load (while the splash screen is shown).
Startup time could be improved by loading those files in parallel (see https://github.com/eclipse-platform/eclipse.platform/pull/1219), but it would help much more to just store all that information in a single per-workspace file.
Before I start such a PR, I would like to know if there are any concerns.
On the platform workspace the total size of those files is <10MB and would easily fit into a single file that is completely loaded into memory all at once. I also checked our adopters' workspaces, which show similar statistics.
I would start with combining all .markers and .location files into a single such file for each type.
This API, org.eclipse.core.resources.IProject.getWorkingLocation(String), provides general access to the things located under .projects, and individual downstream technologies manage those. I assume those aren't all read on startup, and I assume you're not trying to change this.
I have a bit of a bad feeling about changing such a distributed design. Naturally, when you change something that's working correctly, there is a significant likelihood that afterwards it will no longer work correctly, or it will just work differently, with a different performance impact. E.g., what will be the impact of closing and opening projects? Are we trading off the startup cost of reading many small files against the cost of repeatedly saving a single much larger file in order to make small changes each time while the IDE is running? I'd rather wait a little longer at start than wait a little longer all the time. (There is a good chance that opening the workspace with a new IDE immediately makes it incompatible with an older IDE; yes, I know that's generally a concern and there is a warning for that.)
If everything is stored in one file, then if the data of just a single project is broken, all projects might be affected. Also, these files may all have different content formats.
Currently, when a single file is broken, the restore would just abort there anyway.
getWorkingLocation(String) could stay the same but no longer be used for .markers and .location files.
I would also be interested to see results from other computers if you enable the related tracing:
| main | 2024-03-01 13:32:03.961 | org.eclipse.core.resources | /debug | org.eclipse.core.internal.utils.Policy | debug | 142 | org.eclipse.core.internal.jobs.ThreadJob(Implicit Job): Restore Markers for workspace: 1709ms |
| main | 2024-03-01 13:32:04.013 | org.eclipse.core.resources | /debug | org.eclipse.core.internal.utils.Policy | debug | 142 | org.eclipse.core.internal.jobs.ThreadJob(Implicit Job): Restore workspace metainfo: starting... |
| main | 2024-03-01 13:32:04.039 | org.eclipse.core.resources | /debug | org.eclipse.core.internal.utils.Policy | debug | 142 | org.eclipse.core.internal.jobs.ThreadJob(Implicit Job): Restore metainfo for /.org.eclipse.egit.core.cmp: 25ms |
| main | 2024-03-01 13:32:04.043 | org.eclipse.core.resources | /debug | org.eclipse.core.internal.utils.Policy | debug | 142 | org.eclipse.core.internal.jobs.ThreadJob(Implicit Job): Restore metainfo for /.org.eclipse.jdt.core.external.folders: 3ms |
| main | 2024-03-01 13:32:04.045 | org.eclipse.core.resources | /debug | org.eclipse.core.internal.utils.Policy | debug | 142 | org.eclipse.core.internal.jobs.ThreadJob(Implicit Job): Restore metainfo for /aa: 2ms |
| main | 2024-03-01 13:32:04.047 | org.eclipse.core.resources | /debug | org.eclipse.core.internal.utils.Policy | debug | 142 | org.eclipse.core.internal.jobs.ThreadJob(Implicit Job): Restore metainfo for /converterJclMin: 2ms |
| main | 2024-03-01 13:32:04.050 | org.eclipse.core.resources | /debug | org.eclipse.core.internal.utils.Policy | debug | 142 | org.eclipse.core.internal.jobs.ThreadJob(Implicit Job): Restore metainfo for /converterJclMin1.5: 2ms |
| main | 2024-03-01 13:32:04.052 | org.eclipse.core.resources | /debug | org.eclipse.core.internal.utils.Policy | debug | 142 | org.eclipse.core.internal.jobs.ThreadJob(Implicit Job): Restore metainfo for /converterJclMin1.7: 2ms |
| main | 2024-03-01 13:32:04.054 | org.eclipse.core.resources | /debug | org.eclipse.core.internal.utils.Policy | debug | 142 | org.eclipse.core.internal.jobs.ThreadJob(Implicit Job): Restore metainfo for /converterJclMin1.8: 2ms |
| main | 2024-03-01 13:32:04.056 | org.eclipse.core.resources | /debug | org.eclipse.core.internal.utils.Policy | debug | 142 | org.eclipse.core.internal.jobs.ThreadJob(Implicit Job): Restore metainfo for /converterJclMin10: 2ms |
...
I.e., an average of 2ms per file sums up to 2 seconds for 1000 files, while reading/writing a single file takes only a few ms, even if it is 10MB.
Somehow it feels like the virus scanner has concerns specifically about this kind of file.
One thing I could think about is having an index file that summarizes all items into one big file. It could work this way:
- If on startup there is no index file, read all the little files and create an index from them.
- If on startup there is an index file, read it and use that information; if that fails, treat it as if the index was not there.
- Every change simply deletes the index file.
That way it will only be a bit slower on the first start after something has changed, but faster on repeated starts.
That would also be compatible with older workspaces.
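A minimal sketch of how such an index could behave, assuming a trivial length-prefixed binary format. The class name MetadataIndex, the .index file name, and the format are all made up for illustration and are not existing Eclipse API:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class MetadataIndex {

    static final String INDEX_NAME = ".index";

    // On startup: prefer the index; on any failure fall back to the small
    // files and rebuild the index for the next start.
    public static Map<String, byte[]> load(Path projectsDir) throws IOException {
        Path index = projectsDir.resolve(INDEX_NAME);
        if (Files.isRegularFile(index)) {
            try {
                return readIndex(index);
            } catch (IOException broken) {
                // treat a broken index as if it were not there
            }
        }
        Map<String, byte[]> entries = readSmallFiles(projectsDir);
        writeIndex(index, entries); // speed up the next start
        return entries;
    }

    // Every change simply deletes the index; the next start rebuilds it.
    public static void invalidate(Path projectsDir) throws IOException {
        Files.deleteIfExists(projectsDir.resolve(INDEX_NAME));
    }

    static Map<String, byte[]> readSmallFiles(Path dir) throws IOException {
        Map<String, byte[]> entries = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                if (!name.equals(INDEX_NAME) && Files.isRegularFile(file)) {
                    entries.put(name, Files.readAllBytes(file));
                }
            }
        }
        return entries;
    }

    static void writeIndex(Path index, Map<String, byte[]> entries) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(index)))) {
            out.writeInt(entries.size());
            for (Map.Entry<String, byte[]> e : entries.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeInt(e.getValue().length);
                out.write(e.getValue());
            }
        }
    }

    static Map<String, byte[]> readIndex(Path index) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(index)))) {
            int count = in.readInt();
            Map<String, byte[]> entries = new TreeMap<>();
            for (int i = 0; i < count; i++) {
                String name = in.readUTF();
                byte[] data = new byte[in.readInt()];
                in.readFully(data);
                entries.put(name, data);
            }
            return entries;
        }
    }
}
```

The key property is that a missing or broken index degrades gracefully to today's behavior of reading the individual files.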
I figured out the root problem: normally a file access on my computer takes ~40us per file, no matter whether the Windows on-access scanner is activated or not. But when the files have been touched, the next open will take 100 times as long. The files are written whenever I shut down Eclipse. So whenever I start Eclipse, they are slow to read. Maybe there is a way to not touch the file if the content did not change... i.e. modify the write instead of the read :-)
Plexus has org.codehaus.plexus.util.io.CachingOutputStream, which only writes once it finds a different byte in the stream. I thought Eclipse was doing the same but never checked that in detail.
@jukzi
I wonder if the following very simple isolated change to org.eclipse.core.internal.localstore.SafeFileOutputStream.commit() might address the primary concerns raised in this issue:
// Requires: java.nio.file.Files, java.nio.file.Path, java.nio.file.StandardCopyOption,
// java.util.Arrays; 'temp' and 'target' are the existing java.io.File fields of the class.
private static long SMALL_FILE_LENGTH = Long.parseLong(System.getProperty("foo", "1000")); //$NON-NLS-1$ //$NON-NLS-2$

protected void commit() throws IOException {
    if (!temp.exists())
        return;
    Path tempPath = temp.toPath();
    Path targetPath = target.toPath();
    boolean copy = true;
    long length = temp.length();
    // For small files, skip the copy when the target already has identical content.
    if (length < SMALL_FILE_LENGTH && target.exists() && target.length() == length) {
        try {
            copy = !Arrays.equals(Files.readAllBytes(tempPath), Files.readAllBytes(targetPath));
        } catch (IOException ex) {
            //$FALL-THROUGH$ comparison failed, fall back to copying
        }
    }
    if (copy) {
        Files.copy(tempPath, targetPath, StandardCopyOption.REPLACE_EXISTING);
    }
    temp.delete();
}
I.e., we have some system property that defines a cutoff size for "small files". We will only perform significant additional work for "small files". The logic checks if the two files have the same length, and then checks if they have the same content bytes, to avoid the copy in that case. We might even modify this idea to check if SMALL_FILE_LENGTH <= 0 and avoid calling temp.length(), so that someone can completely disable all the overhead.
Questions:
- Does this help address the issue?
- What should be the (default) value of the small-size cutoff?
- Do we really need a system property to allow that value to be configured?
- Does this raise any new performance concerns?
@merks the Files.readAllBytes(tempPath) would read a file that was just touched, i.e. it waits for the virus scan; a solution needs to avoid writing the temp file.
As per the design/contract of this class, writing the temp file is unavoidable. The only possibility would be to keep the bytes while they are written: one can extend BufferedOutputStream to access the buffer and position, so the bytes are at hand without reading them again, and one only needs to delete the temp file if its content equals the bytes that are already there.
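A minimal sketch of that idea, using ByteArrayOutputStream instead of BufferedOutputStream for simplicity (its protected buf/count fields expose the written bytes without reading them back from disk). The class name and the wroteToDisk flag are illustrative, not the actual SafeFileOutputStream:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Keeps the written bytes in memory; on close, only touches the target file
// if its content actually changed, so an unchanged file is never rewritten.
public class CompareOnCloseOutputStream extends ByteArrayOutputStream {

    private final Path target;
    boolean wroteToDisk; // for illustration: did close() actually write?

    public CompareOnCloseOutputStream(Path target) {
        this.target = target;
    }

    @Override
    public void close() throws IOException {
        byte[] content = Arrays.copyOf(buf, count);
        if (Files.isRegularFile(target) && Files.size(target) == count
                && Arrays.equals(Files.readAllBytes(target), content)) {
            wroteToDisk = false; // unchanged: do not touch the file at all
        } else {
            Files.write(target, content);
            wroteToDisk = true;
        }
    }
}
```

Note the trade-off: the full content is held in memory until close, which matters for the out-of-memory concern raised later in this thread.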
The write happens on close, and as suggested by Christoph, this is part of being "safe". Also, it was my understanding that you were primarily focused on performance tuning of startup, not performance tuning of closing the IDE, which I suspect no one much cares about. And for the tiny chance that the file sizes are the same but the contents are different, the tiny bit of overhead is too tiny to care about. But I could have overlooked something...
If one looks at how BufferedOutputStream is implemented, one can see that content is not flushed to disk right away, so if one handles it smartly enough (e.g. overrides the close method), this will write/delete a file of byte length 0, which should not kick in any virus scanners.
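A small demonstration of that behavior, assuming the write fits into BufferedOutputStream's default 8 KiB buffer; the class and method names are illustrative:

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class BufferDemo {
    // Writes 'content' through a BufferedOutputStream and reports the on-disk
    // length observed BEFORE the stream is flushed: while the bytes are still
    // in the buffer, the file on disk has length 0.
    public static long lengthBeforeFlush(Path file, byte[] content) throws IOException {
        try (BufferedOutputStream out =
                new BufferedOutputStream(Files.newOutputStream(file))) {
            out.write(content); // stays in the in-memory buffer (if < 8 KiB)
            return Files.size(file); // observed on-disk length at this point
        } // close() flushes; only now do the bytes reach the disk
    }
}
```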
So whenever I start Eclipse they are slow to read. Maybe there is a way to not touch the file if the content did not change... i.e. modify the write instead of the read :-)
If your analysis is correct, then additionally reading the files after writing them during shutdown should make the startup much faster (at the cost of making the shutdown slower), without any changes to how the files are managed as such.
@merks first of all: thanks for trying to find a simple solution. I confirm that the trick of reading only the file length first is a cool idea. Pure file attribute reading indeed returns before the virus scanner kicks in. My benchmark was restarting Eclipse, which includes stop and start (in that order). Also, the benchmark assumes the files are unchanged (which I assume is the most common case). In that case, with your patch applied, the temp file read kicks in the virus scanner, after being additionally slowed down by the file attribute access. I tried to construct a JMH microbenchmark to get further insight into what exactly takes how long. However, normal benchmarks assume a warmup; in this particular application the JVM has no chance to significantly warm up, because it starts/stops only once. So I could only get rough numbers, which should not be seen as absolute numbers, but give an indication of how expensive the operations are (average of 1000 iterations, Windows 10, JDK 22):
- Reading length of file: ~40us
- Reading a file that was already read: ~80us
- Writing a new (temp) file: ~360us
- Deleting an existing file: ~200us
- Renaming a new (temp) file: ~270us
- Overwriting an existing file which was already read: ~230us
- Reading a file that was not already read: ~1000us
Even though I agree that stop time is not as important as start time, I don't see a good reason why we should sacrifice stop time when it is possible to skip writing, renaming AND deleting the temp file if the content did not change. Also note that @iloveeclipse was only worried about stop time.
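A back-of-envelope estimate based on the rough averages above, assuming all files are unchanged and approximating the current commit path as write temp + overwrite target + delete temp; these are assumptions for illustration, not measurements of the actual SafeFileOutputStream code path:

```java
public class ShutdownEstimate {
    // Current path per unchanged file: write temp (~360us) + overwrite
    // target (~230us) + delete temp (~200us).
    public static long currentMicros(int files) {
        return (long) files * (360 + 230 + 200);
    }

    // Skip path per unchanged file: read length (~40us) + compare against
    // an already-read file (~80us); no write, rename, or delete.
    public static long skipMicros(int files) {
        return (long) files * (40 + 80);
    }
}
```

For 1000 unchanged files that is roughly 0.79s versus 0.12s of shutdown IO, under these assumptions.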
My concern is and was any additional IO for metadata, independent of whether it happens on start or on stop. We saw in the past that for workspaces with the metadata located on NFS (typical Linux installations with home mounted on a server), both startup time and shutdown time may be heavily affected by reading or writing metadata. So if a customer expects the application to shut down in a minute, and it hangs writing metadata even longer, they tend to kill the application, and that in turn makes startup times even worse, because recovering from a crash requires applying all the previously not properly saved data first.
So we should sacrifice neither startup nor shutdown times. Whatever solution is chosen shouldn't be worse than today for either aspect.
An additional constraint: the solution should also not have the potential to cause an out-of-memory exception.
I'm a bit doubtful that it's possible to determine whether a file's contents have changed with zero overhead, which suggests that no solution can meet the acceptance criteria. One solution was creating a checksum, which is not zero cost, so that seems to be excluded; and that still involved double-checking the actual on-disk contents when the checksums match, which is also not zero cost (and the same cost as in the "simple" solution proposed here).
Note my comment about using a system property to determine the "small size" and potentially using a value <= 0 to simply do nothing. It could even default to zero so that @iloveeclipse and everyone else is opted out by default. The EPP packages, and the SDK, could set it to a non-zero value. That seems a reasonable compromise, but I'm not sure if folks are inclined to compromise on anything. (I don't want to digress, but on a related front, I recall very recently that @vogella wanted to eliminate the progress monitor that displays during shutdown because "it just flashes by", so in general EPP IDE shutdown appears to be a non-issue. I wouldn't want to make a general statement about all Eclipse applications and their startup and shutdown needs, though!)
In any case, I think an acceptance criterion that allows only zero cost pretty much precludes a solution. Someone else is welcome to try to implement such a solution, but they should do so knowing how poorly all the previous attempts have gone over...
so in general EPP IDE shutdown
SDK, not EPP, and only on Linux. On Windows it shuts down slowly enough to justify a dialog.
I sometimes have 15 or more IDEs open, and when I need to restart Windows, I shut down each one in succession so that they are cleanly shut down. Sometimes I see the dialog for ~2 seconds, though mostly it does just flash by...
An additional constraint: the solution should also not have the potential to cause an out-of-memory exception.
Maybe someone could provide concrete data showing that their marker metadata could not fit into memory multiple times, so that we can talk about a solution? My files are a few KB, or a few MB altogether, while I have plenty of GB of free memory... If we agree that the data fits into memory, we could even compress the file content, if it is big, so that @iloveeclipse's use case of big content over a slow network is even improved? In my experience .markers files compress very well. Example:
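As a synthetic illustration of that compression claim (real .markers files use a binary format; this only shows how repetitive marker-like entries compress under GZIP, and all names here are made up):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class CompressionDemo {
    // GZIP-compresses the given bytes in memory and returns the result.
    public static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(data);
        }
        return bytes.toByteArray();
    }
}
```

Feeding it a few thousand near-identical marker-style lines typically shrinks the data by an order of magnitude, because the per-marker structure repeats almost verbatim.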
We saw marker files of 20 MB and bigger, up to 200 - 300 MB. It could be that it compresses better in the latest SDK; the problem we investigated was around 4.21 or 4.25, I can't say exactly now. The .markers files grow incrementally over the session, depending on the workspace size / markers in the workspace / number of builds (marker changes). I'm pretty sure one can check where it is going by using a workspace with ~200,000 files and a few markers per file. Do a few clean builds and watch the .markers file sizes. This is a "medium" big workspace, I would say.
The content of the .markers file is the same as what is shown in the Problems view. A 200 MB file for a 200k-files workspace gives an average of 1000 bytes per file, which is plausible. Can we also assume that you have another GB of free memory in such a workspace? If not, what happens if you press sort in the Problems view? It also duplicates all entries for sorting.
In our case the problem is not the heap, but NFS. We run with defaults, starting with 12 GB and a max of 31 GB, depending on the workstation RAM installed (64 - 256 GB). I don't have the original workspace from the customers anymore, but I assume one can pretty easily create a workspace of any size with any number of markers via a project generator, like here, 11,000 files with 4 markers each: https://github.com/iloveeclipse/java-project-generator/tree/manyWarnings.
So, anybody else having trouble with heap? Or can we agree that heap is not a problem for markers?
The problem with the SDK is that you don't know how it is used. It can be an "embedded" Linux with 2 GB RAM where any extra ~5 MB might be too much, whereas in our case we start with 12 GB of memory for one process alone...