Open-XML-SDK
PresentationDocument.Save() performance weirdness
Description
I've been working on some code that searches through all the slides of a PPTX file looking for certain types of content, and modifies the XML if certain patterns are matched. I noticed that with some larger files, the time taken to save the file can be excessive (e.g. nearly 1 minute for a file under 10MB in size). Unfortunately I can't share files that perform really badly, but I have managed to demonstrate some anomalies with a file I generated myself.
Information
- .NET Target: .NET Framework 4.7.2
- DocumentFormat.OpenXml Version: 2.14.0
Repro
Using this sample file: map.pptx (This file was generated by downloading the PPTX file from this page, then copying the last slide ("Editable World Map") into a new blank presentation with PowerPoint 2019 and copying and pasting the slide 17 times for a total of 18 copies)
using System;
using System.Diagnostics;
using System.IO;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Presentation;

namespace ReproOxmlSaveSlowness
{
    class Program
    {
        static void Main(string[] args)
        {
            var sourceFileName = "map.pptx";
            var tempFileName = "temp.pptx";

            // Make a temporary copy of the original file.
            File.Copy(sourceFileName, tempFileName, true);

            // Open the file read/write, and change the name of the first slide.
            using (var doc = PresentationDocument.Open(tempFileName, true))
            {
                var firstSlideId = doc.PresentationPart.Presentation.SlideIdList.FirstChild as SlideId;
                var firstSlidePart = doc.PresentationPart.GetPartById(firstSlideId.RelationshipId) as SlidePart;
                firstSlidePart.Slide.CommonSlideData.Name = "Some slide name";

                var stopwatch1 = new Stopwatch();
                stopwatch1.Start();
                doc.Save();
                stopwatch1.Stop();
                Console.WriteLine($"Time to save after changing 1 slide: {stopwatch1.Elapsed.TotalSeconds} sec");
            }

            // Make a fresh temporary copy of the original file.
            File.Copy(sourceFileName, tempFileName, true);

            var stopwatch2 = new Stopwatch();
            int numSlides = 0;

            // Open the file read/write, this time just reading the names of all slides without
            // modifying anything.
            using (var doc = PresentationDocument.Open(tempFileName, true))
            {
                foreach (SlideId slideId in doc.PresentationPart.Presentation.SlideIdList)
                {
                    var slidePart = doc.PresentationPart.GetPartById(slideId.RelationshipId) as SlidePart;
                    Console.WriteLine($"Slide name: {slidePart.Slide.CommonSlideData.Name}");
                    ++numSlides;
                }

                var stopwatch3 = new Stopwatch();
                stopwatch3.Start();
                doc.Save();
                stopwatch3.Stop();
                Console.WriteLine($"Time to save after looking at {numSlides} slides: {stopwatch3.Elapsed.TotalSeconds} sec");

                // Start timing just before the using block ends, so stopwatch2 measures only Dispose().
                stopwatch2.Start();
            }
            stopwatch2.Stop();
            Console.WriteLine($"Time to close after looking at {numSlides} slides: {stopwatch2.Elapsed.TotalSeconds} sec");

            File.Delete(tempFileName);
        }
    }
}
Observed
On my (admittedly rather elderly) development PC, I see the following output:
Time to save after changing 1 slide: 0.6335861 sec
Slide name:
Slide name:
Slide name:
Slide name:
Slide name:
Slide name:
Slide name:
Slide name:
Slide name:
Slide name:
Slide name:
Slide name:
Slide name:
Slide name:
Slide name:
Slide name:
Slide name:
Slide name:
Time to save after looking at 18 slides: 4.8866107 sec
Time to close after looking at 18 slides: 4.7443786 sec
Expected
There were a few surprises for me here:
1. I expected that after merely accessing (and not writing to) all 18 slides, saving would effectively be a no-op (as no content has been modified). However, it seems a lot more time (almost 8x) is being spent saving the unmodified (but more accessed) document than the document with a small modification to one slide. This suggests that Save() is saving every part that has been accessed at all, even if not modified.
2. In the case where all 18 slides are touched, it seems excessive for the Save() operation to take nearly 5 seconds, when PowerPoint 2019 is able to save the same file in well under a second on the same machine.
3. Closing the document (implicitly, by disposing it) took roughly as long as saving, suggesting that the SDK is saving everything again, rather than only saving parts that have changed since the last Save() operation, as I expected.
Point 3 is simple for me to work around: I can just avoid calling Save() explicitly and allow the document to save itself when disposed. I'm not personally concerned about it, but I doubt I'm the only person to be surprised by it, so I thought it was worth mentioning.
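For illustration, the Point 3 workaround might look like the sketch below. It reuses the calls from the repro; the class and method names (SaveOnDispose, RenameFirstSlide) are made up for this example, and it assumes the document is opened with the SDK's default auto-save behaviour, so that disposing it writes the changes.

```csharp
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Presentation;

static class SaveOnDispose
{
    // Hypothetical helper: renames the first slide and relies on Dispose()
    // (the end of the using block) to write the change, so the package is
    // only saved once.
    public static void RenameFirstSlide(string path, string newName)
    {
        using (var doc = PresentationDocument.Open(path, true))
        {
            var firstSlideId = (SlideId)doc.PresentationPart.Presentation.SlideIdList.FirstChild;
            var slidePart = (SlidePart)doc.PresentationPart.GetPartById(firstSlideId.RelationshipId);
            slidePart.Slide.CommonSlideData.Name = newName;

            // Deliberately no doc.Save() here: calling it and then disposing
            // appears to save everything twice (see Point 3).
        }
    }
}
```

This only avoids the duplicate save; the cost of the single save on dispose is unchanged.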
I can also work around Point 1 by making two passes through the presentation (open it read-only to scan and make a note of content that needs to be updated, then close and reopen it read/write to make the necessary changes). However, it seems like implementing a "dirty" flag (if feasible) in the OXML SDK for parts that have actually been modified would be more efficient. This would also address Point 3, if the "dirty" flags were reset as part of the Save() operation.
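To make the two-pass workaround for Point 1 concrete, here is a minimal sketch. The TwoPassUpdate class, the NeedsUpdate predicate (which simply treats unnamed slides as needing a change) and the Run method are placeholders I made up for this example; my real matching logic is more involved. Only SDK calls already used in the repro appear here.

```csharp
using System.Collections.Generic;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Presentation;

static class TwoPassUpdate
{
    // Hypothetical predicate standing in for the real pattern matching.
    public static bool NeedsUpdate(string slideName) => string.IsNullOrEmpty(slideName);

    // Returns the number of slides that were modified.
    public static int Run(string path, string newName)
    {
        var idsToUpdate = new List<string>();

        // Pass 1: open read-only and note which slides need changes.
        using (var doc = PresentationDocument.Open(path, false))
        {
            foreach (SlideId slideId in doc.PresentationPart.Presentation.SlideIdList)
            {
                var slidePart = (SlidePart)doc.PresentationPart.GetPartById(slideId.RelationshipId);
                if (NeedsUpdate(slidePart.Slide.CommonSlideData.Name?.Value))
                {
                    idsToUpdate.Add(slideId.RelationshipId.Value);
                }
            }
        }

        // Nothing matched, so the file is never opened for writing.
        if (idsToUpdate.Count == 0)
        {
            return 0;
        }

        // Pass 2: reopen read/write and touch only the slides noted above.
        using (var doc = PresentationDocument.Open(path, true))
        {
            foreach (var relationshipId in idsToUpdate)
            {
                var slidePart = (SlidePart)doc.PresentationPart.GetPartById(relationshipId);
                slidePart.Slide.CommonSlideData.Name = newName;
            }
            // Changes are written when the document is disposed.
        }

        return idsToUpdate.Count;
    }
}
```

The trade-off is that every slide is parsed in the first pass and matching slides are parsed again in the second, so this only helps when few slides need changes.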
Point 2 is more of a general observation - anecdotally, the SDK seems to be slow at saving documents in cases where a large number of parts (e.g. hundreds) need to be updated, particularly when large overall file sizes (tens or hundreds of megabytes) are involved.