Open-XML-SDK
Large parts cannot be written on .NET Core due to OutOfMemoryException
Description
I was using ClosedXML to create large Excel files (100k rows with 63 columns) and ran into an OutOfMemoryException. I found some examples of OpenXML using SAX-style writing, so I tried switching to OpenXML, but it didn't help. I removed all of my own code (reading from the DB, etc.) to isolate the problem, but I still get an OutOfMemoryException.
My code is based on http://polymathprogrammer.com/2012/08/06/how-to-properly-use-openxmlwriter-to-write-large-excel-files/
Information
- .NET Target: .NET Core 3.1
- DocumentFormat.OpenXml Version: 2.11.3
EDIT: The same code works fine on .NET Framework 4.7.2 with the same DocumentFormat.OpenXml version.
Repro

This is the simple code I am using at the moment. I am testing with RAM limited to 500 MB (for testing purposes). I don't think this code should consume that much RAM.
```csharp
using (SpreadsheetDocument document = SpreadsheetDocument.Create(filePath, SpreadsheetDocumentType.Workbook))
{
    document.AddWorkbookPart();
    WorksheetPart wsp = document.WorkbookPart.AddNewPart<WorksheetPart>();

    using (OpenXmlWriter writer = OpenXmlWriter.Create(wsp))
    {
        List<OpenXmlAttribute> oxa;

        writer.WriteStartElement(new Worksheet());
        writer.WriteStartElement(new SheetData());

        for (int i = 0; i < 100000; i++)
        {
            oxa = new List<OpenXmlAttribute>();
            // Row references ("r") are 1-based in SpreadsheetML.
            oxa.Add(new OpenXmlAttribute("r", null, (i + 1).ToString()));
            writer.WriteStartElement(new Row(), oxa);

            for (int j = 0; j < 40; j++)
            {
                oxa = new List<OpenXmlAttribute>();
                oxa.Add(new OpenXmlAttribute("t", null, "str"));
                writer.WriteStartElement(new Cell(), oxa);
                writer.WriteElement(new CellValue("test"));
                writer.WriteEndElement();
            }

            writer.WriteEndElement();
        }

        writer.WriteEndElement(); // end of SheetData
        writer.WriteEndElement(); // end of Worksheet
    }

    using (OpenXmlWriter writer = OpenXmlWriter.Create(document.WorkbookPart))
    {
        writer.WriteStartElement(new Workbook());
        writer.WriteStartElement(new Sheets());
        writer.WriteElement(new Sheet() { Id = document.WorkbookPart.GetIdOfPart(wsp), SheetId = 1, Name = "Test" });
        writer.WriteEndElement();
        writer.WriteEndElement();
    }
}
```
Observed
```
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.IO.MemoryStream.set_Capacity(Int32 value)
   at System.IO.MemoryStream.EnsureCapacity(Int32 value)
   at System.IO.MemoryStream.Write(Byte[] buffer, Int32 offset, Int32 count)
   at System.Xml.XmlUtf8RawTextWriter.FlushBuffer()
   at System.Xml.XmlUtf8RawTextWriter.RawText(Char* pSrcBegin, Char* pSrcEnd)
   at System.Xml.XmlUtf8RawTextWriter.RawText(String s)
   at System.Xml.XmlUtf8RawTextWriter.WriteEndElement(String prefix, String localName, String ns)
   at System.Xml.XmlWellFormedWriter.WriteEndElement()
   at DocumentFormat.OpenXml.OpenXmlPartWriter.WriteEndElement()
```
Expected
An Excel file at `filePath` with 100k rows and 40 columns, with the string "test" in every cell.
I closed the issue by mistake, so I am reopening it. While trying to find more information about this, I found some related issues with System.IO.Packaging. Is there a workaround to temporarily fix my problem? I really need to generate large Excel files on .NET Core.
Related issues: https://github.com/dotnet/runtime/issues/23750 https://github.com/dotnet/runtime/issues/1544
This is definitely dotnet/runtime#1544. I am unsure of a good workaround at this time.
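For anyone trying to understand the root cause: dotnet/runtime#1544 is about `ZipArchiveEntry` buffering an entry's entire uncompressed contents in a `MemoryStream` when the archive is opened for update, which is the mode `System.IO.Packaging` uses for read/write packages. A minimal sketch of the difference using only the BCL (the entry name and row count here are illustrative, not the SDK's actual internals):

```csharp
using System;
using System.IO;
using System.IO.Compression;

class ZipModeSketch
{
    static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "zipmode-sketch.zip");

        // Create mode: each entry is compressed straight to the output stream,
        // so a huge entry never has to fit in memory all at once.
        using (var zip = new ZipArchive(File.Create(path), ZipArchiveMode.Create))
        using (var writer = new StreamWriter(zip.CreateEntry("xl/worksheets/sheet1.xml").Open()))
        {
            for (int i = 1; i <= 100_000; i++)
            {
                writer.WriteLine($"<row r=\"{i}\"/>");
            }
        }

        // Update mode (what a ReadWrite package uses under the hood): opening
        // an existing entry decompresses it into a MemoryStream, so a large
        // worksheet part can exhaust memory, hitting the Int32-sized
        // MemoryStream capacity path seen in the stack trace above.
        using (var zip = ZipFile.Open(path, ZipArchiveMode.Update))
        {
            ZipArchiveEntry entry = zip.GetEntry("xl/worksheets/sheet1.xml");
            Console.WriteLine($"Uncompressed size: {entry.Length:N0} bytes");
        }

        File.Delete(path);
    }
}
```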
I am having the same problem. Has someone found a workaround for this issue?
Thank you in advance
Hello @twsouthwick. Did you see the workaround that was suggested in the https://github.com/dotnet/runtime/issues/1544 discussion?
Hi @Viktor-36, how have you fixed or worked around this issue? I need to generate large Excel files in .NET Core too.
Hi @twsouthwick, have you read the message from @nike61?
We are also running into that exact same issue, even though we call `SpreadsheetDocument.Create` with a `FileStream`.

Our environment is .NET 5 on Azure App Service.
Our company is currently being hit hard by this issue for Wordprocessing, Presentation, and Spreadsheet documents alike. We are seeing big spikes of memory consumption in our production environment when modifying the OpenXml of certain documents. This causes significant problems for our enterprise customers. Any updates or workarounds would be most appreciated.
@sorensenmatias @clement911 @rsdelapaz
Sorry for the spam; I got something working and thought it might help someone looking for a workaround. I don't love my solution, but it did work. Since I wasn't able to write to a second `OpenXmlWriter`, I did this (order is important):
- Create a `FileStream` (I used `File.Create`).
- Create a `Package`, passing in the `FileStream`, with `FileMode.Create` and `FileAccess.Write`.
- Create a `SpreadsheetDocument` via `SpreadsheetDocument.Create`.
- Write your large `WorksheetPart` via an `OpenXmlWriter`.
- Close and dispose of the writer, the package, the file stream, etc.
- Create a `FileStream` (open this time: `File.Open` with `FileMode.Open`, `FileAccess.ReadWrite`, and `FileShare.None`).
- Create a `Package`, passing in the `FileStream`, with `FileMode.Open` and `FileAccess.ReadWrite`.
- Create a `SpreadsheetDocument` via `SpreadsheetDocument.Open`.
- Create an `OpenXmlWriter` for the `WorkbookPart` and add the elements for `Workbook` and `Sheets`, then associate the `Sheet` you added on the original create; close and dispose of those objects, and done.
What I found was that I was only able to use `FileAccess.Write` on the first dataset I wrote to the spreadsheet; if I tried to write anything else, it would throw an exception (where ReadWrite did not, so I can only assume that once I start a second writer, the OpenXml library needs to read something). That's why I had to use two sets of operations (the ReadWrite part at the end is very small, so there are no memory concerns there, as it doesn't decompress the large sheet into memory).

Outside of the acrobatics, the limitation of this approach seems to be that you can only have one large sheet (but from the examples I've seen, many people only write one large sheet per spreadsheet, so this might help someone).
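The steps above can be sketched as follows. This is a minimal outline, not a complete sample: it assumes a 2.x-era SDK where `SpreadsheetDocument.Create` accepts a write-only `Package` (later comments in this thread report that newer versions reject that), and `filePath` is whatever target path you use.

```csharp
// Pass 1: write-only package; SAX-write the large worksheet part.
string worksheetPartId;
using (FileStream fs = File.Create(filePath))
using (Package package = Package.Open(fs, FileMode.Create, FileAccess.Write))
using (SpreadsheetDocument document = SpreadsheetDocument.Create(package, SpreadsheetDocumentType.Workbook))
{
    document.AddWorkbookPart();
    WorksheetPart wsp = document.WorkbookPart.AddNewPart<WorksheetPart>();
    worksheetPartId = document.WorkbookPart.GetIdOfPart(wsp);

    using (OpenXmlWriter writer = OpenXmlWriter.Create(wsp))
    {
        writer.WriteStartElement(new Worksheet());
        writer.WriteStartElement(new SheetData());
        // ... write the large rows/cells here, as in the repro above ...
        writer.WriteEndElement(); // SheetData
        writer.WriteEndElement(); // Worksheet
    }
}

// Pass 2: reopen read/write (cheap, since the large part is never loaded)
// and write the small workbook part that references the sheet.
using (FileStream fs = File.Open(filePath, FileMode.Open, FileAccess.ReadWrite, FileShare.None))
using (Package package = Package.Open(fs, FileMode.Open, FileAccess.ReadWrite))
using (SpreadsheetDocument document = SpreadsheetDocument.Open(package))
using (OpenXmlWriter writer = OpenXmlWriter.Create(document.WorkbookPart))
{
    writer.WriteStartElement(new Workbook());
    writer.WriteStartElement(new Sheets());
    writer.WriteElement(new Sheet { Id = worksheetPartId, SheetId = 1, Name = "Test" });
    writer.WriteEndElement(); // Sheets
    writer.WriteEndElement(); // Workbook
}
```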
Here's more of a working sample for .NET Core that should get people started on the workaround. I just wrote out a 500,000-row dataset and the memory footprint stayed pretty low. There are a few places where some extension methods are used (like SafeLeft); you can remove those and put in what you need (GetCell in particular isn't super clean; keep in mind this is a proof of concept). What you'll really be interested in is the order of operations in the `ToFile` static method, which follows the outline in the bulleted list in my last comment.

The biggest limitation I can see is that you can only write one large sheet to the document (after that, ReadWrite is required; I could never get a second write-only stream to work).
https://gist.github.com/blakepell/8fe938624f1dad8c28ff93a334687d77
Thanks @blakepell, this is very helpful. Unfortunately, doing Write instead of Read/Write would require a huge refactoring (months) for our use case. So our hopes is still that the root issue here could be fixed. @twsouthwick is there any news on the matter?
One other note: with the workaround approach I posted, you can add additional sheets later in `ReadWrite` mode; they just can't be SAX-based (as far as I can see). I just tested adding a second sheet with a few rows in the traditional way after the initial large dataset, and the memory footprint only grew by the size of that second sheet (again, that requires re-opening the Excel document in `ReadWrite`).
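For illustration, adding that small second sheet "the traditional way" (DOM-based, after reopening the file read/write) might look roughly like this; the sheet name, cell contents, and `SheetId` are made up for the example:

```csharp
using (SpreadsheetDocument document = SpreadsheetDocument.Open(filePath, isEditable: true))
{
    WorkbookPart wbPart = document.WorkbookPart;

    // Build the small sheet entirely in the DOM; only this part's
    // contents occupy memory, not the big SAX-written sheet.
    WorksheetPart newPart = wbPart.AddNewPart<WorksheetPart>();
    newPart.Worksheet = new Worksheet(
        new SheetData(
            new Row(
                new Cell { CellValue = new CellValue("summary"), DataType = CellValues.String })));

    // Register it in the workbook's sheet list with an unused SheetId.
    Sheets sheets = wbPart.Workbook.GetFirstChild<Sheets>();
    sheets.Append(new Sheet
    {
        Id = wbPart.GetIdOfPart(newPart),
        SheetId = 2,
        Name = "Summary",
    });

    wbPart.Workbook.Save();
}
```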
@blakepell's solution gave us an error ("Cannot retrieve parts of writeonly container") until we updated to the latest version of the `System.IO.Packaging` NuGet package.
Any updates on this?
Is there any update on when this will be fixed? My .NET Core migration for a few projects has been on hold for the last two years due to this issue.
I need to generate a large Excel file, and it throws the exception "Stream was too long".
Well, I was able to fix it, but I don't remember exactly what I did. What I can say is that back then, I fixed this issue by doing something like this: for some reason, if you first write the data to a WorksheetPart, close the stream, and then write the Sheet attributing the WorksheetPart to it, it runs a lot smoother.

Note that I'm using System.IO.Packaging there. As far as I read (and remember), there were some issues with OpenXML's internal packaging on .NET 5 and 6, and some people showed me this workaround using the native System.IO.Packaging; we still need to verify whether it's broken on .NET 7, though.

Doing this, I was able to generate an Excel file of approximately 1M rows using 1-2 GiB of RAM, with high CPU usage, but a LOT faster. Also note that I'm using an IAsyncEnumerable<T> to stream data to the file; I was using RabbitMQ to stream batches of 1000 lines per message, so it's a good idea to stream from the database as well. Here's the code I used in the past; see if it works for you, feel free to make your own modifications, and let me know if you need anything.
```csharp
/// <summary>
/// Writes .xlsx file to a given path using the provided data stream.
/// </summary>
/// <param name="filePath"></param>
/// <param name="columns"></param>
/// <param name="dataStream"></param>
public async ValueTask GenerateFile(
    string filePath,
    IDictionary<string, string> columns,
    IAsyncEnumerable<IEnumerable<T>> dataStream)
{
    var worksheetPartId = await WriteInitialDocument(filePath, columns, dataStream);
    await WriteSheetToDocumentAsync(filePath, worksheetPartId);
}

private async Task<string> WriteInitialDocument(
    string filePath,
    IDictionary<string, string> columns,
    IAsyncEnumerable<IEnumerable<T>> dataStream)
{
    await using var fileStream = File.Create(filePath);
    using var package = Package.Open(fileStream, FileMode.Create, FileAccess.Write);
    using var excel = SpreadsheetDocument.Create(package, SpreadsheetDocumentType.Workbook);

    excel.AddWorkbookPart();
    ApplyStyles(excel);

    var worksheetPart = excel.WorkbookPart.AddNewPart<WorksheetPart>();
    var workbookId = excel.WorkbookPart.GetIdOfPart(worksheetPart);

    await WriteStreamDataToFileAsync(columns, dataStream, worksheetPart);

    return workbookId;
}

/// <summary>
/// Finishes the process of writing data files.
/// </summary>
/// <param name="filePath"></param>
/// <param name="worksheetPartId"></param>
private static async Task WriteSheetToDocumentAsync(string filePath, string worksheetPartId)
{
    await using var fileStream = File.Open(filePath, FileMode.Open, FileAccess.ReadWrite, FileShare.None);
    using var package = Package.Open(fileStream, FileMode.Open, FileAccess.ReadWrite);
    using var excel = SpreadsheetDocument.Open(package);

    if (excel.WorkbookPart is null)
        throw new InvalidOperationException("Workbook part cannot be null!");

    var xmlWriter = OpenXmlWriter.Create(excel.WorkbookPart);
    xmlWriter.WriteStartElement(new Workbook());
    xmlWriter.WriteStartElement(new Sheets());
    xmlWriter.WriteElement(new Sheet { Id = worksheetPartId, Name = "Sheet 1", SheetId = 1 });

    // For Sheets
    xmlWriter.WriteEndElement();
    // For Workbook
    xmlWriter.WriteEndElement();

    xmlWriter.Close();
    xmlWriter.Dispose();
}

/// <summary>
/// Streams and writes data to file.
/// </summary>
/// <param name="columns"></param>
/// <param name="dataStream"></param>
/// <param name="worksheetPart"></param>
private async Task WriteStreamDataToFileAsync(
    IDictionary<string, string> columns,
    IAsyncEnumerable<IEnumerable<T>> dataStream,
    OpenXmlPart worksheetPart)
{
    using var xmlWriter = OpenXmlWriter.Create(worksheetPart);
    xmlWriter.WriteStartElement(new Worksheet());
    xmlWriter.WriteStartElement(new SheetData());

    var rowIndex = 1;
    var attributes = new List<OpenXmlAttribute> { new OpenXmlAttribute("r", string.Empty, rowIndex.ToString()) };
    xmlWriter.WriteStartElement(new Row(), attributes);

    var columnList = columns.Values.ToList();
    foreach (var column in columnList)
    {
        var cellValue = column.StripHexadecimalSymbols();
        var cell = new Cell { CellValue = new CellValue(cellValue), StyleIndex = 1, DataType = CellValues.String };
        xmlWriter.WriteElement(cell);
    }

    // For Header Row
    xmlWriter.WriteEndElement();

    var columnKeysList = columns.Keys.ToList();
    await foreach (var itemEnumerable in dataStream)
    {
        var itemList = itemEnumerable.ToList();
        foreach (var t in itemList)
        {
            rowIndex += 1;

            // Starts a new row.
            xmlWriter.WriteStartElement(
                new Row(),
                new List<OpenXmlAttribute> { new("r", string.Empty, rowIndex.ToString()) });

            foreach (var columnKey in columnKeysList)
            {
                var cellValue = _cellFormatterStrategy.GetCellValue(ReportFormats.Xlsx, t, columnKey);
                var cell = _cellFormatterStrategy.GetXlsxCell(columnKey, cellValue);
                xmlWriter.WriteElement(cell);
            }

            // Closes the row.
            xmlWriter.WriteEndElement();
        }
    }

    // For SheetData
    xmlWriter.WriteEndElement();
    // For Worksheet
    xmlWriter.WriteEndElement();
}
```
I'd like to get this fixed (or at least have a workaround) for v3.0. I've created a set of abstractions that allow for more control over things, and I think we could automate some of the workarounds here (at least in an opt-in way). For the abstractions, see: #1295.
My thoughts would be to model what @M4urici0GM did, but in a more transparent way. Of course, it would be better to have this fixed in the underlying package model, but that hasn't gone anywhere in too many years.
My initial thoughts on implementing this would be:

1. Provide an abstraction of IPackage that intercepts calls to GetStream and writes them to some temporary location.
2. On save, first save the package as normal.
3. Then reopen the package in write-only mode (this should allow replacing things without the explosion of memory).
4. Write the streams from the temporary location.
5. Close the package again and reopen with the original mode/access.
The abstractions I have should allow building this, except we'd need a way to "Reload" the underlying package. Building off of the abstractions, I'm thinking of enabling the following:
```diff
 public interface IPackageFeature
 {
     IPackage Package { get; }
+    bool CanReload { get; }
+    void Reload();
 }
```
This could automatically be supported for files opened with paths or streams, but if a package is given, then it would not be supported (since we didn't manage the package) without additional information from a user.
I made a hacky solution for this. It's obviously not production-ready, but the idea itself, I think, is good. Instead of creating the `Cell` object from the start, we create it on the fly and drop it right after use. I made a PoC repo for this:

https://github.com/pre-alpha-final/openxml-memory-usage-hack

The results are promising. I went from 6.1 GB of RAM to 2.1 GB, of which around 300 MB+ is just the data itself.

EDIT: I ran it with @Viktor-36's parameters from the original post (100k rows, 40 cells each, with text "test"). The standard save used up 2 GB of RAM; the hacky save used 321 MB.
I created branches with support for the common data types, and one that takes the `Cell` creation method as a Func. This way you're not stuck with the default mapping, making this less of a hack and more of a valid workaround.
If anyone is facing the same problem nowadays: the workaround by @blakepell doesn't work in newer releases. After some research I found that it works with `v2.16.0` (at least); memory usage in my case (50k records with 30 columns) dropped from 100 MB to 5 MB. I hope `v3.1` will be able to solve that problem as well.
@twsouthwick Will this issue be resolved in v3.0? When can we expect v3.0? We tested with the v3 beta, and the issue still exists.
Unfortunately, I wasn't able to get it to work and had other priorities. I'll move this to v3.1 to see what can be done
I was wondering why blakepell's workaround no longer works in OpenXml 2.20 and up. It seems a validation check was added that prevents the use of `SpreadsheetDocument.Create(some package with write-only access, SpreadsheetDocumentType.Workbook)`.
I tried a piece of blakepell's workaround, but I came across a validation check that throws if the zip package has no read access. In v2.20.0 (https://github.com/dotnet/Open-XML-SDK/blob/release/v2.20.0/src/DocumentFormat.OpenXml/Packaging/OpenXmlPackage.cs):

```csharp
if (package.FileOpenAccess == FileAccess.Write)
{
    throw new OpenXmlPackageException(ExceptionMessages.PackageMustCanBeRead);
}
```
Or, if I try OpenXml 3.0.0, Package.cs checks and throws with `ThrowIfWriteOnly`:

```csharp
private PackageRelationshipCollection GetRelationshipsHelper(string filterString)
{
    this.ThrowIfObjectDisposed();
    this.ThrowIfWriteOnly();
    this.EnsureRelationships();
    return new PackageRelationshipCollection(this._relationships, filterString);
}
```
Why was this validation check added when it wasn't there in previous versions? Would removing that check in newer versions of OpenXml cause other issues?
@carlossanlop Would you have any thoughts on my comment above? https://github.com/dotnet/Open-XML-SDK/issues/807#issuecomment-1894091137
This seems related to the following dotnet issues, which have been open for a long time: https://github.com/dotnet/runtime/issues/1543 and https://github.com/dotnet/runtime/issues/1544. Do you have any update on these issues and when they can move forward?

These have been causing memory usage issues with MemoryStream whenever we generate a large Excel file using OpenXml, which in turn uses System.IO.Packaging (and, below that, System.IO.Compression, whose OpenInUpdateMode method in ZipArchiveEntry.cs keeps uncompressed data in a MemoryStream). We're hoping this might be resolved on the .NET side.