Split assembly in separate packages for DOCX / XLSX / PPTX

Open m-gallesio opened this issue 2 years ago • 11 comments

Per https://github.com/dotnet/Open-XML-SDK/issues/387

Consider breaking up library into smaller ones for Word/Powerpoint/Excel.

As well as my comment: https://github.com/dotnet/Open-XML-SDK/issues/387#issuecomment-1140978211

Yes, please. Having these separated (along with a base / core module, I assume) would improve clarity, make for a smaller base package (currently ~5.8 MB) and allow more code splitting between dependent assemblies. For reference our product does not currently use Powerpoint at all.

I know this topic is in the backlog, but considering its structural impact I think it is worth having a dedicated issue, at least to improve its visibility and prevent it from getting buried.

(I have searched for similar issues but could not find any; if this is indeed explicitly tracked elsewhere, please close / delete this issue.)

m-gallesio avatar Jan 18 '23 13:01 m-gallesio

@m-gallesio this has been a long time in the making. I've currently separated out the core SDK infrastructure and enabled the typed classes to be in a separate assembly from all of that. I'm currently working on stabilizing it, but as that occurs, this becomes possible.

@tomjebo This is something we should start looking at from the schema perspective. Can we update the generator to keep these isolated? I know there are a few types that are explicitly shared between the types for some reason (I think similar names) that we would need to separate out.

twsouthwick avatar Jan 18 '23 23:01 twsouthwick

@m-gallesio @czemacleod If there's interest here, the code generator has been moved to this repo. It may be able to isolate the types required for each doc type using some sort of tree-walking. Happy to take any pull requests or have a discussion of how to start tackling this if anyone is interested in contributing!
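As a rough illustration of the tree-walking idea, the sketch below collects every type reachable from a document root with a breadth-first walk. It is written in Python only for brevity. The graph `TYPE_REFERENCES` and the function `reachable_types` are hypothetical and are not taken from the actual generator; the real input would come from the schema data.

```python
from collections import deque

# Toy type-reference graph: each type maps to the types it can contain.
# This is illustrative data, not the SDK's real schema.
TYPE_REFERENCES = {
    "Document": ["Body"],
    "Body": ["Paragraph", "Table"],
    "Paragraph": ["Run"],
    "Run": [],
    "Table": ["Paragraph"],
    "Workbook": ["Sheets"],
    "Sheets": ["Sheet"],
    "Sheet": [],
}


def reachable_types(graph, root):
    """Return the set of types reachable from root, including root itself."""
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in graph.get(current, []):
            if child not in seen:  # skip types already visited (handles cycles)
                seen.add(child)
                queue.append(child)
    return seen


word_types = reachable_types(TYPE_REFERENCES, "Document")
excel_types = reachable_types(TYPE_REFERENCES, "Workbook")
```

Types that appear in more than one of the resulting sets would be candidates for a shared library.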

twsouthwick avatar Jan 18 '23 23:01 twsouthwick

@twsouthwick I'm not sure I really have time to delve that far into this project unfortunately. My reasoning is that we use this to build Excel files for reports and 'datagrids' in our web applications (as an export to Excel feature). We also use it in a couple of applications to do data import from Excel. The point is that I have no need for any other types (other than common/base types) outside that scope, and if we could reduce the memory footprint of the application, all the better. Not for ourselves especially, but this would also make a lot of sense in a microservices scenario, e.g. generating Word documents from templates, or Excel spreadsheets, or something similar.

CZEMacLeod avatar Jan 23 '23 19:01 CZEMacLeod

I've added this to the v3.0 milestone so it can be tracked there - otherwise, it will probably wait until v4.0 (whenever that ends up happening).

twsouthwick avatar Feb 03 '23 17:02 twsouthwick

I have tried to follow the generation pipeline without much success (is there some kind of guide / reference?), but it seems the input for the generator is the data directory.

Files in data/schemas are already split by XML namespace. Files in data/parts seem to work differently, but have ContentType and RelationshipType fields which seem interesting for this purpose.

Naïvely, a valid idea seems to be having somewhere, somehow a mapping between namespaces (either original XML or generated C#) and assemblies; e.g. "all files in the Wordprocessing namespace should be in the Wordprocessing assembly".
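A minimal sketch of such a mapping, in Python for brevity. The assembly names on the right-hand side and the `Shared` fallback are assumptions made for illustration only; nothing here reflects how the generator is actually configured.

```python
# Hypothetical mapping from generated C# namespace prefix to target assembly.
# The assembly names are made up for this example.
PREFIX_TO_ASSEMBLY = {
    "DocumentFormat.OpenXml.Wordprocessing": "Wordprocessing",
    "DocumentFormat.OpenXml.Spreadsheet": "Spreadsheet",
    "DocumentFormat.OpenXml.Presentation": "Presentation",
}

# Assumed fallback for namespaces that belong to no single document type.
FALLBACK_ASSEMBLY = "Shared"


def assembly_for(namespace):
    """Pick the assembly whose namespace prefix is the longest match."""
    best = None
    for prefix in PREFIX_TO_ASSEMBLY:
        # Match whole namespace segments only, not arbitrary string prefixes.
        if namespace == prefix or namespace.startswith(prefix + "."):
            if best is None or len(prefix) > len(best):
                best = prefix
    return PREFIX_TO_ASSEMBLY[best] if best else FALLBACK_ASSEMBLY
```

With this table, `assembly_for("DocumentFormat.OpenXml.Wordprocessing")` gives `"Wordprocessing"`, while a namespace such as `DocumentFormat.OpenXml.Drawing` falls through to the shared assembly.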

m-gallesio avatar Feb 11 '23 19:02 m-gallesio

Thanks @m-gallesio for taking a look.

Naïvely, a valid idea seems to be having somewhere, somehow a mapping between namespaces (either original XML or generated C#) and assemblies; e.g. "all files in the Wordprocessing namespace should be in the Wordprocessing assembly".

That's what I'm thinking as well. However, there may be some shared types/namespaces (e.g. DrawingML) that may have to go in an additional library. If you can come up with a list of those, that would be a great starting point :)
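One way such a list could be produced, sketched in Python for brevity: record which namespaces each document type uses, then keep the ones used by more than one. The `NAMESPACES_USED` data is a made-up illustration, not a measurement of the real schemas, and `shared_namespaces` is a hypothetical helper.

```python
from collections import Counter

# Illustrative input only: namespaces referenced by each document type.
# Real data would have to be derived from the schema files.
NAMESPACES_USED = {
    "Wordprocessing": {"Wordprocessing", "Drawing"},
    "Spreadsheet": {"Spreadsheet", "Drawing"},
    "Presentation": {"Presentation", "Drawing"},
}


def shared_namespaces(usage):
    """Return, sorted, the namespaces used by more than one document type."""
    counts = Counter(ns for used in usage.values() for ns in set(used))
    return sorted(ns for ns, n in counts.items() if n > 1)


candidates = shared_namespaces(NAMESPACES_USED)
```

Anything returned by `shared_namespaces` would be a candidate for the additional library.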

twsouthwick avatar Feb 23 '23 00:02 twsouthwick

I looked into this and identified a few things that would be necessary:

  • [ ] A single generated file is used for all the root part generation - this would need to be separated out to be per-document type
  • [ ] TypedOpenXmlPartReader was introduced in 2.19 as a type that provides the typed factories for use of OpenXmlPartReader with just a stream. This would need to be replaced with a doc specific one

twsouthwick avatar Mar 01 '23 02:03 twsouthwick

Looking even more into this, I don't know if we necessarily want to touch DocumentFormat.OpenXml. Instead, we can start generating a second set of types that are structured a bit better to allow this. That way, we won't break anything in the original usage.

This would also allow #1278 to rationalize the namespace hierarchy of the generated types, as well as automating migration to the new layout with either analyzers/code fixers or Upgrade Assistant.

Also, by doing this, we can do it at any point in the 3.x timeline and it won't be a breaking change.

twsouthwick avatar Mar 09 '23 23:03 twsouthwick

Glad to see this is moving forward even if I am not able to follow the internals. If I may deliver a small rant, another point I have stumbled into today which would be helped by this split is the existence of several homonymous classes across different namespaces. Accidentally (or automatically via IDE) importing Paragraph from DocumentFormat.OpenXml.Spreadsheet while dealing mainly with Wordprocessing can be quite annoying.

m-gallesio avatar Mar 14 '23 15:03 m-gallesio

Yeah, long term, I hope to make the source generators we're using public and then allow you to customize the names how you'd like. Something akin to how CsWin32 allows you to specify what you care about and it will just add what you need. We're a ways off from that, though.

However, there are names that are reused in different namespaces by the schema definitions, so this just is an artifact of that.

twsouthwick avatar Mar 28 '23 22:03 twsouthwick

I'm removing this from v3.0 as we can tackle it after, but once v3.0 is stabilized, it'll probably be the next thing I focus on.

twsouthwick avatar Mar 28 '23 23:03 twsouthwick