templating
Improve processing of different file formats when generating content
Background
Currently, when processing files, the template engine treats every file (unless specified as copyOnly) as:
- textual
- a candidate for replacements and other operations
This causes performance issues and occasional errors during generation when the content of a file happens to match a replacement.
Suggestion
Improve detection of the file format and apply rules based on it, including but not limited to:
- enrich the defaults for "copy-only"; for example, web templates often include libs
- if the file is detected as binary, use "copy-only" mode by default
- if the file is a well-known textual type, process operations in it
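The rules above could be combined roughly as in the following sketch. It is only an illustration in Python, not the template engine's actual code: the `Mode` enum, the glob defaults, the extension list, and the `choose_mode` function are all hypothetical names, and the precedence order (copy-only globs, then known text types, then binary detection) is an assumption.

```python
from enum import Enum
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


class Mode(Enum):
    COPY_ONLY = "copy-only"
    PROCESS = "process"


# Hypothetical defaults, for illustration only.
DEFAULT_COPY_ONLY_GLOBS = ("*/node_modules/*", "*/wwwroot/lib/*", "*.min.js")
KNOWN_TEXT_EXTENSIONS = {".cs", ".csproj", ".json", ".xml", ".md", ".txt"}


def choose_mode(path: str, looks_binary: bool) -> Mode:
    """Pick a processing mode for one template file.

    `looks_binary` is the result of whatever binary detection is used.
    """
    # Normalize separators and add a leading slash so "*/dir/*" globs
    # also match directories at the template root.
    normalized = "/" + path.replace("\\", "/").lstrip("/")

    # 1. Enriched copy-only defaults win over everything else.
    if any(fnmatchcase(normalized, glob) for glob in DEFAULT_COPY_ONLY_GLOBS):
        return Mode.COPY_ONLY

    # 2. Well-known textual types are always processed.
    if PurePosixPath(normalized).suffix.lower() in KNOWN_TEXT_EXTENSIONS:
        return Mode.PROCESS

    # 3. Files detected as binary are copied as-is.
    if looks_binary:
        return Mode.COPY_ONLY

    # 4. Anything else keeps the current behavior (assumption).
    return Mode.PROCESS
```

Checking the extension before the binary flag means a known textual type never needs to be probed at all, which may help with the performance concern from the background section.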
Ways for detection:
- probe the first bytes of the file to detect binary content
- https://www.nuget.org/packages/EmptyFiles
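As a rough illustration of the "probe first bytes" idea, the sketch below checks a small sample from the start of a file. It is written in Python for brevity and is not how EmptyFiles or the template engine does it: the magic-number list is a small, non-exhaustive assumption, and `looks_binary` and `probe_file` are hypothetical names.

```python
# Assumption: a handful of common signatures, far from exhaustive.
MAGIC_PREFIXES = (
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"\xff\xd8\xff",       # JPEG
    b"GIF8",               # GIF
    b"PK\x03\x04",         # ZIP (also nupkg, docx, jar)
    b"%PDF-",              # PDF
)

# Byte order marks that indicate Unicode text (UTF-8, UTF-16 LE/BE).
TEXT_BOMS = (b"\xef\xbb\xbf", b"\xff\xfe", b"\xfe\xff")


def looks_binary(head: bytes) -> bool:
    """Guess whether a file is binary from its first bytes."""
    if not head:
        return False  # treat empty files as text
    if head.startswith(TEXT_BOMS):
        return False  # a BOM signals text, even if NUL bytes follow
    if head.startswith(MAGIC_PREFIXES):
        return True
    # Common fallback: a NUL byte rarely appears in BOM-less text files.
    return b"\x00" in head


def probe_file(path: str, sample_size: int = 8192) -> bool:
    """Read only the first `sample_size` bytes and classify them."""
    with open(path, "rb") as f:
        return looks_binary(f.read(sample_size))
```

The NUL-byte fallback misclassifies UTF-16 or UTF-32 text without a BOM as binary, so this check is best treated as a cheap first pass rather than a final answer.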
Consider applying the same rules to scrubbing in the template testing framework.
Relevant comment https://github.com/dotnet/templating/issues/5789#issuecomment-1357469106 by @JanKrivanek:
As for distinguishing textual and binary files: it's normally not that easy. It is usually done by performing character frequency analysis for various byte encoding lengths over a sample of the file, and deciding whether an encoding seems to match based on a low occurrence of nonprintable chars. But there are a couple of simplifying options (we can combine them):
- We can call or borrow the IsText functionality from VerifyTests/EmptyFiles
- We can leverage existing third-party libraries that solve this problem: https://github.com/tylerlong/ude, https://github.com/superstrom/chardetsharp etc.
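For reference, the frequency-analysis approach described in the comment can be sketched in a much simplified form. This Python sketch is not what ude, chardetsharp, or EmptyFiles implement: the candidate encoding list, the 5% threshold, and the function names are assumptions chosen for illustration.

```python
import unicodedata

# Assumption: a tiny candidate set covering 1-byte and 2-byte code units.
CANDIDATE_ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be")

# Control characters that are normal in text files.
ALLOWED_CONTROLS = "\t\n\r\f"

# Unicode categories counted as nonprintable: control, unassigned,
# private use, surrogate.
NONPRINTABLE_CATEGORIES = ("Cc", "Cn", "Co", "Cs")


def nonprintable_ratio(text: str) -> float:
    """Fraction of characters that are nonprintable."""
    if not text:
        return 0.0
    bad = sum(
        1
        for ch in text
        if ch not in ALLOWED_CONTROLS
        and unicodedata.category(ch) in NONPRINTABLE_CATEGORIES
    )
    return bad / len(text)


def guess_text_encoding(sample: bytes, threshold: float = 0.05):
    """Return the first candidate encoding under which the sample looks
    like text, or None if no candidate fits (likely binary)."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            text = sample.decode(encoding)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        if nonprintable_ratio(text) <= threshold:
            return encoding
    return None
```

For example, `guess_text_encoding("hello".encode("utf-16-le"))` rejects UTF-8 because half of the decoded characters are NUL, then accepts `"utf-16-le"`. A sample cut in the middle of a multi-byte character can fail to decode, and arbitrary binary data can still decode to plausible-looking UTF-16, which is why real detectors use far more elaborate statistics.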