templating Improve processing of different file formats when generating content

Open vlada-shubina opened this issue 2 years ago • 1 comments

Background

Currently, when processing files, the template engine treats every file (unless it is specified as copyOnly) as:

textual
a target for replacements and other operations

This causes performance issues and occasional errors during content generation when a file's content happens to match a replacement.

Suggestion

Improve detection of the file format and apply processing rules based on it. This includes, but is not limited to:

enrich the defaults for "copy-only"; for example, web templates often include libraries
if the file is detected as binary, use "copy-only" mode by default
if the file is a well-known textual type, process operations in it
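
The three rules above could be combined roughly as follows. This is only a sketch in Python to show the order of the checks, not the template engine's code: `Mode`, `choose_mode`, `COPY_ONLY_DIR_NAMES` and `TEXT_EXTENSIONS` are hypothetical names, and the directory and extension lists are placeholder assumptions.

```python
from enum import Enum
from pathlib import PurePosixPath


class Mode(Enum):
    COPY_ONLY = "copy-only"
    PROCESS = "process"


# Hypothetical defaults; a real list would need to be agreed on.
COPY_ONLY_DIR_NAMES = {"node_modules", "lib"}
TEXT_EXTENSIONS = {".cs", ".csproj", ".json", ".md", ".xml"}


def choose_mode(path: str, looks_binary: bool) -> Mode:
    """Pick a processing mode for one template file.

    `looks_binary` is the result of whatever binary detection is used.
    """
    p = PurePosixPath(path)
    # Rule 1: enriched copy-only defaults (e.g. bundled libraries).
    if any(part in COPY_ONLY_DIR_NAMES for part in p.parts[:-1]):
        return Mode.COPY_ONLY
    # Rule 3: well-known textual types are always processed.
    if p.suffix.lower() in TEXT_EXTENSIONS:
        return Mode.PROCESS
    # Rule 2: files detected as binary are copied as-is.
    if looks_binary:
        return Mode.COPY_ONLY
    # Fallback: keep today's behavior and treat the file as textual.
    return Mode.PROCESS
```

Checking the extension before the binary flag is a design choice in this sketch: for well-known types the file would not need to be probed at all.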

Ways for detection:

probe the first bytes of the file to detect binary content
https://www.nuget.org/packages/EmptyFiles
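
A minimal sketch of the "probe first bytes" option, in Python for illustration. It assumes a simple heuristic (a NUL byte in the sample means binary, unless the file starts with a Unicode byte order mark); it does not use EmptyFiles, and `looks_binary`, `probe_file` and `SAMPLE_SIZE` are hypothetical names and values.

```python
SAMPLE_SIZE = 8192  # assumed sample size, not a value from the issue

# Byte order marks: UTF-16/UTF-32 text legitimately contains NUL bytes.
_BOMS = (
    b"\xef\xbb\xbf",       # UTF-8
    b"\xff\xfe\x00\x00",   # UTF-32 LE
    b"\x00\x00\xfe\xff",   # UTF-32 BE
    b"\xff\xfe",           # UTF-16 LE
    b"\xfe\xff",           # UTF-16 BE
)


def looks_binary(sample: bytes) -> bool:
    """Return True if the leading bytes of a file look like binary data."""
    if not sample:
        return False  # empty files are harmless to treat as text
    if sample.startswith(_BOMS):
        return False  # a BOM marks the file as Unicode text
    return b"\x00" in sample


def probe_file(path: str, sample_size: int = SAMPLE_SIZE) -> bool:
    """Read only the first `sample_size` bytes and classify them."""
    with open(path, "rb") as f:
        return looks_binary(f.read(sample_size))
```

Only the head of the file is read, so the check stays cheap even for large assets. UTF-16 text without a BOM would be misclassified as binary by this sketch.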

Consider applying the same rules in the template testing framework for scrubbing.

Dec 19 '22 09:12 vlada-shubina

Relevant comment https://github.com/dotnet/templating/issues/5789#issuecomment-1357469106 by @JanKrivanek

As for distinguishing textual and binary files: it's normally not that easy. It is usually done by performing character frequency analysis for various byte encoding lengths over a sample of the file and deciding whether an encoding seems to match based on a low occurrence of non-printable characters. But there are a couple of simplifying options (we can combine them):

We can call or borrow the IsText functionality from VerifyTests/EmptyFiles
We can leverage existing third-party libraries built for this purpose: https://github.com/tylerlong/ude, https://github.com/superstrom/chardetsharp etc.
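
For reference, the non-printable-character idea described above can be sketched like this. It is a simplified Python illustration, not what EmptyFiles, ude or chardetsharp actually do; the candidate encodings, the 5% threshold and the function names are all assumptions.

```python
def nonprintable_ratio(text: str) -> float:
    """Fraction of characters that are neither printable nor common whitespace."""
    if not text:
        return 0.0
    bad = sum(1 for ch in text if not (ch.isprintable() or ch in "\t\r\n"))
    return bad / len(text)


def guess_text_encoding(sample, candidates=("utf-8", "utf-16-le", "utf-16-be"),
                        threshold=0.05):
    """Return the first candidate encoding under which the sample looks like
    text, or None if the sample looks binary under all of them."""
    for encoding in candidates:
        try:
            decoded = sample.decode(encoding)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        if nonprintable_ratio(decoded) <= threshold:
            return encoding
    return None
```

Note that almost any even-length byte sequence decodes as UTF-16, so this sketch can accept binary data as text; a real detector would also weigh how plausible the decoded characters are.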

Dec 19 '22 12:12 vlada-shubina
