Improve processing of different file formats when generating content

Open vlada-shubina opened this issue 2 years ago • 1 comments

Background

Currently, when processing files, the template engine treats every file (unless it is specified as copyOnly) as:

  • textual
  • a target for replacements and other operations

This causes performance issues and occasional errors during generation when the content of a file happens to match a replacement.

Suggestion

Improve detection of the file format and the rules based on it, including but not limited to:

  • enrich the defaults for "copy-only"; for example, web templates often include bundled libraries
  • if the file is detected as binary, use "copy-only" mode by default
  • if the file is a well-known textual type, process operations in it
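
The three rules above could be combined roughly as follows. This is only a sketch in Python to show the decision order, not the engine's implementation: `choose_mode`, the directory names, and the extension list are hypothetical, and the order in which the rules are checked is an assumption.

```python
from pathlib import PurePosixPath

# Hypothetical defaults for illustration; not the engine's actual lists.
COPY_ONLY_DIR_NAMES = {"lib", "node_modules"}
KNOWN_TEXT_EXTENSIONS = {".cs", ".csproj", ".json", ".xml", ".md", ".txt"}


def choose_mode(relative_path: str, head: bytes) -> str:
    """Return "copy-only" or "process" for one template file.

    `head` is the first chunk of the file's raw bytes.
    """
    path = PurePosixPath(relative_path)
    # Rule 1: enriched copy-only defaults, e.g. bundled web libraries.
    if any(part in COPY_ONLY_DIR_NAMES for part in path.parts[:-1]):
        return "copy-only"
    # Rule 2: well-known textual types are processed.
    if path.suffix.lower() in KNOWN_TEXT_EXTENSIONS:
        return "process"
    # Rule 3: a NUL byte is used here as a crude "looks binary" signal.
    if b"\x00" in head:
        return "copy-only"
    # Fallback: keep the current behaviour and treat the file as text.
    return "process"
```

An explicit copyOnly setting in the template configuration would presumably still take precedence over any of these defaults.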

Ways for detection:

  • probe the first bytes of the file to detect binary content
  • https://www.nuget.org/packages/EmptyFiles
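
A minimal sketch of the byte-probing idea, in Python for illustration. It assumes that a NUL byte in the first few kilobytes means "binary" unless the file starts with a Unicode byte order mark; the function names and the 8192-byte sample size are made up for this example and are not how EmptyFiles or the template engine work.

```python
import codecs

# UTF-16/UTF-32 text legitimately contains NUL bytes, so a BOM is
# checked first to avoid misclassifying such files as binary.
TEXT_BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
)


def looks_binary(head: bytes) -> bool:
    """Guess whether a file is binary from its leading bytes."""
    if not head:
        return False  # empty file: nothing to replace either way
    if head.startswith(TEXT_BOMS):
        return False
    return b"\x00" in head


def probe_file(path: str, sample_size: int = 8192) -> bool:
    """Read only the first `sample_size` bytes and apply the heuristic."""
    with open(path, "rb") as f:
        return looks_binary(f.read(sample_size))
```

The check is cheap because it never reads past the sample, but it will miss binary formats without NUL bytes near the start and UTF-16 text saved without a BOM.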

Consider applying the same rules in the template testing framework for scrubbing.

vlada-shubina, Dec 19 '22 09:12

Relevant comment by @JanKrivanek: https://github.com/dotnet/templating/issues/5789#issuecomment-1357469106

As for distinguishing textual and binary files, it's normally not that easy (it is usually done by performing character frequency analysis for various encoding byte lengths over a sample of the file and deciding whether the encoding seems to match based on a low occurrence of nonprintable chars). But there are a couple of simplifying options (we can combine them):

  • We can call or borrow IsText functionality from VerifyTests/EmptyFiles
  • We can leverage existing third-party libraries built for this purpose: https://github.com/tylerlong/ude, https://github.com/superstrom/chardetsharp etc.
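
For comparison, the frequency-analysis approach described above might look roughly like this. It is a simplified Python sketch, not the algorithm used by any of the libraries mentioned: the candidate encodings, the 5% threshold, and the function names are all assumptions.

```python
import unicodedata

# Assumed candidates covering 1-, 2- and 4-byte code units.
CANDIDATE_ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")
ALLOWED_CONTROLS = "\t\n\r"
# Control, unassigned, private-use and surrogate code points.
NONPRINTABLE_CATEGORIES = ("Cc", "Cn", "Co", "Cs")


def nonprintable_ratio(text: str) -> float:
    """Fraction of characters that are nonprintable."""
    if not text:
        return 0.0
    bad = sum(
        1
        for ch in text
        if ch not in ALLOWED_CONTROLS
        and unicodedata.category(ch) in NONPRINTABLE_CATEGORIES
    )
    return bad / len(text)


def is_probably_text(sample: bytes, threshold: float = 0.05) -> bool:
    """True if some candidate encoding decodes the sample with few
    nonprintable characters."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            decoded = sample.decode(encoding)
        except UnicodeDecodeError:
            continue  # this encoding does not match at all
        if nonprintable_ratio(decoded) <= threshold:
            return True
    return False
```

One known limitation of this sketch: a sample cut in the middle of a multi-byte character fails to decode, so a real implementation would need to tolerate a truncated tail.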

vlada-shubina, Dec 19 '22 12:12