smoke icon indicating copy to clipboard operation
smoke copied to clipboard

Regex support

Open jonaprieto opened this issue 2 years ago • 16 comments

Hey

I didn't see it any support for testing the output against a regex expression in the documentation. Sorry if I miss it, but I consider it essential. One of the reasons I'm still using ShellTestRunner.

jonaprieto avatar Aug 03 '22 09:08 jonaprieto

I think this is an excellent idea.

Could you give me an example of the kind of thing you need to run a regex over? Please be as specific as possible: it would be helpful in designing such a feature to have some example input, the expected output, and the kind of regex you'd need.

SamirTalwar avatar Aug 04 '22 04:08 SamirTalwar

Let's imagine I want to check the stderr output of a given command containing a particular pattern.

For example, let's say I want to check if an error in a specific location occurs while running MyTool. The error given by this tool follows a particular form. Then, I would like to write something like the following where I can check the error happens effectively at line 11, cols 7-8.

command : MyTool
inputFile: tests/negative/MyFile.X
stderr : "(.+)\/([^\/]+)\.X\:11\:7\-8\: error.*"

PS. The error pattern in the example is used by GHC compiler, if I'm not mistaken.

jonaprieto avatar Aug 04 '22 09:08 jonaprieto

That seems quite sensible. You could run a filter over it to remove the unimportant parts but that might make a test failure message very confusing.

How do you feel about this kind of syntax?

command:
  - MyTool

tests:
  - name: negative
    args:
      - tests/negative/MyFile.X
    exit-status: 1
    stderr:
      match: '(.+)/([^/]+)\.X:11:7-8: error.*'

(You can probably get away with fewer escapes if we use Perl-compatible regular expressions and don't need to support Perl- or sed-style syntax.)

You do lose one nice feature, in that you can't show a diff when the match is a regex, but as long as you're using it on short outputs, I don't think that's an issue.

SamirTalwar avatar Aug 04 '22 10:08 SamirTalwar

That syntax seems excellent. Fewer escapes for a regex, the better. Another possible option that may become handy is if we specify if the match has to occur in the whole output or a line. Maybe, something like this. But for now, the initial proposal sounds fantastic!

...
    exit-status: 1
    stderr:
      match: 
        inline: '^(.+)/([^/]+)\.X:11:7-\d+: error.*'
      
...

jonaprieto avatar Aug 04 '22 10:08 jonaprieto

I suggest we make use of PCRE's native functionality. You can use \A and \Z to match the start and end of the whole text ("subject" in the docs), or if multiline mode is enabled, you can use ^ and $ for the start and end of a line.

I suggest a syntax along the lines of:

match: <pattern>

or:

match:
  pattern: <pattern>
  flags:
    - multiline  # or `m`
    - caseless   # or `i`
    - etc.

The flags are specified in the pcrepattern specification.

We won't have a way to match against multiple regular expressions and and the result (Smoke supports or through a choice mechanism, but not and). I think I'm OK with this until someone comes up with a use case.

I can probably take a crack at this sometime in the next week, though I make no guarantees as to when I will finish it. If you'd like to give it a go, you're more than welcome. What would you prefer?

SamirTalwar avatar Aug 04 '22 14:08 SamirTalwar

Just to give you an update: in trying to implement this, I have found a few bugs in edge cases when interpreting the YAML, which are exacerbated by trying to add another type of matcher. So it's taking longer than expected. (I'm also not working on this full-time, as you might have guessed.)

Long story short, probably don't implement even a simple expression language in YAML.

I hope I can get this done soon-ish, but there is of course the family, who come first, and the day job, who pay me, so please don't hold your breath on it happening in the next couple of weeks.

SamirTalwar avatar Aug 14 '22 17:08 SamirTalwar

Understandable. I have similar time constraints and a trip this week, so please don't feel pressured. Maybe you want to share a branch to see the progress or suggest which places in the codebase to look first for anyone to experiment.

jonaprieto avatar Aug 15 '22 12:08 jonaprieto

This boils down to adding a new assertion type in Types/Assert.hs, handling it in Assert.hs, and parsing it in Types/Values.hs.

I have laid a little bit of the groundwork in the regex branch, which might be a decent place to start if you want to give it a shot.

SamirTalwar avatar Aug 15 '22 20:08 SamirTalwar

Thinking about it, we may be able to just ignore the parsing issue that tripped me up. The main issue is that we can write:

stdout:
  contains: "hello"

But we can't write:

stdout:
  content:
    contains: "hello"

This is inconsistent with assertions that use a file. But perhaps we can just live with it for now, as long as we can write matches:.

SamirTalwar avatar Aug 15 '22 20:08 SamirTalwar

I have managed to make the time to sort out those vaguely-related issues.

The next challenge is picking a regex engine. I was hoping to support Perl-compatible regular expressions (PCRE) as they're very powerful and I like them a lot. However, all the implementations require linking to a native library, which (a) means that we have a dynamic dependency we didn't have before, and (b) Windows support will be a nightmare.

I tried getting both regex-pcre and text-icu to build with no luck, but even if they worked, I think asking people to install PCRE or ICU4C on Windows via MSYS2 is a bit much.

I guess I'll be trying regex-pcre-builtin or regexpr next.

SamirTalwar avatar Aug 18 '22 19:08 SamirTalwar

I left this for a few months and when I came back, Windows support magically started working.

I've now added this in #79. Please give it a shot, following the README changes, and let me know if you have any feedback. I'll merge it and ship a new release if it looks good for you (or in a few days if I get no response).

SamirTalwar avatar Nov 13 '22 14:11 SamirTalwar

I ended up reverting this change because I couldn't release; there seems to be no way to use text-icu statically across operating systems and architectures.

Now I'm considering the regex-tdfa library, which supports POSIX regular expressions. They would be more limited. If you would still find this feature useful, could you let me know if that suits your purposes?

SamirTalwar avatar Feb 22 '23 09:02 SamirTalwar

Gonna re-open this. In #100, @ivan19871002 points out that regex-tdfa has some memory consumption issues, so that's not really an option. I will keep looking for a pure Haskell alternative.

SamirTalwar avatar Feb 22 '23 21:02 SamirTalwar

I revert this commit https://github.com/SamirTalwar/smoke/commit/888148daff63efd38bae7334f778c25934963ed1. re-add regex support. works well on all platform (x86_x64 linux , x86 mac, aarch64 linux)

ivan19871002 avatar Mar 06 '23 08:03 ivan19871002

I'm glad it's working well for you.

I have done the same thing and pushed the changes as the regex-icu branch, which you are welcome to use instead. I will support this branch and fix any issues until I can find a better solution, or figure out how to compile things statically on Windows and macOS.

SamirTalwar avatar Mar 06 '23 10:03 SamirTalwar

macOS

brew link icu4c

before compile

export LDFLAGS="-L/usr/local/opt/icu4c/lib" export CPPFLAGS="-I/usr/local/opt/icu4c/include" export PKG_CONFIG_PATH="/usr/local/opt/icu4c/lib/pkgconfig"

then stack install --local-bin-path=./out/build

ivan19871002 avatar Mar 07 '23 03:03 ivan19871002