qsharp Test algorithm samples with and without debugging

trafficstars

This change introduces a scenario test under qsc that verifies the behavior of the algorithm samples with normal and debug execution. It uses fixed seeds to ensure results are predictable and restricts the slowest test (Shors.qs) to be release only.

Included is some cleanup of output from the samples so that compatible characters are used and no trailing spaces are present.

May 16 '24 05:05 swernli

Benchmark for ddf9cd0

Click to view benchmark

Test	Base	PR	%
Array append evaluation	330.6±10.98µs	327.7±2.43µs	-0.88%
Array literal evaluation	173.9±0.75µs	174.2±1.66µs	+0.17%
Array update evaluation	410.9±9.32µs	405.5±4.49µs	-1.31%
Core + Standard library compilation	17.2±0.20ms	17.2±0.36ms	0.00%
Deutsch-Jozsa evaluation	5.0±0.05ms	5.0±0.16ms	0.00%
Large file parity evaluation	34.0±0.08ms	33.9±0.10ms	-0.29%
Large input file compilation	12.0±0.28ms	12.0±0.18ms	0.00%
Large input file compilation (interpreter)	45.4±1.34ms	45.1±1.37ms	-0.66%
Large nested iteration	31.4±0.09ms	32.2±0.13ms	+2.55%
Perform Runtime Capabilities Analysis (RCA) on Deutsch-Jozsa sample	1556.6±40.68µs	1562.2±44.04µs	+0.36%
Perform Runtime Capabilities Analysis (RCA) on large file sample	7.8±0.07ms	7.8±0.05ms	0.00%
Perform Runtime Capabilities Analysis (RCA) on teleport sample	1414.3±27.02µs	1422.3±35.32µs	+0.57%
Perform Runtime Capabilities Analysis (RCA) on the core and std libraries	27.0±0.34ms	27.3±1.27ms	+1.11%
Teleport evaluation	87.0±3.39µs	88.4±6.81µs	+1.61%

May 16 '24 06:05 github-actions[bot]

Benchmark for 1eb7d76

Click to view benchmark

Test	Base	PR	%
Array append evaluation	330.2±1.94µs	331.1±4.98µs	+0.27%
Array literal evaluation	190.2±2.46µs	189.5±0.74µs	-0.37%
Array update evaluation	408.6±2.62µs	408.5±2.88µs	-0.02%
Core + Standard library compilation	20.4±0.92ms	21.4±0.79ms	+4.90%
Deutsch-Jozsa evaluation	5.0±0.05ms	5.0±0.15ms	0.00%
Large file parity evaluation	34.0±0.13ms	34.0±0.15ms	0.00%
Large input file compilation	13.1±0.54ms	13.9±0.49ms	+6.11%
Large input file compilation (interpreter)	49.9±1.95ms	51.5±1.39ms	+3.21%
Large nested iteration	32.7±0.18ms	32.6±0.17ms	-0.31%
Perform Runtime Capabilities Analysis (RCA) on Deutsch-Jozsa sample	1584.3±79.08µs	1617.9±162.14µs	+2.12%
Perform Runtime Capabilities Analysis (RCA) on large file sample	8.1±0.15ms	8.0±0.13ms	-1.23%
Perform Runtime Capabilities Analysis (RCA) on teleport sample	1453.3±88.76µs	1466.9±117.63µs	+0.94%
Perform Runtime Capabilities Analysis (RCA) on the core and std libraries	28.2±0.31ms	28.2±0.24ms	0.00%
Teleport evaluation	87.2±3.92µs	88.4±5.30µs	+1.38%

May 20 '24 21:05 github-actions[bot]

Benchmark for 5ddfcb0

Click to view benchmark

Test	Base	PR	%
Array append evaluation	326.9±2.31µs	337.5±15.57µs	+3.24%
Array literal evaluation	177.4±1.52µs	194.6±4.09µs	+9.70%
Array update evaluation	409.8±3.69µs	417.9±5.18µs	+1.98%
Core + Standard library compilation	21.1±1.25ms	21.4±1.24ms	+1.42%
Deutsch-Jozsa evaluation	5.1±0.04ms	5.1±0.31ms	0.00%
Large file parity evaluation	34.1±0.60ms	34.1±0.48ms	0.00%
Large input file compilation	14.0±0.57ms	13.9±0.60ms	-0.71%
Large input file compilation (interpreter)	51.8±1.65ms	50.8±1.58ms	-1.93%
Large nested iteration	32.6±1.36ms	33.1±0.18ms	+1.53%
Perform Runtime Capabilities Analysis (RCA) on Deutsch-Jozsa sample	1597.4±91.15µs	1605.7±135.00µs	+0.52%
Perform Runtime Capabilities Analysis (RCA) on large file sample	8.2±0.19ms	8.0±0.15ms	-2.44%
Perform Runtime Capabilities Analysis (RCA) on teleport sample	1459.9±111.91µs	1465.3±140.93µs	+0.37%
Perform Runtime Capabilities Analysis (RCA) on the core and std libraries	28.5±0.32ms	28.5±0.43ms	0.00%
Teleport evaluation	88.6±4.44µs	87.8±3.72µs	-0.90%

May 22 '24 23:05 github-actions[bot]

Benchmark for 9504d80

Click to view benchmark

Test	Base	PR	%
Array append evaluation	338.1±13.86µs	337.3±6.25µs	-0.24%
Array literal evaluation	186.7±1.16µs	174.2±0.87µs	-6.70%
Array update evaluation	419.0±16.97µs	417.1±10.75µs	-0.45%
Core + Standard library compilation	18.7±0.20ms	18.7±0.30ms	0.00%
Deutsch-Jozsa evaluation	5.1±0.04ms	5.1±0.06ms	0.00%
Large file parity evaluation	34.1±0.13ms	34.1±0.30ms	0.00%
Large input file compilation	12.5±0.18ms	12.5±0.30ms	0.00%
Large input file compilation (interpreter)	46.5±0.82ms	46.3±0.68ms	-0.43%
Large nested iteration	33.5±0.60ms	33.3±1.45ms	-0.60%
Perform Runtime Capabilities Analysis (RCA) on Deutsch-Jozsa sample	1568.5±29.33µs	1563.2±40.29µs	-0.34%
Perform Runtime Capabilities Analysis (RCA) on large file sample	7.8±0.10ms	7.7±0.10ms	-1.28%
Perform Runtime Capabilities Analysis (RCA) on teleport sample	1427.3±30.14µs	1423.6±30.24µs	-0.26%
Perform Runtime Capabilities Analysis (RCA) on the core and std libraries	28.1±0.17ms	27.8±0.18ms	-1.07%
Teleport evaluation	89.8±5.73µs	89.5±8.07µs	-0.33%

May 31 '24 21:05 github-actions[bot]

Benchmark for 6dffb80

Click to view benchmark

Test	Base	PR	%
Array append evaluation	338.6±10.08µs	338.2±7.77µs	-0.12%
Array literal evaluation	172.9±2.00µs	183.8±0.75µs	+6.30%
Array update evaluation	416.0±10.48µs	416.2±11.49µs	+0.05%
Core + Standard library compilation	18.9±0.44ms	18.9±0.29ms	0.00%
Deutsch-Jozsa evaluation	5.0±0.06ms	5.0±0.09ms	0.00%
Large file parity evaluation	34.1±0.12ms	34.1±0.07ms	0.00%
Large input file compilation	12.4±0.13ms	12.8±0.57ms	+3.23%
Large input file compilation (interpreter)	46.9±1.01ms	46.7±1.03ms	-0.43%
Large nested iteration	32.4±0.29ms	32.9±0.51ms	+1.54%
Perform Runtime Capabilities Analysis (RCA) on Deutsch-Jozsa sample	1572.6±37.71µs	1572.1±36.08µs	-0.03%
Perform Runtime Capabilities Analysis (RCA) on large file sample	7.9±0.10ms	7.8±0.08ms	-1.27%
Perform Runtime Capabilities Analysis (RCA) on teleport sample	1433.7±32.25µs	1424.6±29.71µs	-0.63%
Perform Runtime Capabilities Analysis (RCA) on the core and std libraries	28.3±0.25ms	28.6±0.28ms	+1.06%
Teleport evaluation	87.9±4.51µs	88.2±5.91µs	+0.34%

Jun 05 '24 05:06 github-actions[bot]

I don't understand how there can be such a dramatic difference in time taken to run all samples twice vs the existing "build all samples" logic.

running all samples twice: 1.88s

existing "Building qsharp samples" step: 109.929s

If there's some black magic that makes the first one run so fast, maybe we can remove the existing step?!?

Jun 07 '24 19:06 minestarks

I don't understand how there can be such a dramatic difference in time taken to run all samples twice vs the existing "build all samples" logic. ... If there's some black magic that makes the first one run so fast, maybe we can remove the existing step?!?

Ha, yeah, it's quite dramatic, isn't it? I'll go into more detail below for posterity, but the short answer is that it does things in memory, in parallel, rather than sequentially with disk reads. That said, I think the other test is still worthwhile (and maybe can be moved to integration tests?), in part because it builds all samples including language samples rather than just the algorithm samples. As soon as a new qs file or qsharp.json project is added to the samples folders, it will get picked up automatically by this part of the build. If after this PR (and ongoing work to add verification output to the language samples) we want to pull those into a test pattern like this, then we can revisit the question of whether the older style is still needed. Longer explanation below...

When compiling these as unit tests, the test files are brought in using the include_str! macro, which tells Rust to include the contents of that file at compile time so that the running test can read the contents from the data section of the process memory rather than needing to load it from file. Rust also handily keeps track of any files pulled in this way, so that when the file is changed on disk it triggers the relevant rebuilds even if there are no other changes to the code. And during runtime, the tests are run in parallel, so can take advantage of as many cores as are available on the machine. Compare this to how the tests run using qsc in the build script: the folders are searched on disk, each file path is passed as an argument to a separate process invocation, which needs to read the file from disk in order to compile it. All that sequential process start and disk access overwhelms the actual runtime of the compiler. The tradeoff is needing to add each sample we want to test individually as test case rather than have it be auto-detected from disk. In my opinion, the scale of the perf difference makes that additional maintenance burden worthwhile.

Jun 11 '24 04:06 swernli

Can we maybe add a build script to iterate through the samples and include all of them in the Rust tests? The fact that each sample is hardcoded sadly takes away from the usefulness of these tests... I can guarantee that the next person to add a sample will not think of adding them to this list :/

Jun 13 '24 17:06 minestarks

Can we maybe add a build script to iterate through the samples and include all of them in the Rust tests? The fact that each sample is hardcoded sadly takes away from the usefulness of these tests... I can guarantee that the next person to add a sample will not think of adding them to this list :/

Not while still using expect![] macros to verify the output inline, no. Each one really does need to be individually added. If we switched to a mode where the output we compare to was stored in files, then maybe... but it would definitely get more complicated. The samples testing is kind of a twist on the old "iron triangle" adage where in this case the trade-off is test speed vs test simplicity vs test automation, and we only get to pick two.

Jun 13 '24 18:06 swernli

We could have one big giant expect![] that verifies the result of dynamically concatenated program outputs, right?

It's not obviously clear to me which way's better at the moment, you're right, each approach has its drawbacks - I'll let it sit in the back of my mind for a bit.

Jun 14 '24 18:06 minestarks

I have a couple crazy ideas on how we could add a bit more automation to this, but I need to experiment with them more before I know how bad/good the ideas are...

Jun 19 '24 06:06 swernli

Benchmark for d76d1a6

Click to view benchmark

Test	Base	PR	%
Array append evaluation	329.4±10.99µs	327.4±2.15µs	-0.61%
Array literal evaluation	192.5±1.37µs	174.8±1.76µs	-9.19%
Array update evaluation	403.2±1.30µs	408.1±15.83µs	+1.22%
Core + Standard library compilation	23.9±1.34ms	23.1±1.03ms	-3.35%
Deutsch-Jozsa evaluation	5.1±0.06ms	5.2±0.05ms	+1.96%
Large file parity evaluation	34.3±0.18ms	34.3±0.31ms	0.00%
Large input file compilation	14.2±0.52ms	14.0±0.46ms	-1.41%
Large input file compilation (interpreter)	53.3±1.47ms	53.3±1.69ms	0.00%
Large nested iteration	32.2±0.27ms	32.4±0.27ms	+0.62%
Perform Runtime Capabilities Analysis (RCA) on Deutsch-Jozsa sample	1612.9±132.56µs	1620.2±157.41µs	+0.45%
Perform Runtime Capabilities Analysis (RCA) on large file sample	8.3±0.18ms	8.2±0.15ms	-1.20%
Perform Runtime Capabilities Analysis (RCA) on teleport sample	1463.9±117.49µs	1474.3±114.19µs	+0.71%
Perform Runtime Capabilities Analysis (RCA) on the core and std libraries	28.7±0.34ms	28.9±0.49ms	+0.70%
Teleport evaluation	90.9±3.61µs	89.3±3.94µs	-1.76%

Jun 19 '24 07:06 github-actions[bot]

Benchmark for 49dcbd1

Click to view benchmark

Test	Base	PR	%
Array append evaluation	327.5±2.20µs	329.4±6.56µs	+0.58%
Array literal evaluation	174.6±1.24µs	180.9±5.15µs	+3.61%
Array update evaluation	406.4±1.88µs	408.0±2.96µs	+0.39%
Core + Standard library compilation	21.3±0.96ms	20.3±0.37ms	-4.69%
Deutsch-Jozsa evaluation	5.1±0.05ms	5.1±0.06ms	0.00%
Large file parity evaluation	34.2±0.08ms	34.2±0.16ms	0.00%
Large input file compilation	13.3±0.31ms	12.7±0.40ms	-4.51%
Large input file compilation (interpreter)	50.0±1.48ms	49.8±1.37ms	-0.40%
Large nested iteration	32.9±0.94ms	32.7±0.23ms	-0.61%
Perform Runtime Capabilities Analysis (RCA) on Deutsch-Jozsa sample	1623.3±138.44µs	1583.6±40.83µs	-2.45%
Perform Runtime Capabilities Analysis (RCA) on large file sample	7.8±0.17ms	7.8±0.09ms	0.00%
Perform Runtime Capabilities Analysis (RCA) on teleport sample	1448.6±74.39µs	1444.1±30.77µs	-0.31%
Perform Runtime Capabilities Analysis (RCA) on the core and std libraries	28.1±0.26ms	28.1±0.50ms	0.00%
Teleport evaluation	89.8±3.43µs	89.6±3.12µs	-0.22%

Jun 19 '24 08:06 github-actions[bot]

We'll also need a user-facing folder samples/testing when we get to adding samples of how to test Q# code. So, without looking at the code, I'm in favor of keeping Rust test code that does not target users outside of the samples folder so as not to confuse these two.

On Wed, Jun 19, 2024 at 7:43 AM Stefan J. Wernli @.***> wrote:

@.**** commented on this pull request.

On samples_test/Cargo.toml https://github.com/microsoft/qsharp/pull/1533#discussion_r1646331297:

I debated on that one. My thinking was that right now the samples folder is pretty cleanly for folks looking to find Q# samples. We link to it from other places, like QCOM. If there were Rust tests mixed in, especially slightly arcane ones like these that are generated from a build script, it makes the folder confusing for newcomers.

On a related note, I also debated about keeping these as a test under "compiler/qsc" where I had originally put them. But it occurred to me that with the way the build script is set up to watch the samples folder, any change to samples would cause qsc to be rebuilt, which is not ideal. This way there is a clearer division: samples_test depends on samples and qsc, so changes to either of those will rebuild samples_test, but qsc won't depend on the samples folder.

— Reply to this email directly, view it on GitHub https://github.com/microsoft/qsharp/pull/1533#discussion_r1646331297, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACNFAADBN4W5KV3T7NRWNLTZIGKKXAVCNFSM6AAAAABHZPYGA6VHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDCMRYGQ4DSNJVGA . You are receiving this because your review was requested.Message ID: @.***>

Jun 19 '24 14:06 tcNickolas

Benchmark for a40e6d2

Click to view benchmark

Test	Base	PR	%
Array append evaluation	327.5±1.39µs	330.8±2.52µs	+1.01%
Array literal evaluation	189.5±2.83µs	198.7±0.87µs	+4.85%
Array update evaluation	413.0±1.83µs	415.5±1.90µs	+0.61%
Core + Standard library compilation	22.8±0.62ms	23.1±0.63ms	+1.32%
Deutsch-Jozsa evaluation	5.1±0.07ms	5.1±0.04ms	0.00%
Large file parity evaluation	34.4±0.18ms	34.6±0.98ms	+0.58%
Large input file compilation	13.6±0.34ms	14.1±0.48ms	+3.68%
Large input file compilation (interpreter)	55.2±1.96ms	54.4±2.02ms	-1.45%
Large nested iteration	32.1±0.28ms	32.4±0.39ms	+0.93%
Perform Runtime Capabilities Analysis (RCA) on Deutsch-Jozsa sample	1612.4±126.24µs	1611.5±143.40µs	-0.06%
Perform Runtime Capabilities Analysis (RCA) on large file sample	8.0±0.14ms	8.1±0.14ms	+1.25%
Perform Runtime Capabilities Analysis (RCA) on teleport sample	1480.3±172.54µs	1477.8±178.45µs	-0.17%
Perform Runtime Capabilities Analysis (RCA) on the core and std libraries	28.7±0.27ms	28.8±0.28ms	+0.35%
Teleport evaluation	89.4±3.72µs	91.0±3.55µs	+1.79%

Jun 20 '24 19:06 github-actions[bot]

Benchmark for f46ca1c

Click to view benchmark

Test	Base	PR	%
Array append evaluation	333.4±18.58µs	329.0±2.23µs	-1.32%
Array literal evaluation	189.1±2.25µs	189.4±0.89µs	+0.16%
Array update evaluation	413.1±7.34µs	410.7±1.58µs	-0.58%
Core + Standard library compilation	21.5±1.03ms	21.2±1.42ms	-1.40%
Deutsch-Jozsa evaluation	5.1±0.04ms	5.1±0.05ms	0.00%
Large file parity evaluation	34.3±0.11ms	34.4±0.64ms	+0.29%
Large input file compilation	12.9±0.82ms	12.9±0.63ms	0.00%
Large input file compilation (interpreter)	51.1±2.19ms	50.7±2.74ms	-0.78%
Large nested iteration	32.0±0.14ms	32.0±0.41ms	0.00%
Perform Runtime Capabilities Analysis (RCA) on Deutsch-Jozsa sample	1574.0±86.12µs	1601.2±224.70µs	+1.73%
Perform Runtime Capabilities Analysis (RCA) on large file sample	8.0±0.30ms	8.2±0.36ms	+2.50%
Perform Runtime Capabilities Analysis (RCA) on teleport sample	1434.1±84.56µs	1456.7±68.23µs	+1.58%
Perform Runtime Capabilities Analysis (RCA) on the core and std libraries	28.5±0.75ms	28.6±0.40ms	+0.35%
Teleport evaluation	90.2±3.72µs	90.0±3.42µs	-0.22%

Jun 25 '24 21:06 github-actions[bot]

Benchmark for 7e1ffa5

Click to view benchmark

Test	Base	PR	%
Array append evaluation	333.7±4.75µs	331.6±2.14µs	-0.63%
Array literal evaluation	189.0±0.54µs	190.1±4.44µs	+0.58%
Array update evaluation	412.8±1.40µs	412.3±1.04µs	-0.12%
Core + Standard library compilation	22.7±0.84ms	23.4±1.31ms	+3.08%
Deutsch-Jozsa evaluation	5.1±0.05ms	5.1±0.06ms	0.00%
Large file parity evaluation	34.4±0.08ms	34.3±0.08ms	-0.29%
Large input file compilation	14.3±0.70ms	14.2±0.62ms	-0.70%
Large input file compilation (interpreter)	53.0±1.56ms	55.5±1.98ms	+4.72%
Large nested iteration	32.4±0.93ms	32.2±0.31ms	-0.62%
Perform Runtime Capabilities Analysis (RCA) on Deutsch-Jozsa sample	1622.7±152.86µs	1626.1±151.32µs	+0.21%
Perform Runtime Capabilities Analysis (RCA) on large file sample	8.2±0.13ms	8.2±0.14ms	0.00%
Perform Runtime Capabilities Analysis (RCA) on teleport sample	1481.0±143.77µs	1473.7±106.74µs	-0.49%
Perform Runtime Capabilities Analysis (RCA) on the core and std libraries	28.9±0.98ms	29.1±0.89ms	+0.69%
Teleport evaluation	90.4±3.58µs	92.3±8.87µs	+2.10%

Jun 25 '24 21:06 github-actions[bot]

Great! Now maybe we can add the language samples too and get rid of the step in Python? ;)

Finally put the ideas this question triggered into a PR: https://github.com/microsoft/qsharp/pull/1739. Is the shift away from Python, qsc-based tests worth the build.rs complexity? We can let the team discuss. But at least we know it's possible!

Jul 15 '24 14:07 swernli

qsharp qsharp copied to clipboard

Test algorithm samples with and without debugging

Benchmark for ddf9cd0

Benchmark for 1eb7d76

Benchmark for 5ddfcb0

Benchmark for 9504d80

Benchmark for 6dffb80

Benchmark for d76d1a6

Benchmark for 49dcbd1

@.**** commented on this pull request.

Benchmark for a40e6d2

Benchmark for f46ca1c

Benchmark for 7e1ffa5

qsharp
qsharp copied to clipboard