Passing in a custom Iterable as a data provider causes Spock to iterate over each item multiple times
Describe the bug
Hi,
I've been using Spock's data driven testing features to pass in a custom Iterable that requests files from an external DB and iterates over each file returned. I'm using the custom Iterable approach documented here as opposed to getting all files initially and using the results to avoid keeping all input data across test cases in memory. While the code works, it makes extra calls to the DB that aren't required to execute the test. These calls extend test runtime and make unnecessary requests to the data store on every execution. However, if I call .iterator on the Iterable, and pass that in as the data provider, the test works as expected, making only the necessary calls to execute the test.
After digging through the code, I found that Spock has this logic that estimates the number of iterations on an Iterable, but does not do so for the Iterator. This is the only reason I could find why the custom Iterator is being constructed multiple times and the unnecessary calls are being made.
This is confusing to the user since the Iterator is being run through multiple times without any indication to the user or documentation why. In my opinion, we should make this behavior more clear to the end user and/or provide a way to specify not to estimate the number of iterations, since it may lead to requesting data multiple times in a way that's not transparent to the user. I've attached a simplified example.
While passing in an iterator worked for me, it's non-intuitive and required going through the Spock code. Other users might also face this issue in the future and more documentation combined with a way to turn this feature off would be extremely helpful. Thanks!
To Reproduce
package org.spockframework.util
import spock.lang.Specification
class IterableRunTwiceSpec extends Specification {
// calls iterator constructor multiple times
def "iterable_run_twice"(String data) {
expect:
data == "item from DB"
where:
data << new ExpensiveIterable();
}
// only makes the minimum number of queries required
def "iterable_as_iterator_run_once"() {
expect:
data == "item from DB"
where:
data << new ExpensiveIterable().iterator()
}
}
class DataStore {
int queriesMadeToDB = 0;
DataStore() {
System.out.println("Making connection to data store")
}
String getNext() {
System.out.println("fetching item " + queriesMadeToDB++ + " from DB");
return "item from DB";
}
}
class ExpensiveIterable implements Iterable<String> {
DataStore dataStore;
ExpensiveIterable() {
System.out.println("Constructing expensive iterable");
dataStore = new DataStore();
}
@Override
Iterator<String> iterator() {
return new ExpensiveIterator(dataStore);
}
// get first 5 items from DB, only makes request to DB when next item required
class ExpensiveIterator implements Iterator<String> {
int queries = 0;
DataStore dataStore;
// Makes calls to same data store repeatedly
ExpensiveIterator(final DataStore dataStore) {
this.dataStore = dataStore;
System.out.println("Constructing expensive iterator");
}
// Other option: constructs connection to data store multiple times (expensive when constructor performs some work)
ExpensiveIterator() {
dataStore = new DataStore();
System.out.println("Constructing expensive iterator")
}
@Override
boolean hasNext() {
return queries != 5;
}
@Override
String next() {
queries = queries + 1
return dataStore.next;
}
}
}
The first test case prints:
Constructing expensive iterable
Making connection to data store
Constructing expensive iterator
fetching item 0 from DB
fetching item 1 from DB
fetching item 2 from DB
fetching item 3 from DB
fetching item 4 from DB
Constructing expensive iterator
fetching item 5 from DB
fetching item 6 from DB
fetching item 7 from DB
fetching item 8 from DB
fetching item 9 from DB
Constructing expensive iterator
fetching item 10 from DB
fetching item 11 from DB
fetching item 12 from DB
fetching item 13 from DB
fetching item 14 from DB
the first test case with the second Iterator constructor prints:
Constructing expensive iterable
Making connection to data store
Making connection to data store
Constructing expensive iterator
fetching item 0 from DB
fetching item 1 from DB
fetching item 2 from DB
fetching item 3 from DB
fetching item 4 from DB
Making connection to data store
Constructing expensive iterator
fetching item 0 from DB
fetching item 1 from DB
fetching item 2 from DB
fetching item 3 from DB
fetching item 4 from DB
Making connection to data store
Constructing expensive iterator
fetching item 0 from DB
fetching item 1 from DB
fetching item 2 from DB
fetching item 3 from DB
fetching item 4 from DB
and the second test case (with the first Iterator constructor) prints:
Constructing expensive iterable
Making connection to data store
Constructing expensive iterator
fetching item 0 from DB
fetching item 1 from DB
fetching item 2 from DB
fetching item 3 from DB
fetching item 4 from DB
Expected behavior
I expect that when passing in an object implementing the Iterable contract as a data provider, Spock only constructs a single Iterator and requests only the data necessary to complete the execution by default. The last output (where the Iterator is run through only once) should be what happens in every scenario.
Actual behavior
When passing in an Iterable as a data provider, the framework constructs the Iterator 3 times and fetches the items from the Iterator 3 times instead of once. In my use case, this led to making 3x the amount of requests required and a slower test runtime.
Java version
Java 17
Buildtool version
Gradle 8.10.2
What operating system are you using
Mac
Dependencies
__$$asciidoctorj$$___d (n)
+--- org.asciidoctor:asciidoctorj:2.5.7 (n)
\--- org.asciidoctor:asciidoctorj-diagram:2.2.3 (n)
__$$asciidoctorj$$___r
+--- org.asciidoctor:asciidoctorj:2.5.7 -> 2.5.13
| +--- com.beust:jcommander:1.82
| +--- org.asciidoctor:asciidoctorj-api:2.5.13
| \--- org.jruby:jruby:9.4.7.0 -> org.jruby:jruby-complete:9.4.7.0
\--- org.asciidoctor:asciidoctorj-diagram:2.2.3
+--- org.asciidoctor:asciidoctorj-diagram-plantuml:1.2022.5
\--- org.asciidoctor:asciidoctorj-diagram-ditaamini:1.0.3
asciidoctorExtensions
\--- spockbuild:asciidoc-extensions -> project :build-logic:asciidoc-extensions
checkstyle - The Checkstyle libraries to be used for this project.
+--- com.puppycrawl.tools:checkstyle:9.3
| +--- info.picocli:picocli:4.6.2
| +--- org.antlr:antlr4-runtime:4.9.3
| +--- commons-beanutils:commons-beanutils:1.9.4
| | \--- commons-collections:commons-collections:3.2.2
| +--- com.google.guava:guava:31.0.1-jre
| | +--- com.google.guava:failureaccess:1.0.1
| | +--- com.google.guava:listenablefuture:9999.0-empty-to-avoid-conflict-with-guava
| | +--- com.google.code.findbugs:jsr305:3.0.2
| | +--- org.checkerframework:checker-qual:3.12.0
| | +--- com.google.errorprone:error_prone_annotations:2.7.1
| | \--- com.google.j2objc:j2objc-annotations:1.3
| +--- org.reflections:reflections:0.10.2
| | +--- org.javassist:javassist:3.28.0-GA
| | \--- com.google.code.findbugs:jsr305:3.0.2
| \--- net.sf.saxon:Saxon-HE:10.6
\--- io.spring.nohttp:nohttp-checkstyle:0.0.11
+--- com.puppycrawl.tools:checkstyle:8.33 -> 9.3 (*)
\--- io.spring.nohttp:nohttp:0.0.11
\--- ch.qos.logback:logback-classic:1.2.3
\--- ch.qos.logback:logback-core:1.2.3
jacocoAgent - The Jacoco agent to use to get coverage data.
\--- org.jacoco:org.jacoco.agent:0.8.12
jacocoAnt - The Jacoco ant tasks to use to get execute Gradle tasks.
\--- org.jacoco:org.jacoco.ant:0.8.12
+--- org.jacoco:org.jacoco.core:0.8.12
| +--- org.ow2.asm:asm:9.7
| +--- org.ow2.asm:asm-commons:9.7
| | +--- org.ow2.asm:asm:9.7
| | \--- org.ow2.asm:asm-tree:9.7
| | \--- org.ow2.asm:asm:9.7
| \--- org.ow2.asm:asm-tree:9.7 (*)
+--- org.jacoco:org.jacoco.report:0.8.12
| \--- org.jacoco:org.jacoco.core:0.8.12 (*)
\--- org.jacoco:org.jacoco.agent:0.8.12
nohttp
\--- io.spring.nohttp:nohttp-checkstyle:0.0.11
+--- com.puppycrawl.tools:checkstyle:8.33
| +--- info.picocli:picocli:4.3.1
| +--- antlr:antlr:2.7.7
| +--- org.antlr:antlr4-runtime:4.8-1
| +--- commons-beanutils:commons-beanutils:1.9.4
| | +--- commons-logging:commons-logging:1.2
| | \--- commons-collections:commons-collections:3.2.2
| +--- com.google.guava:guava:29.0-jre
| | +--- com.google.guava:failureaccess:1.0.1
| | +--- com.google.guava:listenablefuture:9999.0-empty-to-avoid-conflict-with-guava
| | +--- com.google.code.findbugs:jsr305:3.0.2
| | +--- org.checkerframework:checker-qual:2.11.1
| | +--- com.google.errorprone:error_prone_annotations:2.3.4
| | \--- com.google.j2objc:j2objc-annotations:1.3
| \--- net.sf.saxon:Saxon-HE:9.9.1-7
\--- io.spring.nohttp:nohttp:0.0.11
+--- ch.qos.logback:logback-classic:1.2.3
| +--- ch.qos.logback:logback-core:1.2.3
| \--- org.slf4j:slf4j-api:1.7.25 -> 1.7.26
\--- org.slf4j:slf4j-api:1.7.26
nohttp-cli
\--- io.spring.nohttp:nohttp-cli:0.0.11
+--- info.picocli:picocli:3.9.5
\--- io.spring.nohttp:nohttp:0.0.11
+--- ch.qos.logback:logback-classic:1.2.3
| +--- ch.qos.logback:logback-core:1.2.3
| \--- org.slf4j:slf4j-api:1.7.25 -> 1.7.26
\--- org.slf4j:slf4j-api:1.7.26
Additional context
Please let me know if you need any more information. My personal suggestion would be to provide both documentation and a required flag whether the Iterable is safe to call .size() on. Custom data providers are one of Spock's biggest value-adds and I think the experience could be made even better with this change.
PR #2027 will fix the issue, by checking if the Iterable is also a Collection and if not (so no size() method by contract), it will treat it the same as an unknown iterator.
I don't think anything should be changed here. At most maybe adding some more clarifying documentation. There is no way to know whether a data provider is expensive or not.
There are multiple ways to care for this from user-side.
The user can give in an Iteratror as they are only one-time consumable.
Or they can make sure that the iterable caches the expensive operation.
The only way would be a white-list of things that are definitely safe and non-expensive and that would take away much flexibility from Spock users.
I tend to agree with @Vampire and I also acknowledge that the documentation is misleading, especially the sentence:
Data providers are queried for their next value only when needed (before the next iteration).
We should extend the documentation with a technical note about the size() and the mitigation of just calling .iterator() if you don't want multiple invocations.
I have changed #2027 to add the note and removed the change in the data provides.
I am still not convinced that this is the right choice, because the call to the Groovy default size() method, has never a benefit IMHO. We iterate the iterable 3 times, to "estimate" the size 2 times and once for test execution.
How can this be better for the user instead of just don't estimate the size?
How can this be better for the user instead of just don't estimate the size?
Again, you never know whether a size() method is unsafe or not.
An implementation of Iterable that is not a Collection could for example have a built-in size() method that is efficient.
Or it could have an iterator() method that once fetches the values from the database and then always returns an iterator over the cached values and thus also be efficient and so on.
The point is, that Spock specs are exclusively written in Groovy currently, so we are in Groovy land and should follow Groovy semantics or the next user will come and complain it doesn't work as expected.
In Groovy semantics you have duck-typing as you for sure know, so if something has a size() method, you can call the size() method and get the size of that object.
And if something has an iterator() method that returns an Iterator, Groovy semantics say you can iterate that object.
And that is exactly happens and what our docs say with "Any object that Groovy knows how to iterate over can be used as a data provider.", which even includes Iterator which in Groovy has an iterator() method that returns itself.
And for any object Groovy knows how to iterate, it usually also knows how to calculate the size on.
Just for Iterator#size() it specifies that this of course drains the iterator, so for Iterator we do not call size() as it would not only be slow but make the tests execute wrongly as the iterator gets drained (and if the size() method does not return a positive number in which case we ignore the result).
Whether a user provides an efficient or inefficient object Groovy knows how to iterate over and calulate the size of is totally out of our control and we can never know if a given instance provides an efficient or inefficient implementation. So as I said, we cannot know whether calling size() will be efficient or not and thus should not just assume it is inefficient just because it implements a random interface but not another one which in Groovy semantics is anyway just hollow words. The only thing we imho sanely can do indeed is to better warn the user that a data provider could be iterated multiple times and that he should either provide an efficient implementation, cope with the penalty, or supply an Iterator instead.
Also, the "estimated num iterations" is only "estimated" for two reasons.
- For the probably very rare case that a data provider has an
iterator()method but nosize()method - For the case that an
Iteratoris given as data provider that does not have a built-insize()method as then the GDKsize()method must not be used because of the draining
That size() is called twice though indeed is a regression and totally unnecessary. This will be fixed with #2032.