faraday
How to do paging with scan-parallel?
I'm trying to do paging with scan-parallel using :limit, but I'm not sure how to specify :last-prim-kvs in subsequent calls. Each segment needs its own last key, I presume.
Am I missing something, or is paging not implemented for parallel scan?
Hi Ulrik, sorry for the delay responding to this.
Just to clarify: have you used :limit successfully with scan and are having trouble getting it to work with scan-parallel specifically, or are you unsure how to use :limit in general and just happen to be using scan-parallel?
I haven't tested it (don't have any db creds with me atm) - but I can't think of a reason why :limit shouldn't work with scan-parallel. It's just a thin scan wrapper to help handle the segment args automatically:
(defn scan-parallel
  "Like `scan` but starts a number of worker threads and automatically handles
  parallel scan options (:total-segments and :segment). Returns a vector of
  `scan` results.
  Ref. http://goo.gl/KLwnn (official parallel scan documentation)."
  [creds table total-segments & [opts]]
  (let [opts (assoc opts :total-segments total-segments)]
    (->> (mapv (fn [seg] (future (scan creds table (assoc opts :segment seg))))
               (range total-segments))
         (mapv deref))))
As for using :limit + :last-prim-kvs: any time you see prim-kvs in a docstring/arg-name it means an argument of form {<hash-key> <val>} or {<hash-key> <val> <range-key> <val>} - i.e. the same form used by get-item, etc.
So to implement paging you'd want to do something like this [untested, don't have a db with me]:
(scan creds :my-table {:limit 2 :attr-conds {:age [:in [24 27]]}})
=> [{:age 24, :name "Steve"} {:age 27, :name "Susan"}]
(scan creds :my-table {:last-prim-kvs {:age 27 :name "Susan"} :attr-conds {:age [:in [24 27]]}})
Does that help?
I am using :limit and paging successfully with scan, but I can't understand how to do it with scan-parallel. Or is it perhaps so that paging is not possible with scan-parallel, because the order is not predictable or something?
I have around a million entries that I want to process, and I don't want to read them all into memory at once. I'm currently using scan with :limit and paging, processing a batch at a time. However, I have trouble reaching the provisioned limits using scan, so I figured I could use scan-parallel. But perhaps it's not designed to support paging.
> I am using :limit and paging successfully with scan, but I can't understand how to do it with scan-parallel
Sorry I don't have any test dbs on hand atm - it'd help if you could be a little more specific. Are you seeing an error when you replace scan with scan-parallel as in the example I provided above?
> Or is it perhaps so that paging is not possible with scan-parallel
It should be possible. Unless I'm misunderstanding something about what you're trying to do - it should literally be as simple as replacing scan with scan-parallel in your call. No args need to change. Nothing about your methodology needs to change. It should work as a drop-in replacement. What happens when you do that?
I didn't want to provide lots of details if I was completely misunderstanding the functionality of scan-parallel, but if you say that paging should work, then let's press on. I'll give you details soon. Meanwhile, consider this:
scan-parallel just passes the given opts on to each underlying scan, with the corresponding :segment number added on, right? If I want to send :last-prim-kvs as opts, like I did when I was doing paging with just scan, then how should I specify those? Each segment needs its own starting point, but as far as I can understand, I can only specify a single :last-prim-kvs. Which segment does that go to? The first? What about the other segments? It just doesn't make sense to me.
I am using :limit and paging successfully with scan:
(scan creds :my-table {:limit 1})
This will give me a vector containing the first page of entries (the actual number depends, in my case 5):
[{:id 1, :x "a"} {:id 2, :x "b"} {:id 3, :x "c"} {:id 4, :x "d"} {:id 5, :x "e"}]
In the next call, I set :last-prim-kvs to {:id 5}, to indicate that I want the scan to start after that id:
(scan creds :my-table {:last-prim-kvs {:id 5} :limit 1})
This will give me the next page of entries:
[{:id 6, :x "f"} {:id 7, :x "g"} {:id 8, :x "h"} {:id 9, :x ""} {:id 10, :x "i"}]
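The single-scan paging described here can be written as a loop. What follows is a minimal sketch, not Faraday code: `process-all-pages` and `fake-scan` are hypothetical names, and `fake-scan` is an in-memory stand-in so the loop can run without a database. It assumes the page returned by the scan function carries the key to resume from under :last-prim-kvs in its metadata; if that does not hold for your Faraday version, take the key from the last item of the page instead, as in the calls above.

```clojure
;; `scan-fn` stands in for a partially applied scan, e.g.
;; (fn [opts] (far/scan creds :my-table opts)) in real use.
(defn process-all-pages
  "Calls (scan-fn opts) page by page and hands each page to `process-page!`.
  Stops when a page has no :last-prim-kvs in its metadata.
  ASSUMPTION: the resume key is exposed as (:last-prim-kvs (meta page))."
  [scan-fn opts process-page!]
  (loop [last-key nil]
    (let [page     (scan-fn (cond-> opts last-key (assoc :last-prim-kvs last-key)))
          next-key (:last-prim-kvs (meta page))]
      (process-page! page)
      (when next-key
        (recur next-key)))))

;; Hypothetical in-memory table with ids 1..11.
(def table (mapv (fn [i] {:id i}) (range 1 12)))

(defn fake-scan
  "Hypothetical stand-in for `scan`: returns up to :limit items after
  :last-prim-kvs and attaches the next start key as metadata."
  [{:keys [limit last-prim-kvs]}]
  (let [remaining (drop-while #(<= (:id %) (:id last-prim-kvs 0)) table)
        page      (vec (take limit remaining))
        more?     (> (count remaining) limit)]
    (with-meta page
      (when more? {:last-prim-kvs (select-keys (peek page) [:id])}))))

;; Collect every item, five per page, without holding more than one page
;; inside the loop itself.
(def seen (atom []))
(process-all-pages fake-scan {:limit 5} #(swap! seen into %))
```

Only one page is live inside the loop at a time, which is the property that matters when processing around a million entries in batches.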
I can't understand how to do it with scan-parallel. The first call is obvious, though. I'm requesting two segments:
(scan-parallel creds :my-table 2 {:limit 1})
This will give me a vector of size 2, where each element is a vector containing some page of entries, not necessarily page one and two:
[[{:id 1, :x "a"} {:id 2, :x "b"} {:id 3, :x "c"} {:id 4, :x "d"} {:id 5, :x "e"}]
 [{:id 16, :x "s"} {:id 17, :x "r"} {:id 18, :x "k"} {:id 19, :x "q"} {:id 20, :x "p"}]]
What about the subsequent calls for the remaining pages? How do I specify :last-prim-kvs? If I do it like with scan, I get this error:
user=> (scan-parallel creds :my-table 2 {:last-prim-kvs {:id "5"} :limit 1})
AmazonServiceException The provided starting key is invalid: Invalid ExclusiveStartKey.
Please use ExclusiveStartKey with correct Segment. TotalSegments: 2 Segment: 1
com.amazonaws.http.AmazonHttpClient.handleErrorResponse (AmazonHttpClient.java:679)
You're saying that scan-parallel handles the segment args automatically, but the :last-prim-kvs will be different for each segment. I can see that it could deduce which :last-prim-kvs should go to which segment, if I could pass a vector of :last-prim-kvs maps, but I don't seem to be able to pass a vector. And besides, the pages are not deterministic, it seems, so I fear that scan-parallel can not be used with paging.
Hi, closing this - assuming it's gone stale?
I couldn't get it to work, but I still don't know if I did something wrong or if there is something missing in faraday.
Yeah, sorry - I'm actually not using DynamoDB myself at the moment. Not sure off hand, and don't have any test dbs handy to look into this quickly. Would need to spend some proper time to dig into the DDB docs + API to confirm: may be a DDB limitation, or a Faraday limitation that needs fixing.
Will reopen in case I do find some time in future, or someone else has some input.
Really sorry to leave you hanging on this, wasn't intentional.
Quick Google yielded this: http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Scan.html
In a parallel scan, a Scan request that includes ExclusiveStartKey must specify the same segment whose previous Scan returned the corresponding value of LastEvaluatedKey.
So it seems like parallel scans should be pageable, but Faraday's scan implementation would need some work to allow this to be automatic. Have made a TODO note in the code though realistically don't think I'll personally have time to look into this near-term.
You may be able to use scan directly and feed it the necessary parallel segment info; not sure how tricky that'd be to do.
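One way to read that suggestion, given the DynamoDB rule quoted above that a start key must go back to the segment that produced it: run one paging loop per segment, each keeping its own :last-prim-kvs. The sketch below is untested against a real table. `scan-segment-pages!`, `scan-parallel-paged!` and `fake-segment-scan` are hypothetical names, the fake scan is an in-memory stand-in, and the code assumes the resume key is available as :last-prim-kvs in the page's metadata.

```clojure
(defn scan-segment-pages!
  "Pages through ONE segment, feeding that segment's own last key back in.
  ASSUMPTION: `scan-fn` takes an opts map (including :segment and
  :total-segments) and returns a page whose metadata holds :last-prim-kvs
  while more data remains in the segment."
  [scan-fn opts total-segments segment process-page!]
  (loop [last-key nil]
    (let [page     (scan-fn (cond-> (assoc opts
                                           :total-segments total-segments
                                           :segment segment)
                              last-key (assoc :last-prim-kvs last-key)))
          next-key (:last-prim-kvs (meta page))]
      (process-page! segment page)
      (when next-key
        (recur next-key)))))

(defn scan-parallel-paged!
  "Starts one future per segment and blocks until every segment is exhausted."
  [scan-fn opts total-segments process-page!]
  (->> (range total-segments)
       (mapv (fn [seg]
               (future (scan-segment-pages! scan-fn opts total-segments
                                            seg process-page!))))
       (run! deref)))

;; Hypothetical in-memory table; an item belongs to segment (mod id n).
(def segmented-table (mapv (fn [i] {:id i}) (range 1 21)))

(defn fake-segment-scan
  "Hypothetical stand-in for `scan` with parallel-scan options."
  [{:keys [limit last-prim-kvs total-segments segment]}]
  (let [mine      (filter #(= segment (mod (:id %) total-segments)) segmented-table)
        remaining (drop-while #(<= (:id %) (:id last-prim-kvs 0)) mine)
        page      (vec (take limit remaining))]
    (with-meta page
      (when (> (count remaining) limit)
        {:last-prim-kvs (select-keys (peek page) [:id])}))))

;; `process-page!` runs on several threads at once, so it must be thread-safe.
(def seen-ids (atom #{}))
(scan-parallel-paged! fake-segment-scan {:limit 3} 2
                      (fn [_segment page] (swap! seen-ids into (map :id page))))
```

In real use `scan-fn` would be something like `(fn [opts] (far/scan creds :my-table opts))`. Because each segment advances independently, pages arrive in no particular order across segments, which matches the non-deterministic pages observed earlier in the thread.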
PRs super welcome if you (or anyone else) feels like taking a stab at this!
Cheers :-)
Is this something worth fixing for a Faraday noob? Are you open to reviewing a PR on this? Any thoughts about a reasonable solution direction?
I think it would be nice to get this fixed @barkanido given we have the beginning of an implementation, so I expect PRs would be welcomed by the community.
Having said that however, you could probably just manage the threads and paging yourself and just call scan directly, and personally, that's the approach I'd prefer here.
I have an implementation of a lazy-paged-query that will manage query paging automatically, and I'm glad that it was easy to build on top of Faraday, but don't feel that it should be part of the API. Handling thread-pools and paging of large result-sets of scans fits into the same category for me. Of course I don't speak for the community here at all though, it's just my opinion!
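For readers wondering what paging built on top of Faraday might look like, here is a rough sketch of a lazy sequence of pages. It is not the lazy-paged-query implementation mentioned above, which is not shown in this thread. `lazy-pages` and `fake-query` are hypothetical names, and the code assumes the resume key is exposed as :last-prim-kvs in the page's metadata.

```clojure
(defn lazy-pages
  "Returns a lazy seq of pages. A request is only issued when the seq is
  consumed up to that page.
  ASSUMPTION: `fetch-fn` takes an opts map and returns a page whose metadata
  holds :last-prim-kvs while more data remains."
  ([fetch-fn opts] (lazy-pages fetch-fn opts nil))
  ([fetch-fn opts last-key]
   (lazy-seq
     (let [page     (fetch-fn (cond-> opts last-key (assoc :last-prim-kvs last-key)))
           next-key (:last-prim-kvs (meta page))]
       (cons page
             (when next-key
               (lazy-pages fetch-fn opts next-key)))))))

;; Hypothetical stand-in for a query or scan over ids 1..10 that counts
;; how many requests were made.
(def calls (atom 0))

(defn fake-query
  [{:keys [limit last-prim-kvs]}]
  (swap! calls inc)
  (let [start (:id last-prim-kvs 0)
        page  (mapv (fn [i] {:id i})
                    (range (inc start) (min 11 (+ start limit 1))))]
    (with-meta page
      (when (< (+ start limit) 10)
        {:last-prim-kvs {:id (+ start limit)}}))))

;; Taking the first page issues a single request; the rest stay unrealized.
(def first-page (first (lazy-pages fake-query {:limit 4})))
```

In real use `fetch-fn` could wrap either query or scan, for example `(fn [opts] (far/scan creds :my-table opts))`. Keeping this outside the library leaves the caller in control of when requests happen.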
@kipz fair enough. Maybe your example deserves a place here as something people can refer to. Or even in the README. Just a thought. Anyway I was just looking for a way to contribute and found this issue. Maybe a task from the TODO is of higher priority?
@joelittlejohn what are your thoughts on all this?
@barkanido Re your question about whether this ticket is a good one for a Faraday noob to tackle, it's probably not :-) The existing paging implementation is one of the most complex parts of Faraday and, as @kipz mentions, people have often found that they prefer to avoid the paging feature altogether and implement their own solution (over which they have more control) outside Faraday.
Is this a feature you need or were you just interested in making a useful contribution? I think the most useful thing to be done for Faraday is better documentation. Better docstrings would help, and I think it would also be very useful to have a list of examples that show real-world usage covering all typical ways to use Faraday's functions.
For a Faraday noob that wants to contribute something useful, I recommend using the library for a while on a few projects; over time you will inevitably uncover something you'd like that is missing.