dynamoid
dynamoid copied to clipboard
indexing and ordering/sorting results?
Hey there, forgive me for not totally getting the docs or if i've overlooked, but how would I go about indexing an attribute and ordering/sorting results by that attribute? Lets say I have a simple model for posts, with only a couple attributes:
class Post
include Dynamoid::Document
field :content
field :votes, :integer, { default: 0 }
end
And I want to index the :votes attribute for sorting/ordering by number of votes, or the :content attribute alphabetically, or by the created_at timestamp. Ex: @posts.order('created_at DESC')
Could you give me some clarity? Also, do we need support for "Local secondary index's" as well as "Global Secondary Index's?"
DynamoDB doesn't do general sorting without using what is known as the Range key since the general structure is a Key Value Store.
Quick Overview of Sorting
Thus, your sorting is based on items with the same HashKey. So for example we could have this as a Topics table:
Forum | Topic | PostCount |
---|---|---|
General | Hello World | 10 |
General | For Your Information | 25 |
General | Vote for New Moderator | 13 |
Health and Science | The New Cure for Allergies | 10 |
Health and Science | Gluten, is it good for you? | 3 |
Technology | The New iPhone X | 142 |
If we have Forum as our HashKey and PostCount as our Range key for the entire table then we can get the "topics" sorted in each forum by post count:
Topics.where(forum: "General").order('post_count DESC')
This will return your topics in descending order in the General Forum. And in this case we don't need a Secondary index because our base table has the Range key.
Post Example
So now for your post example, this is a bit harder, I'm not sure what your HashKey would be, but in order to sort, you need a HashKey of the same value across your various rows.
The general Post example would likely be:
Id | Content | Votes |
---|---|---|
1 | Top 10 Ruby Gems You Should Use | 10 |
2 | The Best Tacos in Town | 3 |
3 | Follow These Users For Maximum Success | 13 |
In this case, our HashKey might be Id and the Votes is our Range Key, however, Range is sorted PER HashKey which means, you actually can't query this table and get your posts sorted by votes with this schema.
Instead you would have to add maybe "Site" to your schema and so each Post corresponds to a particular Site.
Id | Content | Votes | Site |
---|---|---|---|
1 | Top 10 Ruby Gems You Should Use | 10 | My Blog |
2 | The Best Tacos in Town | 3 | My Blog |
3 | Follow These Users For Maximum Success | 13 | My Blog |
4 | What Happens Next Will Surprise You | 10 | Jerry's Blog |
5 | Top 10 Conspiracy Theories With Rick | 3 | Jerry's Blog |
In this schema, we now can see we can get our posts sorted by "votes" by having the HashKey be Site and we'd do:
Post.where(site: "My Blog").order("votes DESC")
Again, this HashKey part is important!
Posts Multiple Sorts
Building off this example, since our HashKey is Site we can't have multiple range keys on the table. You can create Local Secondary Indices (which is basically adding more range keys to the table :P) so our schema could be:
Type | Name | HashKey | RangeKey |
---|---|---|---|
Table | posts | Site | Votes |
Local Secondary Index | content_index | Content | |
Local Secondary Index | id_index | Id | |
Local Secondary Index | created_at_index | CreatedAt |
By doing this we can then do queries like:
Post.where(site: "My Blog").order("content DESC")
Post.where(site: "My Blog").order("id DESC")
Post.where(site: "My Blog").order("created_at DESC")
Again, I must repeat, you MUST have same HashKey in order to sort, thus we cannot ask "Give me ALL POSTS sorted by Votes" in this scheme without introducing a new attribute or saying having all Site values be the exact same across all rows (Hope that makes sense)
Local Secondary Indices vs Global Secondary Indices
So the difference is that Local Secondary Indices:
- Must be created when table is created
- Must share the same hash key that your table has
- Must identify the range key to use
Whereas a Global Secondary Index may have a different Hash Key and a different range key.
However, be careful because no partition (determined by HashKey) can have more than 10GB of data (see http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/LSI.html#LSI.ItemCollections.SizeLimit) thus in the example here, all rows with Site as "My Blog" will end up same partition due to same HashKey and all these rows must be contained within 10 GB of space.
Hope this has some clarification for you, let me know if you have any follow-up questions or need clarification on the above examples.
Thanks for writing that up @richardhsu ! Very insightful. I have made this into a basic intro wiki.
Awesome @richardhsu, much appreciated!
Why we don't have order
method anymore?
@rasouza Can you post a code snippet? Are you referring to current master HEAD? @andrykonchin Did this change?
Not sure if there was any method like order
, at least since 2016.
There is only scan_index_forward
method (that means ScanIndexForward Query
option)
@andrykonchin The documentation for how to do sorting is based heavily on an order
method.
Perhaps it was never right?
https://github.com/Dynamoid/Dynamoid/issues/193#issuecomment-330973868
and
https://github.com/Dynamoid/Dynamoid/wiki
@richardhsu can you shed any light on this?
Perhaps it was never right?
@pboling Looks like yes
@rasouza @pboling @andrykonchin Hmm this is very strange, I must have had a mind blip because I can't seem to find order
method either and going back through my DynamoDB projects, I never used order
, but was doing things through scan_index_forward
. My apologies for inaccurate documentation though I think the usage of how order should be is correct though and the requirements on secondary indexes still holds true.
We can implement this at least based on such but only works for Query
operations and not Scan
as mentioned by @andrykonchin so that instead of doing something like Post.where(forum: 'General').scan_index_forward
we'd do Post.where(forum: 'General').order('post_count DESC')
or even Post.where(forum: 'General').order(post_count: :desc)
.
Again apologies for incorrect documentation back then. Thoughts on the documentation and getting a method implemented to support the similar ActiveRecord interface?
I like the proposed DSL Post.where(forum: 'General').order(post_count: :desc)
and I am for removing existing public method scan_index_forward
.
But the sorting isn't useful in DynamoDB because user has to specify partition/hash key. That's why actual code will be following:
Post.where(id: 'General').order(post_count: :desc)
Query without partition key of a table/index doesn't work. In this case I guess an exception should be raised. The name of sorted column can be used to find the index which will be used...
@pboling @richardhsu It's a breaking change so I would like to include it in the next major release in August. If nobody has objections I will work on it.
@andrykonchin Yes agreed on proposal. However, since there can be secondary indexes, it isn't required that it be id
which would be for the primary index. So long as the query has the hash key of one of the indices and the ordering is on the range key of that matched index. And yes an exception raised if no valid index is found is proper. Thanks!
I've got
NoMethodError (undefined method 'order' for #<Dynamoid::Criteria::Chain:0x00007ff0de0a0fd8>)
Any suggestion?
@avatarr There is no order
method still. Though you can control order of the result with scan_index_forward
method.
Hi, I have a problem trying to order the results of a dynamodb query.
My table has a partition key (a string 'imei') and a sort_key(also a string 'gps_timestamp')
When I execute this query:
GpsPoint.where('imei.begins_with': "3", 'gps_timestamp.begins_with': Time.now.strftime("%Y-%m-%d")).scan_index_forward(true).record_limit(20)
and this query:
GpsPoint.where('imei.begins_with': "3", 'gps_timestamp.begins_with': Time.now.strftime("%Y-%m-%d")).scan_index_forward(false).record_limit(20)
(changing scan_index_forward to false)
I get the same 20 results, how is it so?
(I always get the first 20 points of the day)
To be clear, each day we receive hundreds of points, so it's not due to a lack of data in the DB.
It seems like scan_index_forward has no effect, since the behavior of the query is the same with or without the call (I tried also not calling it looking for any difference in the results).
Also, removing record_limit(20) has no effect on ordering either.
Am I not understanding something or should I get different results?
In this particular case the Scan operation will be used (as far as the primary key value isn't specified). scan_index_forward
method takes any effect only in the Query operation and ignored in the Scan.
Probably the correct behavior would be to raise error when the Scan is chosen and scan_index_forward
was used.