aws-s3
Find doesn't work on buckets with many thousands of items (patch included)
The current find wasn't working for me on a bucket with a few thousand items that I'm using to cache documents. I was basically trying to find a doc, then store it if it didn't exist, but find could never locate the document even after it had been stored.
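For reference, the caching pattern is basically find-or-store. Here's a minimal sketch against the aws-s3 gem's S3Object API (the credentials, bucket name, key, and rendered_html are made up for illustration):

require 'aws/s3'
include AWS::S3

Base.establish_connection!(
  :access_key_id     => 'my-access-key',
  :secret_access_key => 'my-secret'
)

rendered_html = '<html>...</html>'

begin
  # Look the document up in the cache bucket...
  doc = S3Object.find('docs/report.html', 'document-cache')
rescue NoSuchKey
  # ...and store it if it isn't there yet. With the broken find, this
  # branch ran every time on the big bucket.
  S3Object.store('docs/report.html', rendered_html, 'document-cache')
end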
If I switched to a brand new bucket with zero items, it saved items properly.
Digging through the find code, it looked like lookups failed whenever the key wasn't in the first 'chunk' of the bucket listing (S3 returns listings in chunks of 1000 keys by default).
Attached is a patch that goes through each chunk until it finds the item, or raises a NoSuchKey error if the item isn't found.
It fixed my problem locally.
Okay, I guess I can't attach files here, so here's the patch inline...
From a3203122831aadbdd7ec641057c949cd7f65db3b Mon Sep 17 00:00:00 2001
From: Grant Olson
Date: Mon, 19 Jul 2010 17:02:24 -0400
Subject: [PATCH] Find that works for buckets with many thousand items

---
 lib/aws/s3/object.rb |   50 +++++++++++++++++++++-----------------------------
 1 files changed, 21 insertions(+), 29 deletions(-)

diff --git a/lib/aws/s3/object.rb b/lib/aws/s3/object.rb
index bcdf9e1..95b5296 100644
--- a/lib/aws/s3/object.rb
+++ b/lib/aws/s3/object.rb
@@ -143,41 +143,33 @@ module AWS
         # Returns the object whose key is name in the specified bucket. If the specified key does not
         # exist, a NoSuchKey exception will be raised.
         def find(key, bucket = nil)
-          # N.B. This is arguably a hack. From what the current S3 API exposes, when you retrieve a bucket, it
-          # provides a listing of all the files in that bucket (assuming you haven't limited the scope of what it returns).
-          # Each file in the listing contains information about that file. It is from this information that an S3Object is built.
+          # Bucket results come in chunks, 1000 by default.
+          # if the key isn't in the first chunk, we need to look through
+          # subsequent chunks until we find it.
           #
-          # If you know the specific file that you want, S3 allows you to make a get request for that specific file and it returns
-          # the value of that file in its response body. This response body is used to build an S3Object::Value object.
-          # If you want information about that file, you can make a head request and the headers of the response will contain
-          # information about that file. There is no way, though, to say, give me the representation of just this given file the same
-          # way that it would appear in a bucket listing.
-          #
-          # When fetching a bucket, you can provide options which narrow the scope of what files should be returned in that listing.
-          # Of those options, one is marker which is a string and instructs the bucket to return only object's who's key comes after
-          # the specified marker according to alphabetic order. Another option is max-keys which defaults to 1000 but allows you
-          # to dictate how many objects should be returned in the listing. With a combination of marker and max-keys you can
-          # *almost* specify exactly which file you'd like it to return, but marker is not inclusive. In other words, if there is a bucket
-          # which contains three objects who's keys are respectively 'a', 'b' and 'c', then fetching a bucket listing with marker set to 'b' will only
-          # return 'c', not 'b'.
-          #
-          # Given all that, my hack to fetch a bucket with only one specific file, is to set the marker to the result of calling String#previous on
-          # the desired object's key, which functionally makes the key ordered one degree higher than the desired object key according to
-          # alphabetic ordering. This is a hack, but it should work around 99% of the time. I can't think of a scenario where it would return
-          # something incorrect.
-
           # We need to ensure the key doesn't have extended characters but not uri escape it before doing the lookup and comparing since if the object exists,
           # the key on S3 will have been normalized
-          key    = key.remove_extended unless key.valid_utf8?
-          bucket = Bucket.find(bucket_name(bucket), :marker => key.previous, :max_keys => 1)
-          # If our heuristic failed, trigger a NoSuchKey exception
-          if (object = bucket.objects.first) && object.key == key
-            object
-          else
-            raise NoSuchKey.new("No such key `#{key}'", bucket)
+          key = key.remove_extended unless key.valid_utf8?
+          bkt_name = bucket_name bucket
+          partial_bucket = Bucket.find(bkt_name)
+
+          while not partial_bucket.nil?
+            last_key = nil
+            partial_bucket.each do |s3object|
+              last_key = s3object.key
+              return s3object if last_key == key.to_s
+            end
+            if partial_bucket.is_truncated
+              partial_bucket = Bucket.find(bkt_name, :marker => last_key)
+            else
+              partial_bucket = nil
+            end
           end
+
+          raise NoSuchKey.new("No such key `#{key}'", bucket)
         end
 
+        # Makes a copy of the object with key to copy_key, preserving the ACL of the existing object if the :copy_acl option is true (default false).
         def copy(key, copy_key, bucket = nil, options = {})
           bucket = bucket_name(bucket)
-- 
1.6.5.1
I found this article helpful in getting past the 1000-key limit: http://jakanapes.com/blog/2010/11/01/s3s-object-limit/
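In the same spirit as that article, if you just need to walk every object in a large bucket, the marker-based paging from the patch works as a standalone loop too. A minimal sketch against the aws-s3 gem (each_object is a made-up helper name, and it assumes Base.establish_connection! has already been called as above):

require 'aws/s3'
include AWS::S3

# Walk a bucket chunk by chunk (up to 1000 keys per listing),
# restarting each request from the last key we saw.
def each_object(bucket_name)
  options = {}
  loop do
    chunk = Bucket.find(bucket_name, options)
    chunk.each { |s3object| yield s3object }
    break unless chunk.is_truncated
    options = { :marker => chunk.objects.last.key }
  end
end

each_object('document-cache') { |obj| puts obj.key }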