google-cloud-cpp Provide a mechanism to save bucket pagination ranges

This is motivated by #5703. Applications may need to persist the state of pagination iterators, for example, because they are implementing a web service that returns pages of results.

I suggested this API:

struct Page {
  std::vector<std::string> bucket_names;
  std::string reader_state;
};

Page BucketsPage(gcs::Client client, std::string reader_state) {
  auto reader = client.ListBuckets(gcs::RestoreBucketReader(
      reader_state /* an empty string starts a new reader */));
  std::vector<std::string> names;
  for (auto& b : reader) {
    if (!b || names.size() >= 100) break;
    names.push_back(b->name());
  }
  return Page{std::move(names), std::move(reader).Checkpoint()};
}

I think that this problem applies to any pagination -> iterator mapping.

Jan 25 '21 14:01 coryan

@coryan Hello! I would like to take this as first issue to get involved. Does anybody solve this problem? Can I take this?

Jun 28 '21 20:06 Ayrtat

@Ayrtat nobody is working on this at the moment. If you are interested, I recommend you make sure you can run unit tests and integration tests locally:

https://github.com/googleapis/google-cloud-cpp/blob/main/doc/contributor/howto-guide-running-ci-builds-locally.md

https://github.com/googleapis/google-cloud-cpp/blob/main/doc/contributor/howto-guide-setup-development-workstation.md

We can then discuss how the Checkpoint() and RestoreBucketRead() would look like.

Jun 28 '21 20:06 coryan

@coryan Hello, sorry for late response! I followed the instructions and have run up CI tests:

100% tests passed, 0 tests failed out of 336

Jul 01 '21 21:07 Ayrtat

Very cool. I guess at this point we need to start figuring out how these functions would look like. Let me start by saying that my proposal is going to seem very complicated. The additional convolutions are to keep the public API as narrow as possible. Every function and type in the public API is a thing we need to support more-or-less forever, so we try to create as few as possible, sometimes jumping through some hoops to make it so.

Also, if you can think of a better design, please do speak up! I have no monopoly on good ideas. Just keep in mind the costs of public APIs.

I imagine we would start with something like:

#include "google/cloud/storage/internal/complex_options.h"

namespace google::cloud::storage { // C++17 for exposition-only
class PaginationToken;
namespace internal {
std::string GetTokenValue(PaginationToken const&);
PaginationToken MakePaginationToken(std::string value);
}

class PaginationToken {
 public:
  // This class should be opaque, application developers should not be able to create them, or access the internal state.
  // the only way to get one is calling 
 private:
  friend GetTokenValue;
  friend MakePaginationToken; 

  explicit PaginationToken(std::string value)

  std::string value_;
};

struct PaginationState : public ComplexOption<PaginationState, std::string> {
  // look at any of the other complex options for details
} ;
}

Then we would need to add this as a valid option to BucketsRequest:

https://github.com/googleapis/google-cloud-cpp/blob/2214b8c599f4624d5ed0a203d9b4a813ff8f2e96/google/cloud/storage/internal/bucket_requests.h#L37-L39

And use that option to initialize token_ (if the option is present).

We also need to create the CheckPoint() function. Maybe here:

https://github.com/googleapis/google-cloud-cpp/blob/2214b8c599f4624d5ed0a203d9b4a813ff8f2e96/google/cloud/storage/list_buckets_reader.h#L28-L30

defined as:

PaginationToken Checkpoint(ListBucketsReader&& reader); // will require move, this is a good thing.

Which may require adding another function (in the internal namespace) to extract the token value from:

https://github.com/googleapis/google-cloud-cpp/blob/2214b8c599f4624d5ed0a203d9b4a813ff8f2e96/google/cloud/internal/pagination_range.h#L66-L67

Let me know if you have any questions.

Jul 02 '21 13:07 coryan

I studied the code and already got few points (I hope so...) As I understood -- we can get pagination from reader, like ListBucketReader but we also need to set ListBucketsRequest to get a response and a token. Also we need to set ListBucketsResponse for page extractor (get items from response). Fix me, please, if I am wrong

So, are you going to incapsulate token in PaginationToken? Could you also say about checkpoint method? Does it snap a reader?

P.S. Please, tell me, if there is some doc that I can read and get info and not ply you with many question :)

Jul 02 '21 22:07 Ayrtat

I studied the code and already got few points (I hope so...)

Great.

As I understood -- we can get pagination from reader, like ListBucketReader but we also need to set ListBucketsRequest to get a response and a token. Also we need to set ListBucketsResponse for page extractor (get items from response). Fix me, please, if I am wrong

I think your reading is correct.

So, are you going to incapsulate token in PaginationToken?

That is the plan.

Could you also say about checkpoint method? Does it snap a reader?

I think it will be something like this:

PaginationToken Checkpoint(ListBucketsReader&& reader) {
  return PaginationToken(std::move(reader).CurrentToken());
}

But as I write this I realize there are some boundary conditions we need to deal with:

What happens to the data already pre-loaded into the reader?
What happens when the reader has not fetched all the data? What is the correct token value.
Do we need this extra level of indirection? Why not make Checkpoint a member function of PagedStreamReader?

Some of these questions may seem unimportant to you, or not help you. I want to record them here for future reference anyway.

Let's tackle each in turn, because I think the are in order of difficulty:

What happens to the data already pre-loaded into the reader?

That suggests we need to change the return value to something like:

std::pair<PaginationToken, std::vector<BucketMetadata>> Checkpoint(ListBucketsReader&& reader);

this makes it impossible for the application to "forget" about any data already in the reader.

What happens when the reader has not fetched all the data? What is the correct token value.

That suggests we need to change PaginationToken to:

class PaginationToken {
 public:
  // This class should be opaque, application developers should not be able to create them, or access the internal state.
  // the only way to get one is calling 
 private:
  friend GetTokenValue;
  friend MakePaginationToken; 

  explicit PaginationToken(std::string value);
  explicit PaginationToken(abls::nullopt_t)

  abls::optional<std::string> value_;
};

That allows us to use an empty string as the marker for "no page loaded yet" (it always is in the google services), and to use an empty optional as the marker for "all pages loaded, there is no more data".

Do we need this extra level of indirection? Why not make Checkpoint a member function of PagedStreamReader?

I think the answer is "yes", but the rationale is a bit complicated:

We do not want to force users of the library to read the internal::PageStreamReader<> documentation, it is an internal class.
Maybe we do not want to expose checkpointing for all readers.

P.S. Please, tell me, if there is some doc that I can read and get info and not ply you with many question :)

This is a new feature. There is no design for a feature that does not exist, you and I are creating the design in this conversation.

PS: I apologize in advance if you know most, or some of the stuff I am explaining. I just don't know your level of expertise so I am assuming nothing. If you find that (a) I skipped some steps, please do ask for more details, or (b) if I am telling you stuff you know, please tell me to stop.

Jul 03 '21 13:07 coryan

@coryan The more you tell me the better it is. To be honest, I have no any expertise in this project but I would like to study this project and make some contribution. I am interested in it! I need some time to study it and soon I will have questions! :)

Jul 03 '21 15:07 Ayrtat

What would I like to propose:

Allow to StreamRange keeps polymorphic functional object, i.e. either std::function or object of class/sturcture. So, we need to introduce optional template argument:

template <typename T>
struct IdleFunctionalObject {
 // This shuts up the compiler
 auto operator()() { return T{}; }
};

// Defined below.
template <typename T, typename FunctorT = IdleFunctionalObject<T>>
class StreamRange;

We can create a range in two ways:

template <typename U>
friend StreamRange<U> internal::MakeStreamRange(internal::StreamReader<U>);

template <typename U, typename FunctorT>
friend StreamRange<U, FunctorT> internal::MakeStreamRangeState(FunctorT reader);

Make reader_ within StreamRange polymorphic

  explicit StreamRange(internal::StreamReader<T> reader)
      : var_reader_(reader) {
    Next();
  }
  explicit StreamRange(FunctorT state_reader_)
      : var_reader_(state_reader_){
    Next();
  }
  // experemintal. var_reader_ can be std::function-packed functor or object of a class/structure
  absl::variant<internal::StreamReader<T>, FunctorT> var_reader_;

Invoke operator()() from the polymorphic reader:

auto v = (absl::holds_alternative<internal::StreamReader<T>>(var_reader_)
                  ? absl::get<internal::StreamReader<T>>(var_reader_)
                  : absl::get<FunctorT>(var_reader_))();

    // auto v = reader_ ? reader_() : state_reader_();
    absl::visit(UnpackVariant{*this}, std::move(v));

The example of stative functor:

template <typename ReaderT>
struct StativeReader {
  StativeReader(ReaderT& reader) : reader_(reader) {}

  ReaderT& reader_;
  auto operator()() { return reader_.GetNext(); }
};

instead

[reader]() mutable { return reader->GetNext(); }

This allows to us:

return MakeStreamRangeState<ValueType>(StativeReader<ReaderType>(*reader));

How can we retrieve reader's state from range?

// experimental 
absl::optional<typename FunctorT::reader_type> tryGetReaderState() {
    return absl::holds_alternative<internal::StreamReader<T>>(var_reader_)
               ? std::nullopt
               : absl::get<FunctorT>(var_reader_).reader_;
  }

So, here we are

template <typename ReaderT>
PaginationToken checkpoint(ReaderT&& reader) {
  /*static_assert(does_type_support_get_current_token_method_v<ReaderT>,
                "Reader does not support getCurrentToken method");*/

  return PaginationToken::create(
      std::move(reader).tryGetReaderState().getCurrentToken());
}

I have already carried some experiments and got some results but I am concerned about design. @coryan, please, can you review this idea and blame/fix me. I hope this idea can be useful!

Jul 04 '21 12:07 Ayrtat

It seems to be that it would be easier to just not use StreamReader<>, and have separate implementations for paginated vs. non-paginated ranges. Then we can add the additional accessors for the pagination token to the paginated readers.

That is, instead of this:

https://github.com/googleapis/google-cloud-cpp/blob/2214b8c599f4624d5ed0a203d9b4a813ff8f2e96/google/cloud/internal/pagination_range.h#L48-L49

We would write something like:

template <typename T>
class PaginationRange { // no longer related to `StreamRange<T>` !!
 public:
   iterator begin() { ... }
   iterator end() { ... }

   std::pair<PaginationToken, std::vector<T>> Checkpoint() && { ... }
};

I think this would be a breaking change, but I would like to hear @devjgm's opinion.

Jul 07 '21 14:07 coryan

@Ayrtat Are you still working on this?

Aug 29 '21 05:08 ghost

@Ayrtat Are you still working on this?

Hi! No, I am not. You can take this take if you would like to :)

Aug 29 '21 17:08 Ayrtat

Hello @coryan , Is the PR done for this issue? Otherwise, I would like to work on this issue.

Sep 12 '21 05:09 AditiThirdEye

There is no PR for this, though @remyabel expressed an interest.

Sep 14 '21 14:09 coryan

I don't mind if someone else was working on it. Just keeping it on the table incase I get around to it and didn't want to duplicate effort.

Sep 14 '21 14:09 ghost

Yes @remyabel , I would start working on this, and update here with progress.

Sep 14 '21 14:09 AditiThirdEye

This may be easier once I introduce a StorageConnection to implement mocks.

Aug 18 '22 18:08 coryan

We do not have time to work on this, closing.

Jan 25 '23 19:01 coryan