Provide a mechanism to save bucket pagination ranges
This is motivated by #5703. Applications may need to persist the state of pagination iterators, for example, because they are implementing a web service that returns pages of results.
I suggested this API:
struct Page {
std::vector<std::string> bucket_names;
std::string reader_state;
};
Page BucketsPage(gcs::Client client, std::string reader_state) {
auto reader = client.ListBuckets(gcs::RestoreBucketReader(
reader_state /* an empty string starts a new reader */));
std::vector<std::string> names;
for (auto& b : reader) {
if (!b || names.size() >= 100) break;
names.push_back(b->name());
}
return Page{std::move(names), std::move(reader).Checkpoint()};
}
I think that this problem applies to any pagination -> iterator mapping.
@coryan Hello! I would like to take this as first issue to get involved. Does anybody solve this problem? Can I take this?
@Ayrtat nobody is working on this at the moment. If you are interested, I recommend you make sure you can run unit tests and integration tests locally:
https://github.com/googleapis/google-cloud-cpp/blob/main/doc/contributor/howto-guide-running-ci-builds-locally.md
https://github.com/googleapis/google-cloud-cpp/blob/main/doc/contributor/howto-guide-setup-development-workstation.md
We can then discuss how the Checkpoint() and RestoreBucketRead() would look like.
@coryan Hello, sorry for late response! I followed the instructions and have run up CI tests:
100% tests passed, 0 tests failed out of 336
Very cool. I guess at this point we need to start figuring out how these functions would look like. Let me start by saying that my proposal is going to seem very complicated. The additional convolutions are to keep the public API as narrow as possible. Every function and type in the public API is a thing we need to support more-or-less forever, so we try to create as few as possible, sometimes jumping through some hoops to make it so.
Also, if you can think of a better design, please do speak up! I have no monopoly on good ideas. Just keep in mind the costs of public APIs.
I imagine we would start with something like:
#include "google/cloud/storage/internal/complex_options.h"
namespace google::cloud::storage { // C++17 for exposition-only
class PaginationToken;
namespace internal {
std::string GetTokenValue(PaginationToken const&);
PaginationToken MakePaginationToken(std::string value);
}
class PaginationToken {
public:
// This class should be opaque, application developers should not be able to create them, or access the internal state.
// the only way to get one is calling
private:
friend GetTokenValue;
friend MakePaginationToken;
explicit PaginationToken(std::string value)
std::string value_;
};
struct PaginationState : public ComplexOption<PaginationState, std::string> {
// look at any of the other complex options for details
} ;
}
Then we would need to add this as a valid option to BucketsRequest:
https://github.com/googleapis/google-cloud-cpp/blob/2214b8c599f4624d5ed0a203d9b4a813ff8f2e96/google/cloud/storage/internal/bucket_requests.h#L37-L39
And use that option to initialize token_ (if the option is present).
We also need to create the CheckPoint() function. Maybe here:
https://github.com/googleapis/google-cloud-cpp/blob/2214b8c599f4624d5ed0a203d9b4a813ff8f2e96/google/cloud/storage/list_buckets_reader.h#L28-L30
defined as:
PaginationToken Checkpoint(ListBucketsReader&& reader); // will require move, this is a good thing.
Which may require adding another function (in the internal namespace) to extract the token value from:
https://github.com/googleapis/google-cloud-cpp/blob/2214b8c599f4624d5ed0a203d9b4a813ff8f2e96/google/cloud/internal/pagination_range.h#L66-L67
Let me know if you have any questions.
I studied the code and already got few points (I hope so...)
As I understood -- we can get pagination from reader, like ListBucketReader but we also need to set ListBucketsRequest to get a response and a token. Also we need to set ListBucketsResponse for page extractor (get items from response). Fix me, please, if I am wrong
So, are you going to incapsulate token in PaginationToken?
Could you also say about checkpoint method? Does it snap a reader?
P.S. Please, tell me, if there is some doc that I can read and get info and not ply you with many question :)
I studied the code and already got few points (I hope so...)
Great.
As I understood -- we can get pagination from reader, like
ListBucketReaderbut we also need to setListBucketsRequestto get a response and a token. Also we need to setListBucketsResponsefor page extractor (get items from response). Fix me, please, if I am wrong
I think your reading is correct.
So, are you going to incapsulate
tokeninPaginationToken?
That is the plan.
Could you also say about
checkpointmethod? Does it snap a reader?
I think it will be something like this:
PaginationToken Checkpoint(ListBucketsReader&& reader) {
return PaginationToken(std::move(reader).CurrentToken());
}
But as I write this I realize there are some boundary conditions we need to deal with:
- What happens to the data already pre-loaded into the reader?
- What happens when the reader has not fetched all the data? What is the correct token value.
- Do we need this extra level of indirection? Why not make
Checkpointa member function ofPagedStreamReader?
Some of these questions may seem unimportant to you, or not help you. I want to record them here for future reference anyway.
Let's tackle each in turn, because I think the are in order of difficulty:
What happens to the data already pre-loaded into the reader?
That suggests we need to change the return value to something like:
std::pair<PaginationToken, std::vector<BucketMetadata>> Checkpoint(ListBucketsReader&& reader);
this makes it impossible for the application to "forget" about any data already in the reader.
What happens when the reader has not fetched all the data? What is the correct token value.
That suggests we need to change PaginationToken to:
class PaginationToken {
public:
// This class should be opaque, application developers should not be able to create them, or access the internal state.
// the only way to get one is calling
private:
friend GetTokenValue;
friend MakePaginationToken;
explicit PaginationToken(std::string value);
explicit PaginationToken(abls::nullopt_t)
abls::optional<std::string> value_;
};
That allows us to use an empty string as the marker for "no page loaded yet" (it always is in the google services), and to use an empty optional as the marker for "all pages loaded, there is no more data".
Do we need this extra level of indirection? Why not make Checkpoint a member function of PagedStreamReader?
I think the answer is "yes", but the rationale is a bit complicated:
- We do not want to force users of the library to read the
internal::PageStreamReader<>documentation, it is an internal class. - Maybe we do not want to expose checkpointing for all readers.
P.S. Please, tell me, if there is some doc that I can read and get info and not ply you with many question :)
This is a new feature. There is no design for a feature that does not exist, you and I are creating the design in this conversation.
PS: I apologize in advance if you know most, or some of the stuff I am explaining. I just don't know your level of expertise so I am assuming nothing. If you find that (a) I skipped some steps, please do ask for more details, or (b) if I am telling you stuff you know, please tell me to stop.
@coryan The more you tell me the better it is. To be honest, I have no any expertise in this project but I would like to study this project and make some contribution. I am interested in it! I need some time to study it and soon I will have questions! :)
What would I like to propose:
- Allow to
StreamRangekeeps polymorphic functional object, i.e. eitherstd::functionor object of class/sturcture. So, we need to introduce optional template argument:
template <typename T>
struct IdleFunctionalObject {
// This shuts up the compiler
auto operator()() { return T{}; }
};
// Defined below.
template <typename T, typename FunctorT = IdleFunctionalObject<T>>
class StreamRange;
- We can create a range in two ways:
template <typename U>
friend StreamRange<U> internal::MakeStreamRange(internal::StreamReader<U>);
template <typename U, typename FunctorT>
friend StreamRange<U, FunctorT> internal::MakeStreamRangeState(FunctorT reader);
- Make
reader_withinStreamRangepolymorphic
explicit StreamRange(internal::StreamReader<T> reader)
: var_reader_(reader) {
Next();
}
explicit StreamRange(FunctorT state_reader_)
: var_reader_(state_reader_){
Next();
}
// experemintal. var_reader_ can be std::function-packed functor or object of a class/structure
absl::variant<internal::StreamReader<T>, FunctorT> var_reader_;
- Invoke
operator()()from the polymorphic reader:
auto v = (absl::holds_alternative<internal::StreamReader<T>>(var_reader_)
? absl::get<internal::StreamReader<T>>(var_reader_)
: absl::get<FunctorT>(var_reader_))();
// auto v = reader_ ? reader_() : state_reader_();
absl::visit(UnpackVariant{*this}, std::move(v));
- The example of stative functor:
template <typename ReaderT>
struct StativeReader {
StativeReader(ReaderT& reader) : reader_(reader) {}
ReaderT& reader_;
auto operator()() { return reader_.GetNext(); }
};
instead
[reader]() mutable { return reader->GetNext(); }
- This allows to us:
return MakeStreamRangeState<ValueType>(StativeReader<ReaderType>(*reader));
- How can we retrieve reader's state from range?
// experimental
absl::optional<typename FunctorT::reader_type> tryGetReaderState() {
return absl::holds_alternative<internal::StreamReader<T>>(var_reader_)
? std::nullopt
: absl::get<FunctorT>(var_reader_).reader_;
}
- So, here we are
template <typename ReaderT>
PaginationToken checkpoint(ReaderT&& reader) {
/*static_assert(does_type_support_get_current_token_method_v<ReaderT>,
"Reader does not support getCurrentToken method");*/
return PaginationToken::create(
std::move(reader).tryGetReaderState().getCurrentToken());
}
I have already carried some experiments and got some results but I am concerned about design. @coryan, please, can you review this idea and blame/fix me. I hope this idea can be useful!
It seems to be that it would be easier to just not use StreamReader<>, and have separate implementations for paginated vs. non-paginated ranges. Then we can add the additional accessors for the pagination token to the paginated readers.
That is, instead of this:
https://github.com/googleapis/google-cloud-cpp/blob/2214b8c599f4624d5ed0a203d9b4a813ff8f2e96/google/cloud/internal/pagination_range.h#L48-L49
We would write something like:
template <typename T>
class PaginationRange { // no longer related to `StreamRange<T>` !!
public:
iterator begin() { ... }
iterator end() { ... }
std::pair<PaginationToken, std::vector<T>> Checkpoint() && { ... }
};
I think this would be a breaking change, but I would like to hear @devjgm's opinion.
@Ayrtat Are you still working on this?
@Ayrtat Are you still working on this?
Hi! No, I am not. You can take this take if you would like to :)
Hello @coryan , Is the PR done for this issue? Otherwise, I would like to work on this issue.
There is no PR for this, though @remyabel expressed an interest.
I don't mind if someone else was working on it. Just keeping it on the table incase I get around to it and didn't want to duplicate effort.
Yes @remyabel , I would start working on this, and update here with progress.
This may be easier once I introduce a StorageConnection to implement mocks.
We do not have time to work on this, closing.