Implement StreamingReader as a ReaderInterface to directly stream data from a valid URL.
References
Description
This PR implements a ReaderInterface that directly streams data from a valid URL using libcurl. It can be used to stream data from a pre-signed S3 URL. It also adds a simple unit test to stream a file from the CLP GitHub repo and compare it with the local file.
Reader Implementation
The reader maintains a buffer pool. Each buffer in the buffer pool has a fixed size (configurable by users). The buffer pool is a large ring buffer; the overall data fetching (downloading) and data reading follows the producer-consumer model. The reader will first create a daemon thread to download the data into a free buffer in the buffer pool. The fetched buffers will be pushed into a fetched buffer queue, and readers will consume fetched buffers from the queue. Most of the time, the fetcher should work on one fetching buffer, and the reader should read from a fetched buffer. The synchronization happens only when a buffer is fully fetched or fully consumed.
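The producer-consumer flow described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `BufferPool`, `stream_through_pool`, and all member names are hypothetical, an in-memory string stands in for the libcurl download, and two queues are used in place of ring-buffer indices. Locks are taken only when a buffer changes hands, which matches the synchronization points described above.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstring>
#include <memory>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical pool of fixed-size buffers shared by one fetcher and one reader.
class BufferPool {
public:
    using Filled = std::pair<char*, std::size_t>;  // buffer and number of valid bytes

    BufferPool(std::size_t num_buffers, std::size_t buffer_size) : m_buffer_size{buffer_size} {
        assert(num_buffers > 0 && buffer_size > 0);
        for (std::size_t i = 0; i < num_buffers; ++i) {
            m_storage.push_back(std::make_unique<char[]>(buffer_size));
            m_free.push(m_storage.back().get());
        }
    }

    std::size_t buffer_size() const { return m_buffer_size; }

    // Fetcher: block until a free buffer is available.
    char* acquire_free() {
        std::unique_lock<std::mutex> lock{m_mutex};
        m_cv.wait(lock, [this] { return !m_free.empty(); });
        char* buf = m_free.front();
        m_free.pop();
        return buf;
    }

    // Fetcher: publish a buffer holding `size` downloaded bytes.
    void push_fetched(char* buf, std::size_t size) {
        {
            std::lock_guard<std::mutex> lock{m_mutex};
            m_fetched.push({buf, size});
        }
        m_cv.notify_all();
    }

    // Fetcher: signal that no more data will arrive.
    void finish() {
        {
            std::lock_guard<std::mutex> lock{m_mutex};
            m_done = true;
        }
        m_cv.notify_all();
    }

    // Reader: block for the next fetched buffer; returns false at end of stream.
    bool pop_fetched(Filled& out) {
        std::unique_lock<std::mutex> lock{m_mutex};
        m_cv.wait(lock, [this] { return !m_fetched.empty() || m_done; });
        if (m_fetched.empty()) {
            return false;
        }
        out = m_fetched.front();
        m_fetched.pop();
        return true;
    }

    // Reader: return a fully consumed buffer to the pool.
    void release(char* buf) {
        {
            std::lock_guard<std::mutex> lock{m_mutex};
            m_free.push(buf);
        }
        m_cv.notify_all();
    }

private:
    std::size_t m_buffer_size;
    std::vector<std::unique_ptr<char[]>> m_storage;
    std::queue<char*> m_free;
    std::queue<Filled> m_fetched;
    std::mutex m_mutex;
    std::condition_variable m_cv;
    bool m_done{false};
};

// Copies `source` through the pool: a fetcher thread fills buffers, the caller drains them.
std::string stream_through_pool(
        std::string const& source,
        std::size_t num_buffers,
        std::size_t buffer_size
) {
    BufferPool pool{num_buffers, buffer_size};
    std::thread fetcher{[&] {
        for (std::size_t pos = 0; pos < source.size();) {
            char* buf = pool.acquire_free();
            std::size_t const n = std::min(pool.buffer_size(), source.size() - pos);
            std::memcpy(buf, source.data() + pos, n);
            pool.push_fetched(buf, n);
            pos += n;
        }
        pool.finish();
    }};

    std::string result;
    BufferPool::Filled filled;
    while (pool.pop_fetched(filled)) {
        result.append(filled.first, filled.second);
        pool.release(filled.first);
    }
    fetcher.join();
    return result;
}
```

Calling `stream_through_pool(data, 3, 4)` pushes `data` through three 4-byte buffers. With a single buffer the fetcher and reader simply alternate, so a larger pool mainly lets the download run ahead of a slow reader.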
Dependency Requirements
The reader relies on libcurl to download data. Therefore, we need to install libcurl in advance:
- Ubuntu: we can rely on `apt install`
- MacOS: `libcurl` is installed by default
- CentOS: The default `libcurl` installed by `yum` is too old and doesn't match the latest API. The latest release (source code) on CentOS is 7.76.1 (https://curl.se/download.html), so we will download it and build it from scratch. Note that `libcurl` relies on `openssl`, so we need to install `openssl` using `yum`.
Validation performed
- Applied suggestions from `clang-tidy` using the same configuration as the `clp-ffi-py` library. One suggestion is skipped: it suggests using `std::array` where I'm calling `make_unique<char[]>(size)`, which is not feasible since the size is only known at run time. And it's not proper to use `std::vector` IMO.
- Added simple unit tests to stream data from the CLP GitHub repo.
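To illustrate the skipped `clang-tidy` suggestion: `std::array` takes its size as a template argument, so it only fits buffers whose size is a compile-time constant. The sketch below uses hypothetical names (`cFixedSize`, `FixedBuffer`, `make_buffer`) and is not taken from the PR.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>

// std::array requires the size as a compile-time constant.
constexpr std::size_t cFixedSize = 16;
using FixedBuffer = std::array<char, cFixedSize>;

// Hypothetical helper: allocate a buffer whose size is only known at run time,
// e.g. a user-configured buffer size. make_unique<char[]> zero-initializes it.
std::unique_ptr<char[]> make_buffer(std::size_t size) {
    assert(size > 0);
    return std::make_unique<char[]>(size);
}
```

A call such as `make_buffer(user_configured_size)` compiles with any run-time value, whereas `std::array<char, user_configured_size>` would be rejected unless the size were a constant expression.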
Can you resolve the conflicts?
Made a few changes to make the code more maintainable:
- wrap curl using C++ classes to provide better error handling and cleaner object lifetime management
- using `clp::Thread` to wrap `std::thread` instead
Can you help review the changes above? @kirkrodrigues @wraymo
Have you considered moving the curl-related functions and classes into a separate file?
I think for now we don't have to since this is the only file using the curl-related classes. If we have such a need later, we can move them out.
How about:
core: Add NetworkReader to stream data from a URL. (#278)
Using `core: Add NetworkReader to stream data from a URL using libcurl.` instead, to indicate that we add libcurl.