Implement StreamingReader as a ReaderInterface to directly stream data from a valid URL.
References
Description
This PR implements a ReaderInterface that directly streams data from a valid URL using libcurl. It can be used to stream data from a pre-signed S3 URL. It also adds a simple unit test to stream a file from the CLP GitHub repo and compare it with the local file.
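For readers unfamiliar with how libcurl hands downloaded bytes to the caller, here is a minimal sketch of a write callback. It is not the code in this PR: `DownloadSink` and `sink_write_callback` are hypothetical names, and the sink simply appends to a vector instead of filling pooled buffers. Only the callback signature comes from libcurl's documented `CURLOPT_WRITEFUNCTION` contract.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical sink that collects downloaded bytes; not part of CLP.
struct DownloadSink {
    std::vector<char> data;
};

// Has the signature libcurl expects for CURLOPT_WRITEFUNCTION:
//   size_t callback(char* ptr, size_t size, size_t nmemb, void* userdata)
// libcurl documents `size` as always 1, so `nmemb` is the number of bytes delivered.
// Returning a value different from the number of bytes received signals an error.
size_t sink_write_callback(char* ptr, size_t size, size_t nmemb, void* userdata) {
    assert(nullptr != userdata);
    auto* sink = static_cast<DownloadSink*>(userdata);
    size_t const num_bytes = size * nmemb;
    sink->data.insert(sink->data.end(), ptr, ptr + num_bytes);
    return num_bytes;
}

// With libcurl available, the callback would be registered roughly like this:
//   curl_easy_setopt(handle, CURLOPT_WRITEFUNCTION, sink_write_callback);
//   curl_easy_setopt(handle, CURLOPT_WRITEDATA, &sink);
```

The callback can be exercised without a network connection by calling it directly with a byte range, which is also a convenient way to unit test the copy logic in isolation.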
Reader Implementation
The reader maintains a buffer pool in which each buffer has a fixed, user-configurable size. The pool is organized as a large ring buffer, and data fetching (downloading) and data reading follow the producer-consumer model. The reader first creates a daemon thread that downloads data into a free buffer in the pool. Each fully fetched buffer is pushed into a fetched-buffer queue, from which the reader consumes it. Most of the time, the fetcher works on one buffer while the reader reads from another, already fetched one. Synchronization happens only when a buffer is fully fetched or fully consumed.
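The scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `RingBufferPool` and `stream_through_pool` are hypothetical names, the "download" is a copy from an in-memory string, a single fetcher and a single reader are assumed, and the fetcher thread is joined rather than detached as a daemon. The lock is taken only when a buffer changes hands; filling and reading a buffer happen outside of it.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstring>
#include <memory>
#include <mutex>
#include <optional>
#include <string>
#include <string_view>
#include <thread>
#include <vector>

// Hypothetical ring of fixed-size buffers shared by one fetcher and one reader.
class RingBufferPool {
public:
    RingBufferPool(size_t num_buffers, size_t buffer_size) : m_sizes(num_buffers, 0) {
        assert(num_buffers > 0 && buffer_size > 0);
        for (size_t i = 0; i < num_buffers; ++i) {
            m_buffers.emplace_back(std::make_unique<char[]>(buffer_size));
        }
    }

    // Fetcher: block until a buffer is free; the caller fills it without holding the lock.
    char* acquire_free() {
        std::unique_lock<std::mutex> lock{m_mutex};
        m_cv.wait(lock, [this] { return m_num_fetched < m_buffers.size(); });
        return m_buffers[m_fetch_idx].get();
    }

    // Fetcher: publish the filled buffer to the reader and advance around the ring.
    void commit_fetched(size_t num_bytes) {
        {
            std::lock_guard<std::mutex> lock{m_mutex};
            m_sizes[m_fetch_idx] = num_bytes;
            m_fetch_idx = (m_fetch_idx + 1) % m_buffers.size();
            ++m_num_fetched;
        }
        m_cv.notify_all();
    }

    // Fetcher: signal that no more data will arrive.
    void finish() {
        {
            std::lock_guard<std::mutex> lock{m_mutex};
            m_done = true;
        }
        m_cv.notify_all();
    }

    // Reader: block until a fetched buffer exists; an empty optional means end of stream.
    std::optional<std::string_view> acquire_fetched() {
        std::unique_lock<std::mutex> lock{m_mutex};
        m_cv.wait(lock, [this] { return m_num_fetched > 0 || m_done; });
        if (0 == m_num_fetched) {
            return std::nullopt;
        }
        return std::string_view{m_buffers[m_read_idx].get(), m_sizes[m_read_idx]};
    }

    // Reader: hand the consumed buffer back to the fetcher.
    void release_consumed() {
        {
            std::lock_guard<std::mutex> lock{m_mutex};
            m_read_idx = (m_read_idx + 1) % m_buffers.size();
            --m_num_fetched;
        }
        m_cv.notify_all();
    }

private:
    std::vector<std::unique_ptr<char[]>> m_buffers;
    std::vector<size_t> m_sizes;
    size_t m_fetch_idx{0};
    size_t m_read_idx{0};
    size_t m_num_fetched{0};
    bool m_done{false};
    std::mutex m_mutex;
    std::condition_variable m_cv;
};

// Copies `source` through the pool; the in-memory copy stands in for a download.
std::string stream_through_pool(std::string const& source, size_t num_buffers, size_t buffer_size) {
    RingBufferPool pool{num_buffers, buffer_size};
    std::thread fetcher{[&] {
        for (size_t pos = 0; pos < source.size(); pos += buffer_size) {
            size_t const num_bytes = std::min(buffer_size, source.size() - pos);
            std::memcpy(pool.acquire_free(), source.data() + pos, num_bytes);
            pool.commit_fetched(num_bytes);
        }
        pool.finish();
    }};
    std::string result;
    while (auto view = pool.acquire_fetched()) {
        result.append(*view);
        pool.release_consumed();
    }
    fetcher.join();
    return result;
}
```

Because a buffer counts as fetched until the reader releases it, the fetcher can never be handed the buffer that is currently being read, even when the pool holds a single buffer.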
Dependency Requirements
The reader relies on libcurl to download data. Therefore, we need to install libcurl in advance:
- Ubuntu: we can rely on apt install
- macOS: libcurl is installed by default
- CentOS: The default libcurl installed by yum is too old and doesn't match the latest API. The latest release (source code) on CentOS is 7.76.1 (https://curl.se/download.html), so we will download it and build it from source. Note that libcurl relies on openssl, so we need to install openssl using yum.
Validation performed
- Applied suggestions from clang-tidy using the same configuration as the clp-ffi-py library. One suggestion is skipped: it suggests using std::array where I'm calling make_unique<char[]>(size), which is not feasible since the size is only known at run time. Using std::vector also doesn't seem appropriate here, IMO.
- Added simple unit tests to stream data from the CLP GitHub repo.
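To illustrate why the skipped clang-tidy suggestion does not apply, here is a small sketch of allocating a buffer whose size is only known at run time. `allocate_buffer` is a hypothetical helper used for illustration, not a function from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical helper: allocates a buffer whose size is chosen at run time.
std::unique_ptr<char[]> allocate_buffer(size_t buffer_size) {
    assert(buffer_size > 0);
    // std::array<char, N> requires N to be a compile-time constant, so it cannot hold a
    // user-configured size. make_unique<char[]> value-initializes (zero-fills) the array.
    return std::make_unique<char[]>(buffer_size);
}
```

The returned `unique_ptr` owns the array and frees it with `delete[]` automatically, so no manual cleanup is needed when a buffer pool is destroyed.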