Implement StreamingReader as a ReaderInterface to directly stream data from a valid URL.

Open LinZhihao-723 opened this issue 1 year ago • 1 comments

References

Description

This PR implements a ReaderInterface that streams data directly from a valid URL using libcurl, so it can be used to stream data from a pre-signed S3 URL. It also adds a simple unit test that streams a file from the CLP GitHub repo and compares it with the local copy.

Reader Implementation

The reader maintains a buffer pool in which every buffer has the same fixed size (configurable by users). The pool is used as a large ring buffer, and data fetching (downloading) and data reading follow the producer-consumer model. The reader first creates a daemon thread that downloads data into a free buffer from the pool. Each fetched buffer is pushed into a fetched-buffer queue, and the reader consumes fetched buffers from that queue. Most of the time, the fetcher works on one buffer while the reader reads from another, already fetched buffer. Synchronization happens only when a buffer has been fully fetched or fully consumed.
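The producer-consumer flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `BufferPool`, `Fetched`, and `stream_through_pool` are hypothetical names, the "download" is simulated by copying from an in-memory string instead of calling libcurl, and a zero-length buffer is assumed as the end-of-stream marker.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstring>
#include <memory>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Hypothetical sketch: a pool of fixed-size buffers shared by one fetcher and one reader.
struct Fetched {
    std::size_t idx;   // index of the buffer in the pool
    std::size_t size;  // number of valid bytes; 0 marks end of stream (assumption)
};

class BufferPool {
public:
    BufferPool(std::size_t num_buffers, std::size_t buffer_size) {
        assert(num_buffers > 0 && buffer_size > 0);
        for (std::size_t i = 0; i < num_buffers; ++i) {
            m_buffers.push_back(std::make_unique<char[]>(buffer_size));
            m_free.push(i);
        }
    }

    // Fetcher side: block until a free buffer exists, then take it.
    std::size_t acquire_free() {
        std::unique_lock<std::mutex> lock{m_mutex};
        m_cv_free.wait(lock, [this] { return !m_free.empty(); });
        std::size_t const idx = m_free.front();
        m_free.pop();
        return idx;
    }

    // Fetcher side: hand a fully fetched buffer over to the reader.
    void publish(std::size_t idx, std::size_t size) {
        {
            std::lock_guard<std::mutex> lock{m_mutex};
            m_fetched.push({idx, size});
        }
        m_cv_fetched.notify_one();
    }

    // Reader side: block until a fetched buffer exists, then take it.
    Fetched acquire_fetched() {
        std::unique_lock<std::mutex> lock{m_mutex};
        m_cv_fetched.wait(lock, [this] { return !m_fetched.empty(); });
        Fetched const fetched = m_fetched.front();
        m_fetched.pop();
        return fetched;
    }

    // Reader side: return a fully consumed buffer to the free list.
    void release(std::size_t idx) {
        {
            std::lock_guard<std::mutex> lock{m_mutex};
            m_free.push(idx);
        }
        m_cv_free.notify_one();
    }

    char* data(std::size_t idx) { return m_buffers[idx].get(); }

private:
    std::vector<std::unique_ptr<char[]>> m_buffers;
    std::queue<std::size_t> m_free;
    std::queue<Fetched> m_fetched;
    std::mutex m_mutex;
    std::condition_variable m_cv_free;
    std::condition_variable m_cv_fetched;
};

// Streams `source` through the pool and returns what the reader side received.
std::string stream_through_pool(
        std::string const& source, std::size_t num_buffers, std::size_t buffer_size) {
    BufferPool pool{num_buffers, buffer_size};
    std::thread fetcher{[&] {
        std::size_t pos = 0;
        while (true) {
            std::size_t const idx = pool.acquire_free();
            std::size_t const n = std::min(buffer_size, source.size() - pos);
            std::memcpy(pool.data(idx), source.data() + pos, n);  // stands in for a download
            pos += n;
            pool.publish(idx, n);
            if (0 == n) { break; }
        }
    }};
    std::string out;
    while (true) {
        Fetched const fetched = pool.acquire_fetched();
        out.append(pool.data(fetched.idx), fetched.size);
        pool.release(fetched.idx);
        if (0 == fetched.size) { break; }
    }
    fetcher.join();
    return out;
}
```

In this sketch the mutex is taken only when a buffer changes hands, which mirrors the point that the fetcher and the reader synchronize per buffer rather than per byte.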

Dependency Requirements

The reader relies on libcurl to download data, so libcurl must be installed in advance:

Ubuntu: libcurl can be installed with apt install.
MacOS: libcurl is installed by default.
CentOS: The default libcurl installed by yum is too old and doesn't match the latest API. The latest release (source code) on CentOS is 7.76.1 (https://curl.se/download.html), so we download it and build it from source. Note that libcurl depends on openssl, so we need to install openssl using yum.

Validation performed

Applied suggestions from clang-tidy using the same configuration as the clp-ffi-py library. One suggestion is skipped: it recommends std::array where I call make_unique<char[]>(size), which is not feasible because the size is only known at run time. Using std::vector there doesn't seem appropriate either, IMO.
Added a simple unit test that streams data from the CLP GitHub repo.
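To illustrate why the std::array suggestion doesn't fit, here is a small sketch contrasting a compile-time-sized buffer with a run-time-sized one. The names `FixedBuffer`, `make_read_buffer`, and `round_trip` are hypothetical and are not taken from this PR.

```cpp
#include <array>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>

// std::array needs its length as a compile-time constant; it is part of the type.
using FixedBuffer = std::array<char, 4096>;

// Hypothetical helper: the length is an ordinary run-time value, e.g. a
// user-configured buffer size, so the buffer is allocated on the heap.
std::unique_ptr<char[]> make_read_buffer(std::size_t size) {
    return std::make_unique<char[]>(size);
}

// Copies at most `size` bytes of `src` through a run-time-sized buffer.
std::string round_trip(std::string const& src, std::size_t size) {
    auto buffer = make_read_buffer(size);
    std::size_t const n = src.size() < size ? src.size() : size;
    std::memcpy(buffer.get(), src.data(), n);
    return std::string(buffer.get(), n);
}
```

Because `size` is a function argument here, there is no constant that could be passed as the second template parameter of std::array.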

Feb 12 '24 15:02 LinZhihao-723
