Implement StreamingReader as a ReaderInterface to directly stream data from a valid URL.

Open: LinZhihao-723 opened this issue 1 year ago • 1 comment

References

Description

This PR implements a ReaderInterface that streams data directly from a valid URL using libcurl. It can be used to stream data from a pre-signed S3 URL. It also adds a simple unit test that streams a file from the CLP GitHub repo and compares it with the local copy of that file.

Reader Implementation

The reader maintains a buffer pool. Each buffer in the pool has a fixed size, which users can configure. The pool acts as a large ring buffer, and fetching (downloading) and reading follow the producer-consumer model. The reader first creates a daemon thread that downloads data into a free buffer from the pool. Each fully fetched buffer is pushed onto a fetched-buffer queue, and the reader consumes fetched buffers from that queue. Most of the time, the fetcher works on one buffer while the reader reads from a different, already fetched buffer. Synchronization happens only when a buffer has been fully fetched or fully consumed.
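The handoff described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `BufferPoolSketch`, `FetchedBuffer`, and `stream_through_pool` are hypothetical names, the pool is tracked with two index queues instead of a true ring buffer, and the "download" is a copy from an in-memory string standing in for the libcurl transfer.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstring>
#include <memory>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Hypothetical names throughout; this is not the PR's StreamingReader.
struct FetchedBuffer {
    std::size_t index{0};      // which pool buffer holds the data
    std::size_t num_bytes{0};  // how many bytes of that buffer are valid
};

class BufferPoolSketch {
public:
    BufferPoolSketch(std::size_t num_buffers, std::size_t buffer_size)
            : m_buffer_size{buffer_size} {
        assert(num_buffers > 0 && buffer_size > 0);
        for (std::size_t i{0}; i < num_buffers; ++i) {
            m_storage.push_back(std::make_unique<char[]>(buffer_size));
            m_free.push(i);
        }
    }

    // Fetcher side: block until some buffer is free, then claim it.
    auto acquire_free() -> std::size_t {
        std::unique_lock<std::mutex> lock{m_mutex};
        m_free_cv.wait(lock, [this] { return false == m_free.empty(); });
        auto const index = m_free.front();
        m_free.pop();
        return index;
    }

    // Fetcher side: hand a filled buffer over to the reader.
    auto publish(FetchedBuffer buffer) -> void {
        {
            std::lock_guard<std::mutex> lock{m_mutex};
            m_fetched.push(buffer);
        }
        m_fetched_cv.notify_one();
    }

    // Fetcher side: signal that no more data will arrive.
    auto finish() -> void {
        {
            std::lock_guard<std::mutex> lock{m_mutex};
            m_done = true;
        }
        m_fetched_cv.notify_all();
    }

    // Reader side: returns false once the fetcher is done and the queue is drained.
    auto acquire_fetched(FetchedBuffer& out) -> bool {
        std::unique_lock<std::mutex> lock{m_mutex};
        m_fetched_cv.wait(lock, [this] { return false == m_fetched.empty() || m_done; });
        if (m_fetched.empty()) {
            return false;
        }
        out = m_fetched.front();
        m_fetched.pop();
        return true;
    }

    // Reader side: return a fully consumed buffer to the free list.
    auto release(std::size_t index) -> void {
        {
            std::lock_guard<std::mutex> lock{m_mutex};
            m_free.push(index);
        }
        m_free_cv.notify_one();
    }

    auto data(std::size_t index) -> char* { return m_storage[index].get(); }
    auto buffer_size() const -> std::size_t { return m_buffer_size; }

private:
    std::size_t m_buffer_size;
    std::vector<std::unique_ptr<char[]>> m_storage;
    std::queue<std::size_t> m_free;
    std::queue<FetchedBuffer> m_fetched;
    std::mutex m_mutex;
    std::condition_variable m_free_cv;
    std::condition_variable m_fetched_cv;
    bool m_done{false};
};

// Copies `source` through the pool: one thread fetches, the caller reads.
auto stream_through_pool(std::string const& source, std::size_t num_buffers,
                         std::size_t buffer_size) -> std::string {
    BufferPoolSketch pool{num_buffers, buffer_size};
    std::thread fetcher{[&pool, &source]() {
        std::size_t offset{0};
        while (offset < source.size()) {
            auto const index = pool.acquire_free();
            auto const n = std::min(pool.buffer_size(), source.size() - offset);
            std::memcpy(pool.data(index), source.data() + offset, n);
            pool.publish({index, n});
            offset += n;
        }
        pool.finish();
    }};
    std::string result;
    FetchedBuffer fetched;
    while (pool.acquire_fetched(fetched)) {
        result.append(pool.data(fetched.index), fetched.num_bytes);
        pool.release(fetched.index);
    }
    fetcher.join();
    return result;
}
```

Calling `stream_through_pool(text, 2, 4)` moves `text` through two 4-byte buffers and returns an identical string. The mutex is taken only when a buffer changes hands, which mirrors the point that synchronization happens only once a buffer is fully fetched or fully consumed. The sketch joins the fetcher thread rather than running it as a daemon, purely to keep the example deterministic.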

Dependency Requirements

The reader relies on libcurl to download data. Therefore, we need to install libcurl in advance:

  1. Ubuntu: libcurl can be installed with apt install.
  2. macOS: libcurl is installed by default.
  3. CentOS: The default libcurl installed by yum is too old and doesn't match the latest API. The latest release (source code) on CentOS is 7.76.1 (https://curl.se/download.html), so we download this release and build it from source. Note that libcurl depends on openssl, so openssl needs to be installed using yum.

Validation performed

  1. Applied suggestions from clang-tidy using the same configuration as the clp-ffi-py library. One suggestion is skipped: it suggests using std::array where I'm calling make_unique<char[]>(size), which is not feasible since the size is known only at run time. Using std::vector there is also not appropriate, IMO.
  2. Added simple unit tests that stream data from the CLP GitHub repo.
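To illustrate why the clang-tidy suggestion in item 1 cannot be applied, here is a small sketch. `copy_into_runtime_buffer` is a hypothetical helper and not a function from this PR; it only shows that `std::make_unique<char[]>` accepts a size determined at run time, whereas the size of a `std::array` is a template argument and must be a compile-time constant.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>

// Hypothetical helper for illustration only.
auto copy_into_runtime_buffer(std::string const& payload) -> std::string {
    // The size depends on the argument, so it is only known at run time.
    std::size_t const size{payload.size()};

    // std::array<char, size> would not compile here: `size` is not a constant expression.
    // make_unique<char[]> allocates (and zero-initializes) `size` bytes on the heap.
    auto buffer = std::make_unique<char[]>(size);

    std::memcpy(buffer.get(), payload.data(), size);
    return std::string(buffer.get(), size);
}
```

The helper simply round-trips its input through the heap buffer, so `copy_into_runtime_buffer("abc")` returns `"abc"`.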

LinZhihao-723 • Feb 12 '24 15:02