bentools-etl
bentools-etl copied to clipboard
PHP ETL (Extract / Transform / Load) library with SOLID principles + almost no dependency.
Okay, so you heard about the Extract / Transform / Load pattern and you're looking for a PHP library to do the stuff.
Alright, let's go!
Installation
composer require bentools/etl:^3.0@alpha
Warning: version 3.0 is a complete rewrite and a involves important BC breaks. Don't upgrade from ^2.0
unless you know what you're doing!
Usage
To sum up, you will apply transformations onto an iterable
of any things in order to load them in some place.
Sometimes your iterable
is ready to go, sometimes you just don't need to perform transformations, but anyway you need to load that data somewhere.
Let's start with a really simple example:
use BenTools\ETL\EtlBuilder;
use BenTools\ETL\Loader\JsonFileLoader;
$data = [
'foo',
'bar',
];
$etl = EtlBuilder::init()
->loadInto(JsonFileLoader::toFile(__DIR__.'/data.json'))
->createEtl();
$etl->process($data);
Basically you just loaded the string ["foo","bar"]
into data.json
. Yay!
Now let's apply a basic uppercase transformation:
use BenTools\ETL\EtlBuilder;
use BenTools\ETL\Loader\JsonFileLoader;
$data = [
'foo',
'bar',
];
$etl = EtlBuilder::init()
->transformWith(new CallableTransformer('strtoupper'))
->loadInto(JsonFileLoader::factory())
->createEtl();
$etl->process($data, __DIR__.'/data.json'); // You can also set the output file when processing :-)
Didn't you just write the string ["FOO","BAR"]
into data.json
? Yes, you did!
Okay, but what if your source data is not an iterable (yet)? It can be a CSV file or a CSV string, for instance. Here's another example:
use BenTools\ETL\EtlBuilder;
use BenTools\ETL\Extractor\CsvExtractor;
use BenTools\ETL\Loader\JsonFileLoader;
$data = <<<CSV
country_code,country_name,president
US,USA,"Donald Trump"
RU,Russia,"Vladimir Putin"
CSV;
$etl = EtlBuilder::init()
->extractFrom(new CsvExtractor())
->loadInto(JsonFileLoader::factory(['json_options' => \JSON_PRETTY_PRINT]))
->createEtl();
$etl->process($data, __DIR__.'/data.json');
As you guessed, the following content was just written into presidents.json
:
[
{
"country_code": "US",
"country_name": "USA",
"president": "Donald Trump"
},
{
"country_code": "RU",
"country_name": "Russia",
"president": "Vladimir Putin"
}
]
We provide helpful extractors and loaders to manipulate JSON, CSV, text, and you'll also find a DoctrineORMLoader
for when your transformer yields Doctrine entities.
Because yes, a transformer must return a \Generator
. Why? Because a single extracted item can lead to several output items. Let's take a more sophisticated example:
use BenTools\ETL\EtlBuilder;
use BenTools\ETL\Extractor\JsonExtractor;
$pdo = new \PDO('mysql:host=localhost;dbname=test');
$input = __DIR__.'/presidents.json';
$etl = EtlBuilder::init()
->extractFrom(new JsonExtractor())
->transformWith(
function ($item) use ($pdo) {
$stmt = $pdo->prepare('SELECT country_code FROM countries WHERE country_code = ?');
$stmt->bindValue(1, $item['country_code'], \PDO::PARAM_STR);
$stmt->execute();
if (0 === $stmt->rowCount()) {
yield ['INSERT INTO countries (country_code, country_name) VALUES (?, ?)', [$item['country_code'], $item['country_name']]];
}
yield ['REPLACE INTO presidents (country_code, president_name) VALUES (?, ?)', [$item['country_code'], $item['president']]];
}
)
->loadInto(
$loader = function (\Generator $queries) use ($pdo) {
foreach ($queries as $query) {
list($sql, $params) = $query;
$stmt = $pdo->prepare($sql);
foreach ($params as $i => $value) {
$stmt->bindValue($i + 1, $value);
}
$stmt->execute();
}
}
)
->createEtl();
$etl->process(__DIR__.'/presidents.json'); // The JsonExtractor will convert that file to a PHP array
As you can see, from a single item, we loaded up to 2 queries.
Your extractors, transformers and loaders can implement ExtractorInterface
, TransformerInterface
or LoaderInterface
as well as being simple callables
.
Skipping items
Each extractor / transformer / loader callback gets the current Etl
object injected in their arguments.
This allows you to ask the ETL to skip an item, or even to stop the whole process:
use BenTools\ETL\Etl;
use BenTools\ETL\EtlBuilder;
use BenTools\ETL\Transformer\CallableTransformer;
$fruits = [
'apple',
'banana',
'strawberry',
'pineapple',
'pear',
];
$storage = [];
$etl = EtlBuilder::init()
->transformWith(new CallableTransformer('strtoupper'))
->loadInto(
function ($generated, $key, Etl $etl) use (&$storage) {
foreach ($generated as $fruit) {
if ('BANANA' === $fruit) {
$etl->skipCurrentItem();
break;
}
if ('PINEAPPLE' === $fruit) {
$etl->stopProcessing();
break;
}
$storage[] = $fruit;
}
})
->createEtl();
$etl->process($fruits);
var_dump($storage); // ['APPLE', 'STRAWBERRY']
Events
Now you're wondering how you can hook on the ETL lifecycle, to log things, handle exceptions, ... This library ships with a built-in Event Dispatcher that you can leverage when:
- The ETL starts
- An item has been extracted
- The extraction failed
- An item has been transformed
- Transformation failed
- Loader is initialized (1st item is about to be loaded)
- An item has been loaded
- Loading failed
- An item has been skipped
- The ETL was stopped
- A flush operation was completed
- A rollback operation was completed
- The ETL completed the whole process.
The ItemEvents (on extract, transform, load) will allow you to mark the current item to be skipped, or even handle runtime exceptions. Let's take another example:
use BenTools\ETL\EtlBuilder;
use BenTools\ETL\EventDispatcher\Event\ItemExceptionEvent;
$fruits = [
'apple',
new \RuntimeException('Is tomato a fruit?'),
'banana',
];
$storage = [];
$etl = EtlBuilder::init()
->transformWith(
function ($item, $key) {
if ($item instanceof \Exception) {
throw $item;
}
yield $key => $item;
})
->loadInto(
function (iterable $transformed) use (&$storage) {
foreach ($transformed as $fruit) {
$storage[] = $fruit;
}
})
->onTransformException(
function (ItemExceptionEvent $event) {
echo $event->getException()->getMessage(); // Is tomato a fruit?
$event->ignoreException();
})
->createEtl();
$etl->process($fruits);
var_dump($storage); // ['apple', 'banana']
Here, we intentionnally threw an exception during the transform operation. But thanks to the event dispatcher, we could tell the ETL this exception can be safely ignored and it can pursue the rest of the process.
You can attach as many event listeners as you wish, and sort them by priority.
Recipes
A recipe is an ETL pattern that can be reused through different tasks. If you want to log everything that goes through an ETL for example, use our built-in Logger recipe:
use BenTools\ETL\EtlBuilder;
use BenTools\ETL\Recipe\LoggerRecipe;
$etl = EtlBuilder::init()
->useRecipe(new LoggerRecipe($logger))
->createEtl();
You can also create your own recipes:
use BenTools\ETL\EtlBuilder;
use BenTools\ETL\EventDispatcher\Event\ItemEvent;
use BenTools\ETL\Extractor\JsonExtractor;
use BenTools\ETL\Loader\CsvFileLoader;
use BenTools\ETL\Recipe\Recipe;
class JSONToCSVRecipe extends Recipe
{
/**
* @inheritDoc
*/
public function updateBuilder(EtlBuilder $builder): EtlBuilder
{
return $builder
->extractFrom(new JsonExtractor())
->loadInto($loader = CsvFileLoader::factory(['delimiter' => ';']))
->onLoaderInit(
function (ItemEvent $event) use ($loader) {
$loader::factory(['keys' => array_keys($event->getItem())], $loader);
})
;
}
}
$builder = EtlBuilder::init()->useRecipe(new JSONToCSVRecipe());
$etl = $builder->createEtl();
$etl->process(__DIR__.'/presidents.json', __DIR__.'/presidents.csv');
The above example will result in presidents.csv
containing:
country_code;country_name;president
US;USA;"Donald Trump"
RU;Russia;"Vladimir Putin"
To sum up, a recipe is a kind of an ETLBuilder
factory, but keep in mind that a recipe will only add event listeners to the existing builder but can also replace the builder's extractor, transformer and/or loader.
Tests
./vendor/bin/phpunit
License
MIT