PipelineDP
Basic tutorial notebook extended
Description
This PR extends and complements the basic tutorial on Apache Beam by adding thorough explanations and references covering the preparatory execution framework for the exercises, as well as the Apache Beam programming model and its main design components.
The notebook has been optimised for Google Colab, making extensive use of forms to hide tedious and irrelevant details (mostly related to the setup), as well as to hide hints and solutions for the exercise parts.
Conversely, all the coding bits considered relevant to the learning journey into `apache_beam` features have been intentionally left visible.
A list of the most relevant features included in this PR:
- The `Netflix-Prize` dataset is now directly downloaded from Kaggle using the official `kaggle` Python API (see the sketch after this list).
- The default size of the dataset is fixed to `10K` lines, but it can be easily customised.
- Data objects and custom `PTransform`s have been slightly optimised (in terms of Python code), as well as made compliant with the Python `multiprocessing` execution environment.
  - This particularly affects the use of custom `dataclass` objects (saved into a separate Python module) and the corresponding `beam.coders.Coder` implementation (see the sketch after this list).
- Execution of the `beam.Pipeline` is configured to exploit (by default) the `multiprocessing` Python environment, to leverage multiple cores when running this notebook on a larger version of the dataset and/or in an environment with more than the `2 CPUs` of the default Colab VM (see the sketch after this list).
- A few changes have been applied to the formatting of the exercise text.
- The code in the provided solutions for the exercises has been adapted to the new execution framework set up in the first part of the notebook.
- Very thorough documentation references and quite detailed explanations of the analysed dataset, and of all the `apache_beam` features and components used throughout the notebook.
- (Last but not least) The tutorial has been prepared to run smoothly both on a (local/standard) Jupyter notebook and on Google Colab.
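As a pointer for the Kaggle download step mentioned in the list above, here is a minimal sketch of how the dataset can be fetched with the official `kaggle` Python API; the dataset slug, credentials handling and target folder are illustrative assumptions, not necessarily what the notebook does.

```python
import os

# Kaggle credentials must be available before the API client is imported,
# e.g. via ~/.kaggle/kaggle.json or the environment variables below.
os.environ.setdefault("KAGGLE_USERNAME", "<your-username>")
os.environ.setdefault("KAGGLE_KEY", "<your-api-key>")

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()

# Hypothetical dataset slug and destination folder for the Netflix Prize data.
api.dataset_download_files("netflix-inc/netflix-prize-data",
                           path="data/netflix", unzip=True)
```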
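Similarly, for the `dataclass` / `beam.coders.Coder` point, the pattern looks roughly like the sketch below (field names are hypothetical, not necessarily those used in the notebook's data module):

```python
import dataclasses
import json

import apache_beam as beam


# Defined in an importable module (not inline in the notebook) so that the
# multiprocessing workers can unpickle it.
@dataclasses.dataclass
class ViewRecord:
    movie_id: int   # hypothetical fields standing in for the real ones
    user_id: int
    rating: int


class ViewRecordCoder(beam.coders.Coder):
    """Tells Beam how to (de)serialise ViewRecord elements so they can
    safely cross process boundaries in multi_processing mode."""

    def encode(self, value):
        return json.dumps(dataclasses.asdict(value)).encode("utf-8")

    def decode(self, encoded):
        return ViewRecord(**json.loads(encoded.decode("utf-8")))

    def is_deterministic(self):
        return True


# Use this coder for any PCollection whose elements are ViewRecord.
beam.coders.registry.register_coder(ViewRecord, ViewRecordCoder)
```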
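And for the `beam.Pipeline` execution defaults, the DirectRunner configuration described above boils down to something like this sketch (the concrete wiring inside the notebook's `run_pipeline` helper may differ):

```python
import multiprocessing

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# direct_running_mode can be "in_memory", "multi_threading" or
# "multi_processing"; here we spawn one worker per available CPU core.
options = PipelineOptions(
    runner="DirectRunner",
    direct_running_mode="multi_processing",
    direct_num_workers=multiprocessing.cpu_count(),
)

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | "Create" >> beam.Create(range(10))
     | "Square" >> beam.Map(lambda x: x * x)
     | "Print" >> beam.Map(print))
```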
Affected Dependencies
None - all the required packages are installed automatically in the notebook. This includes `apache_beam` and `kaggle` (which is already available in Colab, but needs configuring).
How has this been tested?
- The notebook has been tested on Google Colab, as well as on a local Jupyter notebook server on a MacBook Pro laptop with 8 cores.
- A few performance tests have also been carried out (on both the laptop and Colab) to compare the time required by the supported execution modes on increasing dataset sizes (from `10K` to `10M` lines). Reproducing those tests is quite easy thanks to the dataset-size selection provided as a dropdown list.
Wow, thanks for contributing Valerio! I've quickly checked, but I don't have much time now to make a thorough review. I'll get back to the review early next week.
Hi @dvadym, thanks a lot for the feedback: glad you appreciated it.
Following up on a quick chat I had on Slack with @chinmayshah99, I am also going to share here a few more details about some performance tests I ran on my laptop (MacBook Pro, 8 cores) on increasing dataset sizes (and different `direct_running_mode` settings).
On Colab, on the other hand, the different running modes do not make any difference, as the default VM offers only 2 cores (so there is no real gain from parallelising tasks).
However, I worked on the tutorial assuming that the notebook could be executed either on Colab or in a Jupyter session. Besides, I personally think it is a good bit to include in the tutorial for a framework like Apache Beam (at least, it was fun for me to explore the internals and the execution framework a little while working on the PR) :)
Looking forward to receiving your feedback 🙌
As promised, attached to this PR, I am also sharing the results of a few experiments I ran on my laptop, executing the notebook with the three `direct_running_mode` configurations and multiple dataset sizes.
Premise
Those experiments were originally motivated by my intention to find the combination of those parameters that leads to a reasonable running time on Colab. However, as already pointed out in my previous comment, supporting different `direct_running_mode` settings on Colab is quite pointless, as the (very limited) `2 cores` of the default VM instance yield no appreciable difference nor gain in performance.
However, the scenario is completely different when running the notebook in a (slightly) better-equipped computing environment. The gathered results reveal a quite interesting pattern IMHO, and they also motivate my re-implementation of the `run_pipeline` function to allow for customisations.
Experiments
Case 1: `1M`-lines dataset (~`1%` of total size)
The following picture shows the different running times for the first `count_all_views` function executed with the three `direct_running_mode` configurations.
This very first experiment shows some preliminary gain in performance of the `multiprocessing`-based approach vs the `single-process` (in-memory) approach.
Interestingly, the `multi_threading` mode takes even longer than `in_memory` - and that makes total sense, as there is presumably extra overhead in the communication between the multiple threads.
Case 2: `10M`-lines dataset (~`10%` of total size)
With a dataset size of `10M` lines the scenario changed a bit, revealing another interesting effect of the multiple execution modes.
The performance gap between the two execution modes becomes substantially more evident with a dataset that is `10x` bigger.
However, in the `multi-threaded` case the execution did not complete successfully, failing with a "too many pings" error in the `grpc` backend.
I tried to dig a little bit into the issue, and the only thing I could find was a (closed) issue on `grpc` #2444 specifically mentioning the Python client and `grpc 1.8`. In my environment, I am currently using `grpc 1.36.1`.
To my understanding, the issue is generated by some processing that has not yet completed, which makes the worker threads time out.
In more detail: from what I understood of the general execution framework in Apache Beam, the main worker process always spawns auxiliary threads (i.e. `pthreads` at the C level via `grpc`) to handle parallel (I/O?) operations on the parallel collection(s). This is true regardless of the selected `direct_running_mode`.
Digging a little bit into the `PipelineOptions`, I've discovered a `job-server-timeout` option which (supposedly) controls the timeout of those `job` worker threads - set to `60 secs` by default.
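For reference, this is roughly how that option can be raised from the notebook - a sketch only, assuming `job_server_timeout` is accepted as a keyword argument by `PipelineOptions` (the value to use depends on dataset size and running mode, see below):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Raise the job server timeout (default: 60 seconds) so that workers busy
# with long-running bundles are not considered timed out.
options = PipelineOptions(
    direct_running_mode="multi_threading",
    job_server_timeout=65536,  # value discussed in the comments below
)
```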
To further corroborate this hypothesis, it has to be said that the same issue also happens with the `multi_processing` running mode. In fact, I had to change the default of this parameter to `65,536` to prevent this issue from happening with `multi_processing` on the former `1M`-lines dataset.
In this particular case (the `10M`-lines dataset) that value was still enough for `multi_processing` and `in_memory`, but not for `multi_threading`, for which I had to increase the value by `2-4x` to complete the execution (see below).
AGAIN (as expected), `multi_threading` was slightly worse than `in_memory`, since Python doesn't handle CPU-bound threads well and extra time is wasted in communication.
Considerations and Take-Away Messages
After these two experiments, and looking at the performance gain with the multiple configurations, I think it is fair to conclude that execution time scales up linearly w.r.t. the dataset size. So, in `multi_processing` mode:
- `1M` ==> `11 s`
- `10M` ==> `1 m`
- `100M` ==> `10 m`

whereas with `multi_threading` the whole completion time would presumably add up to `~1 h` for the `100M`-lines (almost full) dataset.
Sorry, I haven't had time yet.
No problem at all, I totally understand.
I have been meaning to issue another PR (as a follow up on this one) to align the second notebook of the tutorial, but I could not finish that either.
Maybe it would also be useful to get the feedback on this one first, and then finish the other notebook - anyway :)
Yeah, I think it's definitely worth reviewing the 1st Colab first and then making changes in the 2nd Colab.
Thanks again for contributing!