Optimizing long batch processes or ETL by using buff/cache properly II (parallelizing network operations)

In the previous post I have focused in avoiding as much as possible IO on disk and if that was not possible using buff/cache as much as possible by grouping in time IO operations. This approach can make our ETL processes run X times faster. In the two examples the numbers where:

  • Avoiding IO at all was 11,3 times faster
  • Using buff/cache was almost 4 times faster

All the examples used a dataset already in the disk so no real network operation occurred. In this post I am going to focus on network operation using again GNU parallel.

Continue reading

Share

Optimizing long batch processes or ETL by using buff/cache properly

During the COVID-19 I have invested some of the “free time” given by the lock down to refresh some old topics like capacity planning and command line optimizations.

In 2011 I got my LPIC-3 and while studying for the previous LPIC-2 two of the topics were Capacity Planning and Predict Future Resource Needs. To refresh this knowledge I recently took Matthew Pearson’s Linux Capacity Planning course from the LinuxAcademy

My interest in Data Science and Business Intelligence started with a course I took where the main tool used was Pentaho mostly PDI (aka Kettle) for ETL jobs and Report Designer for reports automation. Then I continued with Waikato’s university WEKA courses and this path drove me to read Jeroen JanssensData Science at the Command Line book which I have recently re-read again. In his book, Jeroen uses Ole’s Tange GNU parallel a tool I have already written about in my A Quick and Neat 🙂 Orchestrator using GNU Parallel post

How are Linux Capacity Planning, ETL, command line and parallelization of jobs related you might wonder. Let’s dig into it

Continue reading

Share