Optimizing long batch processes or ETL by using buff/cache properly

During the COVID-19 lockdown I invested some of the “free time” it gave me in refreshing old topics such as capacity planning and command-line optimization.

I got my LPIC-3 in 2011, and while studying for the earlier LPIC-2, two of the topics were Capacity Planning and Predict Future Resource Needs. To refresh that knowledge I recently took Matthew Pearson’s Linux Capacity Planning course from LinuxAcademy.

My interest in Data Science and Business Intelligence started with a course whose main tool was Pentaho, mostly PDI (aka Kettle) for ETL jobs and Report Designer for report automation. I then continued with the University of Waikato’s WEKA courses, and this path led me to read Jeroen Janssens’ book Data Science at the Command Line, which I have recently re-read too. In this book Jeroen uses GNU parallel, a tool I have already written about in my post A Quick and Neat 🙂 Orchestrator using GNU Parallel.

Why are Linux capacity planning, ETL, the command line and job parallelization related, you might wonder? Let’s dig into it.

Capacity Planning

Capacity planning is about measuring resources, troubleshooting resource problems and planning for future growth. There are many tools to measure with:

  • RAM usage: top, htop, free
  • Load: top, htop
  • Disk stats: sar, iostat, iotop
  • Network: iftop
  • Process duration: time
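As a rough sketch of how these measurements can be scripted, the two helper functions below read the buffer and page cache size from /proc/meminfo and report how long a command takes in whole seconds. The function names are my own and not part of any tool. Summing Buffers, Cached and SReclaimable is an assumption that should roughly match the buff/cache column of free on recent procps versions.

```shell
#!/bin/sh
# Print the memory currently used for buffers and page cache, in kB.
# Assumption: Buffers + Cached + SReclaimable approximates the
# buff/cache column shown by free.
buff_cache_kb() {
    awk '/^(Buffers|Cached|SReclaimable):/ { sum += $2 } END { print sum + 0 }' /proc/meminfo
}

# Run a command, discard its output, and print the elapsed whole seconds.
# This is a coarse, scriptable alternative to the time command.
elapsed_seconds() {
    start=$(date +%s)
    "$@" > /dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}
```

Calling buff_cache_kb before and after a batch step, together with elapsed_seconds for the step itself, gives a first idea of how much of the data ended up in the cache and what the step cost in time.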

Once the measurements are taken, several actions can follow to improve performance, electricity consumption, network usage, etc.

Hard disk vs RAM performance vs buff/cache

From www.condusiv.com we can read the following about RAM and hard disk performance:

A computer is only as fast as its slowest component, and the disk is by far the slowest part of your computer. It is at least 100,000 times slower than your RAM (Random Access Memory) and over 2 million times slower than your CPU. When it comes to the speed of your PC, focus on disk performance first, before you opt for more memory or a faster CPU.

What if, instead of focusing on disk performance, we focus on avoiding disk access altogether, or at least as much as possible? For this we can use the Linux kernel’s buff/cache capabilities.

What is buff/cache

To optimize the performance of a Linux system, it is very important that the page cache is used as much as possible, and in an efficient way, by all processes. The more buff/cache is used, the less disk IO is needed.

The Page cache definition from www.thegeekdiary.com says:

Page cache is memory held after reading files. Linux kernel prefers to keep unused page cache assuming files being read once will most likely to be read again in the near future, hence avoiding the performance impact on disk IO.

buff/cache can be emptied using the following command (as root):

sync; echo 3 > /proc/sys/vm/drop_caches
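To see the page cache at work, a minimal sketch is to read the same file twice and compare the timings. The helper names and the file path below are hypothetical. Keep in mind that a file that was just written is usually already in the cache, so for a truly cold first read the caches have to be dropped first (as root).

```shell
#!/bin/sh
# Create a test file filled with zeros.
# Usage: make_test_file PATH SIZE_MIB
make_test_file() {
    dd if=/dev/zero of="$1" bs=1M count="$2" 2>/dev/null
}

# Read a file from start to end and discard the data.
read_file() {
    cat "$1" > /dev/null
}

# Example session (bash), left as comments because dropping caches needs root:
#   make_test_file /tmp/cache_demo.bin 512
#   sync; echo 3 > /proc/sys/vm/drop_caches
#   time read_file /tmp/cache_demo.bin   # cold: data comes from disk
#   time read_file /tmp/cache_demo.bin   # warm: usually served from the page cache
#   rm -f /tmp/cache_demo.bin
```

How large the difference is depends on the disk, the file size and the amount of free RAM, so the numbers are only meaningful on the machine where they were measured.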

Batches and ETL pipelines

Many batch and ETL processes can be modeled as serial, parallel or mixed pipelines. We might think that parallel processes will always beat serial ones by orders of magnitude, but the law of diminishing returns says otherwise, depending on how the resources scale. See below.

A common ETL process will have the following steps:

  • Data downloading.
  • Data unzipping/preparation.
  • Data processing.
  • Data loading/visualization.

Sounds familiar, right? Two real-life examples:

  • A batch of logs that has to be collected, unzipped, processed and loaded into a database.
  • A batch of stock market prices that has to be unzipped, cleaned, processed and finally ingested into a Hadoop cluster for further processing.
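The log example can be sketched as a serial and a parallel pipeline over a set of gzipped files. This is only an illustration under a few assumptions: counting lines with wc -l stands in for the real processing step, the function names are made up, and xargs -P is used as a simple stand-in for GNU parallel so that nothing extra has to be installed.

```shell
#!/bin/sh
# Decompress one log and count its lines.
# (Stand-in for the real "unzip" and "process" steps.)
process_one() {
    gzip -dc "$1" | wc -l
}

# Serial pipeline: handle the files one after another.
# Usage: run_serial FILE...
run_serial() {
    for f in "$@"; do
        process_one "$f"
    done
}

# Parallel pipeline: handle up to NJOBS files at the same time.
# The command is inlined because shell functions are not visible
# inside the sh started by xargs.
# Usage: run_parallel NJOBS FILE...
run_parallel() {
    njobs=$1
    shift
    printf '%s\n' "$@" | xargs -P "$njobs" -I{} sh -c 'gzip -dc "$1" | wc -l' _ {}
}
```

Both functions print one line count per file. The parallel version prints them in completion order, so the output order can differ from the input order.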

Law of diminishing returns

From the definition of the law of diminishing returns:

The law of diminishing returns states that in all productive processes, adding more of one factor of production, while holding all others constant (“ceteris paribus“), will at some point yield lower incremental per-unit returns.[1] The law of diminishing returns does not imply that adding more of a factor will decrease the total production, a condition known as negative returns, though in fact this is common.

Or in other words, if the resources associated with the process do not scale with the number of processors, then merely adding processors provides ever lower returns. For example, if we parallelize a process too much, the CPUs can end up starved of IO: they all compete for disk and network IO, and none of them gets the resources it needs for optimum performance.
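One way to check where the returns start to diminish on a given machine is to run the same job at several parallelism levels and compare the elapsed times. The function below is a hedged sketch: its name is hypothetical, compressing each file with gzip is just a stand-in workload that mixes CPU and disk IO, and the timing is only accurate to the second.

```shell
#!/bin/sh
# Run the same job over a list of files at several parallelism levels
# and print "jobs elapsed_seconds" for each level.
# Usage: sweep_parallelism "LEVELS" FILE...
sweep_parallelism() {
    levels=$1
    shift
    for n in $levels; do
        start=$(date +%s)
        # Stand-in workload: compress each file and throw the result away.
        printf '%s\n' "$@" | xargs -P "$n" -I{} sh -c 'gzip -c "$1" > /dev/null' _ {}
        end=$(date +%s)
        echo "$n $((end - start))"
    done
}

# Example (left as a comment, file names are placeholders):
#   sweep_parallelism "1 2 4 8" /path/to/batch/*.log
```

After the first level has run, the input files are probably in the page cache, so the later levels read from RAM rather than from disk. For a fair comparison either warm the cache before the first level or drop it before every level.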

