Optimizing long batch processes or ETL by using buff/cache properly

During the COVID-19 lockdown I invested some of the “free time” it gave me in refreshing some old topics like capacity planning and command line optimization.

In 2011 I got my LPIC-3, and while studying for the previous LPIC-2 two of the topics were Capacity Planning and Predict Future Resource Needs. To refresh this knowledge I recently took Matthew Pearson’s Linux Capacity Planning course at LinuxAcademy.

My interest in Data Science and Business Intelligence started with a course where the main tool was Pentaho, mostly PDI (aka Kettle) for ETL jobs and Report Designer for report automation. I then continued with the University of Waikato’s WEKA courses, and this path led me to Jeroen Janssens’s Data Science at the Command Line book, which I have recently re-read. In his book, Jeroen uses Ole Tange’s GNU parallel, a tool I have already written about in my A Quick and Neat 🙂 Orchestrator using GNU Parallel post.

How are Linux capacity planning, ETL, the command line and job parallelization related, you might wonder? Let’s dig into it.

Capacity Planning

Capacity planning is about measuring resources, troubleshooting resource problems and planning future growth. There are many tools that can be used to measure:

  • RAM usage: top, htop, free
  • Load: top, htop
  • Disk stats: sar, iostat, iotop
  • Network: iftop
  • Process duration: time

With all the measures at hand, several actions can be taken to improve performance, electricity consumption, network usage, etc.
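As a minimal sketch of the measuring step (using only free, time and coreutils; the seq command below is a stand-in for a real job, not an actual ETL):

```shell
free -h                          # the "buff/cache" column shows page-cache usage
time seq 1 1000000 > /dev/null   # any job works here; real/user/sys go to stderr
free -h                          # compare buff/cache before and after the job
```

Wrapping a job between two free snapshots like this is the quickest way to see how much page cache the job itself pulled in.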

Hard disk vs RAM performance vs buff/cache

From www.condusiv.com we can read the following about RAM and hard disk performance:

A computer is only as fast as its slowest component, and the disk is by far the slowest part of your computer. It is at least 100,000 times slower than your RAM (Random Access Memory) and over 2 million times slower than your CPU. When it comes to the speed of your PC, focus on disk performance first, before you opt for more memory or a faster CPU.

What if, instead of focusing on disk performance, we focus on avoiding disk access entirely, or at least as much as possible? For this we can use the Linux kernel’s buff/cache capabilities or a proper pipeline architecture. In this post we will see both approaches.

What is buff/cache

To optimize the performance of a Linux system it is very important that the page cache is used as much as possible, and in an efficient way, by all processes. The more the buff/cache is used, the less disk IO is needed.

The Page cache definition from www.thegeekdiary.com says:

Page cache is memory held after reading files. Linux kernel prefers to keep unused page cache assuming files being read once will most likely to be read again in the near future, hence avoiding the performance impact on disk IO.

buff/cache can be emptied using the following command (as root):

sync; echo 3 > /proc/sys/vm/drop_caches
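For reference, the value written selects what gets dropped (per the kernel’s vm sysctl documentation); sync first so dirty pages are flushed to disk and can actually be reclaimed. All of these are non-destructive: only clean, already-written data is dropped.

```shell
sync                                # flush dirty pages to disk first
echo 1 > /proc/sys/vm/drop_caches   # free page cache only
echo 2 > /proc/sys/vm/drop_caches   # free reclaimable slab objects (dentries, inodes)
echo 3 > /proc/sys/vm/drop_caches   # free both (what this post uses)
```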

Batch and ETL pipelines

Many batch and ETL processes can be modeled as serial, parallel or mixed pipelines. We might think that parallel processes will always beat serial ones by orders of magnitude, but the law of diminishing returns says otherwise when the underlying resources do not scale. See below.

A common ETL process will have the following steps:

  • Data downloading.
  • Data unzipping/preparation.
  • Data processing.
  • Data loading/visualization

Sounds familiar, right? Two real-life examples:

  • A bunch of logs that have to be collected, unzipped, processed and loaded into a database
  • A bunch of compressed stock market tick files that have to be unzipped, cleaned, processed and finally ingested into a Hadoop cluster for further processing

Law of diminishing returns

From the Wikipedia article on the law of diminishing returns:

The law of diminishing returns states that in all productive processes, adding more of one factor of production, while holding all others constant (“ceteris paribus“), will at some point yield lower incremental per-unit returns.[1] The law of diminishing returns does not imply that adding more of a factor will decrease the total production, a condition known as negative returns, though in fact this is common.

Or in other words: if the resources associated with the process do not scale with the number of processors, then merely adding processors yields ever lower returns. For example, if we parallelize a process too much, all the CPUs can suffer resource starvation, as they will all be fighting for disk/network IO and none will get the resources needed for optimum performance.
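One practical way to watch the returns diminish is to sweep the degree of parallelism over the same workload. A rough sketch using xargs -P and a CPU-bound stand-in task (hashing zeros); past the machine’s core count the wall-clock time reported by time stops improving, and with IO-bound tasks it can even get worse:

```shell
for j in 1 2 4 8; do
  echo "jobs=$j"
  # 8 identical jobs, at most $j running at once
  time seq 1 8 | xargs -P"$j" -I{} sh -c 'head -c 5000000 /dev/zero | sha256sum > /dev/null'
done
```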

How important is the proper usage of the buff/cache

In order to illustrate how important the buff/cache is for performance, and to see the effects of this disk IO starvation, we are going to launch some basic one-liner commands on a public NASDAQ data set from www.nasdaqtrader.com: data from 2005 that can be downloaded from their FTP server.

Emptying the buff/cache

First we clean the buff/cache using the command below.

root@mediacenter:~# sync; echo 3 > /proc/sys/vm/drop_caches
root@mediacenter:~# free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi       8,2Gi       5,9Gi       715Mi       1,6Gi       6,5Gi
Swap:            0B          0B          0B

After cleaning the buff/cache we still have 1.6G of memory used for buff/cache.

Getting the “real” Data

The following code will download (using GNU parallel) the data set for 2005.

The computer used for these tests has 4 cores and 16G of RAM, so the parallel wget command will download the files in batches of 4.

parallel wget ftp://ftp.nasdaqtrader.com/symboldirectory/regshopilot/NASDAQsh2005{1}.zip ::: {01..12}

We will end up with 12 zip files totaling 916M:

juan@mediacenter:~/tmp/post$ ls -lh *.zip
-rw-r--r-- 1 juan juan 71M may 30 22:51 NASDAQsh200501.zip
-rw-r--r-- 1 juan juan 69M may 30 22:52 NASDAQsh200502.zip
-rw-r--r-- 1 juan juan 77M may 30 22:54 NASDAQsh200503.zip
-rw-r--r-- 1 juan juan 76M may 30 22:55 NASDAQsh200504.zip
-rw-r--r-- 1 juan juan 75M may 30 22:57 NASDAQsh200505.zip
-rw-r--r-- 1 juan juan 78M may 30 22:58 NASDAQsh200506.zip
-rw-r--r-- 1 juan juan 75M may 30 23:00 NASDAQsh200507.zip
-rw-r--r-- 1 juan juan 81M may 30 23:01 NASDAQsh200508.zip
-rw-r--r-- 1 juan juan 75M may 30 23:03 NASDAQsh200509.zip
-rw-r--r-- 1 juan juan 84M may 30 23:05 NASDAQsh200510.zip
-rw-r--r-- 1 juan juan 82M may 30 23:06 NASDAQsh200511.zip
-rw-r--r-- 1 juan juan 78M may 30 23:08 NASDAQsh200512.zip
juan@mediacenter:~/tmp/post$ du -sh
916M    .

Please note that while we are downloading the files locally, the buff/cache is also being filled. So until the buff/cache is cleaned or overwritten with newer data, access to these files will be quicker.

juan@mediacenter:~/tmp/post$ free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi       8,4Gi       4,8Gi       782Mi       2,5Gi       6,3Gi
Swap:            0B          0B          0B

If we do some math: the 2.5G of buff/cache used after downloading the files, minus the previous value of 1.6G, is 0.9G, which is the size of all the files.

Simulating getting the data with cat

In order to simulate getting the data from the real source, but without consuming www.nasdaqtrader.com‘s bandwidth, we are going to use cat to read the files from the local filesystem instead of from the remote FTP server, putting them in the buff/cache. Then we will check, using the time command, that subsequent accesses take less time.

juan@mediacenter:~/tmp/post$ free -h
total used free shared buff/cache available
Mem: 15Gi 8,4Gi 5,7Gi 782Mi 1,6Gi 6,3Gi
Swap: 0B 0B 0B
juan@mediacenter:~/tmp/post$ time cat *.zip >/dev/null
real 0m6,649s
user 0m0,014s
sys 0m0,738s
juan@mediacenter:~/tmp/post$ time cat *.zip >/dev/null
real 0m0,239s
user 0m0,001s
sys 0m0,237s
juan@mediacenter:~/tmp/post$ time cat *.zip >/dev/null
real 0m0,211s
user 0m0,001s
sys 0m0,210s
juan@mediacenter:~/tmp/post$ free -h
total used free shared buff/cache available
Mem: 15Gi 8,4Gi 4,8Gi 782Mi 2,5Gi 6,3Gi
Swap: 0B 0B 0B

The first cat command took 6.649s, while the second and third attempts took less than 0.240s. Let’s do the math again: reading a file through the buff/cache was almost 28 times faster!

The buff/cache grows by the same 2.5G – 1.6G = 0.9G, meaning that the cat procedure simulates the download of the files perfectly at the buff/cache level.

Serial unzip with buff/cache VS parallel unzip without buff/cache

Now that we are comfortable working with the buff/cache, we are going to test two cases and see how they perform.

Serial unzip with buff/cache

juan@mediacenter:~/tmp/post$ free -h                                       
              total        used        free      shared  buff/cache   available
Mem:           15Gi       8,5Gi       5,4Gi       785Mi       1,8Gi       6,1Gi
Swap:            0B          0B          0B                                
juan@mediacenter:~/tmp/post$ time cat *.zip >/dev/null                     
                                                                           
real    0m6,113s                                                           
user    0m0,012s                                                           
sys     0m0,732s                                                           
juan@mediacenter:~/tmp/post$ free -h                                       
              total        used        free      shared  buff/cache   available
Mem:           15Gi       8,5Gi       4,5Gi       785Mi       2,7Gi       6,1Gi                                                             
Swap:            0B          0B          0B                                
juan@mediacenter:~/tmp/post$ time for i in  `ls *.zip`; do unzip -d uncompressed $i; done                             
Archive:  NASDAQsh200501.zip                                                                                          
  inflating: uncompressed/NASDAQsh20050128.txt                             
  inflating: uncompressed/NASDAQsh20050103.txt              
[....]
  inflating: uncompressed/NASDAQsh20051219.txt
  inflating: uncompressed/NASDAQsh20051216.txt
  inflating: uncompressed/NASDAQsh20051215.txt
  inflating: uncompressed/NASDAQsh20051214.txt

real    0m53,171s
user    0m32,330s
sys     0m5,866s

Unzipping the 12 files in a serial loop using the buff/cache took 53.171s.

Parallel unzip without buff/cache

juan@mediacenter:~/tmp/post$ free -h          
              total        used        free      shared  buff/cache   available
Mem:           15Gi       8,6Gi       5,7Gi       785Mi       1,3Gi       6,0Gi
Swap:            0B          0B          0B   
juan@mediacenter:~/tmp/post$ time ls *.zip |parallel  unzip -d uncompressed {}
Archive:  NASDAQsh200501.zip                  
checkdir:  cannot create extraction directory: uncompressed
           File exists                        
Archive:  NASDAQsh200502.zip                  
[...]
  inflating: uncompressed/NASDAQsh20051216.txt
  inflating: uncompressed/NASDAQsh20051215.txt
  inflating: uncompressed/NASDAQsh20051214.txt

real    0m50,119s
user    0m34,300s
sys     0m5,715s

Unzipping the 12 files in parallel, reading all the data from disk, took 50.119s.

Summary

A developer might think that the job is done and the program should perform better now that parallelization is in place, but the numbers say otherwise.

Both examples perform equivalently: 53.171s vs 50.119s. It seems that a parallel process reading from disk is equivalent to a serial process reading from the buff/cache.

Do not forget that the process also writes to the disk: the unzip command writes the .txt files. Writing to disk is a more expensive operation than reading. Is it really necessary to write the unzipped files to disk? Sometimes we don’t want the temporary files to be written at all.

With this example we have seen how important proper usage of the buff/cache is, and also how IO operations affect overall performance.

Real-life example: serial task vs parallelized task (using buff/cache)

In a real case, a serial ETL process could download terabytes of data to a server with several hundred gigabytes of RAM. This means that all downloaded files will be written to disk (while being downloaded), then read from disk to be unzipped, written again after being unzipped, and read once more while being processed. Yes, you read that right: this process will write to the disk (at least) twice and read from it twice.

This can be modeled as the serial example we saw before, but without using the buff/cache, since the buff/cache is being overwritten by newly arriving data, plus an intermediate processing task. In this case we will use the well-known wc -l command to count the lines in the txt files.

Serial unzip without buff/cache plus processing

First we clean the buff/cache

root@mediacenter:~# sync; echo 3 > /proc/sys/vm/drop_caches

Then we launch the same command as before:

juan@mediacenter:~/tmp/post$ time for i in  `ls *.zip`; do unzip -d uncompressed $i; done
Archive:  NASDAQsh200501.zip                  
  inflating: uncompressed/NASDAQsh20050128.txt
  inflating: uncompressed/NASDAQsh20050103.txt                
  inflating: uncompressed/NASDAQsh20050104.txt
  inflating: uncompressed/NASDAQsh20050105.txt
[...]
  inflating: uncompressed/NASDAQsh20051219.txt  
  inflating: uncompressed/NASDAQsh20051216.txt  
  inflating: uncompressed/NASDAQsh20051215.txt  
  inflating: uncompressed/NASDAQsh20051214.txt

real    1m27,308s
user    0m35,834s
sys     0m6,621s

Once the txt files are there, we have to clean the cache again and count the lines.

root@mediacenter:~# sync; echo 3 > /proc/sys/vm/drop_caches

Then we count the lines with the following code:

juan@mediacenter:~/elso$ time for i in `ls uncompressed`; do wc -l uncompressed/$i; done
641405 uncompressed/NASDAQsh20050103.txt
729480 uncompressed/NASDAQsh20050104.txt
[...]
493279 uncompressed/NASDAQsh20051229.txt
523629 uncompressed/NASDAQsh20051230.txt
real 1m4,668s
user 0m3,384s
sys 0m5,843s

Finally we add both times. 1m27,308s + 1m4,668s = 2m31,976s

Parallel unzip with buff/cache plus processing

In an ideal case we would download the files and process them in parallel as they arrive, in order to use the buff/cache as much as possible. This can be modeled as a parallel process with the buff/cache already filled, plus the processing part.

First we fill the cache

juan@mediacenter:~/tmp/post$ time cat *.zip >/dev/null

Now we launch the previous parallel command

juan@mediacenter:~/tmp/post$ time ls *.zip |parallel unzip -d uncompressed {}
Archive: NASDAQsh200502.zip
checkdir: cannot create extraction directory: uncompressed
File exists
Archive: NASDAQsh200503.zip
checkdir: cannot create extraction directory: uncompressed
File exists
Archive: NASDAQsh200501.zip
inflating: uncompressed/NASDAQsh20050128.txt
inflating: uncompressed/NASDAQsh20050103.txt
inflating: uncompressed/NASDAQsh20050104.txt
[...]
inflating: uncompressed/NASDAQsh20051215.txt
inflating: uncompressed/NASDAQsh20051214.txt
real 0m38,552s
user 0m40,749s
sys 0m6,195s

Then we count the lines without cleaning the buff/cache

juan@mediacenter:~/elso$ time ls uncompressed/*.txt |parallel wc -l {}
641405 uncompressed/NASDAQsh20050103.txt
[...]
523629 uncompressed/NASDAQsh20051230.txt
real 0m2,141s
user 0m3,387s
sys 0m2,985s

Finally we add both times 0m38,552s + 0m2,141s = 0m38,693s

Summary

The results of both cases are:

  • Serial: 2m31,976s
  • Parallel: 0m38,693s

Doing the math, we get the following improvement factor: 2m31,976s / 0m38,693s = 3,93

So our parallel approach is almost 4 times faster than the serial one. This might not have impressed you, but think of a long-running process launched every midnight that takes around 10 hours. It means the process will still be running during office production hours, hitting your databases and/or file servers and loading your network. Your users will probably complain about network or application slowness, and most probably your boss will complain too, because the automatic report expected from the ETL process hasn’t been delivered to his inbox by 8 AM.

With the parallel approach the users will not complain about the apps being slow, and the report will be in your boss’s inbox before 3 AM. Also, your boss will not have to buy those expensive SSD disks to make the ETL process run faster 🙂

Is there any room for improvement? Of course there is!

Until now I have focused on the buff/cache, but there is still room for a big improvement if we can avoid disk I/O operations altogether.

How can it be done? By changing the format of the received files. Zip files cannot be decompressed on the fly: we need the full file downloaded and written to disk before we can start unzipping it. Gzip, on the other hand, supports on-the-fly decompression.

From the wikipedia regarding ZIP format:

A ZIP file is correctly identified by the presence of an end of central directory record which is located at the end of the archive structure in order to allow the easy appending of new files.

https://en.wikipedia.org/wiki/ZIP_(file_format)#End_of_central_directory_record_(EOCD)
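The difference is easy to demonstrate locally: with gzip the whole compress-decompress-process chain can run as one pipeline, with no intermediate file ever touching the disk:

```shell
# gzip compresses the stream, zcat decompresses it as it arrives,
# and wc -l counts the lines on the fly: no file is ever written.
seq 1 1000 | gzip | zcat | wc -l
```

In a real ETL the seq would be a download, so decompression and processing would start while the transfer is still in flight.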

Let’s do it. First, create the gz files from the .txt files:

juan@mediacenter:~/elso/uncompressed$ ls *.txt | parallel gzip -9 {}

Now fill the buff/cache:

juan@mediacenter:~/elso/uncompressed$ cat *.txt.gz >/dev/null

Let’s count again

juan@mediacenter:~/elso/uncompressed$ time ls *.txt.gz |parallel zcat {} |wc -l
139289447
real 0m13,432s
user 0m33,932s
sys 0m13,533s

Wow!! Only 13,432s against the 2m31,976s of the full serial approach. Avoiding all the IO operations performs 11,3 times better. That’s serious stuff!!!

By now you should be contacting the owner of the zip files and convincing them (with your boss’s help) to provide you with .gz files instead.

Conclusion

Avoiding disk IO operations as much as possible should be a priority in all cases. If data has to be written to disk, try to use the buff/cache as much as possible.

The buff/cache can help reduce IO usage if we are able to group in time the reading operation (network or file) and the processing to be done with its content, as this increases the odds that the Linux kernel finds the required data in the buff/cache (RAM) rather than on the network or the disk. GNU parallel is a fantastic tool that can help us achieve this grouping in time with only a few extra lines of code.

To be continued…

Since we haven’t really worked with the network in this post, we haven’t seen the benefits of parallelizing network operations to work around TCP’s congestion control and improve network throughput.

The next two articles are already on the air:

Optimizing long batch processes or ETL by using buff/cache properly II (parallelizing network operations)

Optimizing long batch processes or ETL by using buff/cache properly III (full workflow)
