Optimizing long batch processes or ETL by using buff/cache properly

Real life example

In order to illustrate this IO resource starvation problem we are going to run some basic one-liner commands on a public NASDAQ data set from www.nasdaqtrader.com: the 2005 data, which can be downloaded from their FTP server.

Emptying the buff/cache

First of all, in order to establish a baseline, the buffer cache should be emptied. After dropping the caches we are still left with 1.6G of memory used for buff/cache:

root@mediacenter:~# sync; echo 3 > /proc/sys/vm/drop_caches
root@mediacenter:~# free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi       8,2Gi       5,9Gi       715Mi       1,6Gi       6,5Gi
Swap:            0B          0B          0B
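
As a side note, if dropping the whole system cache is too aggressive, a reasonably recent GNU coreutils dd can advise the kernel to evict the cached pages of a single file instead (a sketch; the nocache flag is only advisory and data not yet persisted to disk is not dropped):

# advise the kernel to drop the cached pages of one file only
dd if=NASDAQsh200501.zip iflag=nocache count=0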

Getting the “real” Data

The following command uses GNU parallel to download the data set for 2005.

The computer used for these tests has 4 cores and 16G of RAM, so parallel will run the wget commands in batches of four, one job per core.

parallel wget ftp://ftp.nasdaqtrader.com/symboldirectory/regshopilot/NASDAQsh2005{1}.zip ::: {01..12}
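
By default GNU parallel runs one job per CPU core. If the FTP server or the available bandwidth is the limiting factor, the number of simultaneous downloads can be capped with the -j option (shown here with 2 concurrent jobs as an arbitrary example):

parallel -j 2 wget ftp://ftp.nasdaqtrader.com/symboldirectory/regshopilot/NASDAQsh2005{1}.zip ::: {01..12}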

We end up with 12 zip files totalling 916M:

juan@mediacenter:~/tmp/post$ ls -lh *.zip
-rw-r--r-- 1 juan juan 71M may 30 22:51 NASDAQsh200501.zip
-rw-r--r-- 1 juan juan 69M may 30 22:52 NASDAQsh200502.zip
-rw-r--r-- 1 juan juan 77M may 30 22:54 NASDAQsh200503.zip
-rw-r--r-- 1 juan juan 76M may 30 22:55 NASDAQsh200504.zip
-rw-r--r-- 1 juan juan 75M may 30 22:57 NASDAQsh200505.zip
-rw-r--r-- 1 juan juan 78M may 30 22:58 NASDAQsh200506.zip
-rw-r--r-- 1 juan juan 75M may 30 23:00 NASDAQsh200507.zip
-rw-r--r-- 1 juan juan 81M may 30 23:01 NASDAQsh200508.zip
-rw-r--r-- 1 juan juan 75M may 30 23:03 NASDAQsh200509.zip
-rw-r--r-- 1 juan juan 84M may 30 23:05 NASDAQsh200510.zip
-rw-r--r-- 1 juan juan 82M may 30 23:06 NASDAQsh200511.zip
-rw-r--r-- 1 juan juan 78M may 30 23:08 NASDAQsh200512.zip
juan@mediacenter:~/tmp/post$ du -sh
916M    .

Please note that while we are downloading the files to the local disk the buff/cache is also being filled, so until the buff/cache is dropped or overwritten with newer data, access to these files will be quicker.

juan@mediacenter:~/tmp/post$ free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi       8,4Gi       4,8Gi       782Mi       2,5Gi       6,3Gi
Swap:            0B          0B          0B

If we do some math: the 2.5G of buff/cache used after downloading the files minus the previous value of 1.6G leaves 0.9G, which is exactly the size of all the files.
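
If you want to check which files are actually resident in the page cache, the fincore utility from util-linux reports the number of cached pages per file (a quick verification, assuming fincore is available on your distribution):

fincore *.zip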

Simulating getting the data with cat

In order to simulate getting the data from the real source without consuming www.nasdaqtrader.com's bandwidth, we are going to use cat to read the files from the local filesystem instead of from the remote FTP server, putting them into the buff/cache. Then we will check with the time command that the next accesses take less time.

juan@mediacenter:~/tmp/post$ free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi       8,4Gi       5,7Gi       782Mi       1,6Gi       6,3Gi
Swap:            0B          0B          0B
juan@mediacenter:~/tmp/post$ time cat *.zip >/dev/null
real 0m6,649s
user 0m0,014s
sys 0m0,738s
juan@mediacenter:~/tmp/post$ time cat *.zip >/dev/null
real 0m0,239s
user 0m0,001s
sys 0m0,237s
juan@mediacenter:~/tmp/post$ time cat *.zip >/dev/null
real 0m0,211s
user 0m0,001s
sys 0m0,210s
juan@mediacenter:~/tmp/post$ free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi       8,4Gi       4,8Gi       782Mi       2,5Gi       6,3Gi
Swap:            0B          0B          0B

The first cat command took 6.649s while the second and third attempts took 0.239s or less. Let's do the math again: reading the files through the buff/cache was almost 28 times faster!

The buff/cache grows by the same amount, 2.5G - 1.6G = 0.9G, meaning that at the buff/cache level the cat procedure simulates the download of the files perfectly.

Serial unzip with buff/cache VS parallel unzip without buff/cache

Now that we are comfortable working with buff/cache we are going to test two cases and see how they perform.

Serial unzip with buff/cache

juan@mediacenter:~/tmp/post$ free -h                                       
              total        used        free      shared  buff/cache   available
Mem:           15Gi       8,5Gi       5,4Gi       785Mi       1,8Gi       6,1Gi
Swap:            0B          0B          0B                                
juan@mediacenter:~/tmp/post$ time cat *.zip >/dev/null                     
                                                                           
real    0m6,113s                                                           
user    0m0,012s                                                           
sys     0m0,732s                                                           
juan@mediacenter:~/tmp/post$ free -h                                       
              total        used        free      shared  buff/cache   available
Mem:           15Gi       8,5Gi       4,5Gi       785Mi       2,7Gi       6,1Gi                                                             
Swap:            0B          0B          0B                                
juan@mediacenter:~/tmp/post$ time for i in  `ls *.zip`; do unzip -d uncompressed $i; done                             
Archive:  NASDAQsh200501.zip                                                                                          
  inflating: uncompressed/NASDAQsh20050128.txt                             
  inflating: uncompressed/NASDAQsh20050103.txt              
[....]
  inflating: uncompressed/NASDAQsh20051219.txt
  inflating: uncompressed/NASDAQsh20051216.txt
  inflating: uncompressed/NASDAQsh20051215.txt
  inflating: uncompressed/NASDAQsh20051214.txt

real    0m53,171s
user    0m32,330s
sys     0m5,866s

Unzipping the 12 files in a serial loop, with the data already in buff/cache, took 53.171s.

Parallel unzip without buff/cache

juan@mediacenter:~/tmp/post$ free -h          
              total        used        free      shared  buff/cache   available
Mem:           15Gi       8,6Gi       5,7Gi       785Mi       1,3Gi       6,0Gi
Swap:            0B          0B          0B   
juan@mediacenter:~/tmp/post$ time ls *.zip |parallel  unzip -d uncompressed {}
Archive:  NASDAQsh200501.zip                  
checkdir:  cannot create extraction directory: uncompressed
           File exists                        
Archive:  NASDAQsh200502.zip                  
[...]
  inflating: uncompressed/NASDAQsh20051216.txt
  inflating: uncompressed/NASDAQsh20051215.txt
  inflating: uncompressed/NASDAQsh20051214.txt

real    0m50,119s
user    0m34,300s
sys     0m5,715s

Unzipping the 12 files in parallel, reading all the data from disk, took 50.119s.

Summary

Both runs perform equivalently: 53.171s vs 50.119s. It seems that a parallel process reading from disk is roughly equivalent to a serial process reading from the buff/cache.

The real performance booster should be parallelization combined with buff/cache usage. Let's see how it goes.

juan@mediacenter:~/tmp/post$ free -h          
              total        used        free      shared  buff/cache   available
Mem:           15Gi       8,4Gi       6,1Gi       771Mi       1,2Gi       6,3Gi
Swap:            0B          0B          0B   
juan@mediacenter:~/tmp/post$ time cat *.zip >/dev/null
                                              
real    0m8,276s                              
user    0m0,012s                              
sys     0m0,738s                              
juan@mediacenter:~/tmp/post$ free -h          
              total        used        free      shared  buff/cache   available
Mem:           15Gi       8,4Gi       5,2Gi       771Mi       2,1Gi       6,3Gi
Swap:            0B          0B          0B   
juan@mediacenter:~/tmp/post$ time ls *.zip |parallel  unzip -d uncompressed {}
Archive:  NASDAQsh200501.zip               
  inflating: uncompressed/NASDAQsh20050128.txt
  inflating: uncompressed/NASDAQsh20050103.txt
[...]
  inflating: uncompressed/NASDAQsh20051216.txt
  inflating: uncompressed/NASDAQsh20051215.txt
  inflating: uncompressed/NASDAQsh20051214.txt

real    0m45,837s
user    0m43,599s
sys     0m7,291s

Unzipping the 12 files in parallel using buff/cache took 45.837s. It is only a little bit better. Disappointing, right?

What do all three runs have in common? They write to the hard disk, the slowest component in the whole pipeline.
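
One way to confirm that the disk is the bottleneck is to watch device utilization in another terminal while the unzip runs, for example with iostat from the sysstat package (assuming it is installed; -x shows extended statistics, -z hides idle devices and 1 refreshes every second):

iostat -xz 1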

Is it really necessary to write the files to disk at all? Maybe not: usually what we want is the data inside the zip files. In these cases it is better to unzip the files, do the processing needed, and skip writing to the hard disk.

Let’s see how it goes by just counting the lines of the unzipped files and not writing anything to disk:

Parallel

juan@mediacenter:~/tmp/post$ time ls *.zip |parallel  unzip -c  {} |wc -l
165414981

real    0m24,252s
user    0m41,877s
sys     0m15,888s

Serial

juan@mediacenter:~/tmp/post$ time for i in  `ls *.zip`; do unzip -c $i; done |wc -l
165414981

real    0m39,785s
user    0m37,365s
sys     0m6,174s

These numbers look far better. 🙂 The parallel job performs better than the serial one in this situation.
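
The line count is just a stand-in: any real processing works the same way as long as it reads from the pipe instead of from extracted files. As an illustrative sketch, assuming the txt files are pipe-delimited with the symbol in the first column (the real layout may differ), a per-symbol record count could be produced without writing any intermediate file to disk:

ls *.zip | parallel unzip -p {} | awk -F'|' '{count[$1]++} END {for (s in count) print s, count[s]}' > symbol_counts.txt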

Conclusion

Avoiding disk IO operations as much as possible should be a priority in all cases.

Buff/cache can help reduce IO if we are able to group in time the reading operation (network or file) and the processing to be done on its content, as this increases the odds that the Linux kernel finds the required data in the buff/cache (RAM) rather than having to fetch it again from the network or the disk.
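
A minimal sketch of that idea, reusing this post's data set: each month is downloaded and processed within the same job, so the zip file is still hot in the buff/cache when it is read back (not a real queue system yet, just the "group in time" principle):

parallel 'wget -q ftp://ftp.nasdaqtrader.com/symboldirectory/regshopilot/NASDAQsh2005{}.zip && unzip -p NASDAQsh2005{}.zip | wc -l' ::: {01..12}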

Homework

Imagine two big ETL processes that have to download TBs of data for processing.

  • One that downloads all the data (overwriting the buff/cache) and processes it as it comes
  • One that uses a queue system to download each piece of data only moments before it is processed

Which one will be quicker? And cheaper?

To be continued…

In the coming post we will try to create a very simple queue system using parallel and use the usual Linux capacity tools (sar, iotop, dstat, iftop, htop, etc.) to find bottlenecks and produce performance reports.
