Optimizing long batch processes or ETL by using buff/cache properly III (full workflow)

In the two previous post we have seen how disk IO and network IO affects our ETLs. For both use cases we have seen several techniques that could be used to improve drastically performance and drive to an efficient resource usage:

  • Avoid IO disk at all.
  • Use buff/cache properly if IO disk couldn’t be avoided.
  • Optimize data download by choosing the right file format, use the Keep-Alive properly and parallelize network operations.

In this post we are going to put together network and processing operations to see the improvement in a complete workflow.

Dataset to be used

In this case we are going to use the allCountries.zip file with some modifications from  Geonames.

The GeoNames geographical database covers all countries and contains over eleven million placenames that are available for download free of charge.

I have downloaded the file to my web server unzipped it and gzipped again. Remember that .gz is our favorite file format to avoid disk IO.

root@do1:/var/www/html/elsotanillo.net/wp/wp-content/uploads/parallel/allCountries# ls -lh all*
-rw-r--r-- 1 root root 1.5G Jun 23 02:54 allCountries.txt
-rw------- 1 root root 164M Jun 23 11:28 allCountries.txt.gz

Once we have got the big file we are going to split it in around 100K small files:

split -a 6 -d -l 122 allCountries.txt

And then gzip these pieces:

ls x* |parallel gzip -9 {}

Finally we delete the allCountries.txt as it is not going to be used anymore.

So, at this moment we have 99256 gzipped small files ( less than 7K on average) with a total size of 548M and the gzipped version allCountries.txt.gz (164M).

The task to be performed is the full workflow we have seen in previous posts but mixed  together:

  1. Download the files
  2. Count the file lines

Downloading serial vs parallel (with and without Keep-Alive)

In order to download the files we are going to create a urls.txt file that will be feed to the wget command:

parallel -j4 echo http://www.elsotanillo.net/wp-content/uploads/parallel/allCountries/x{}.gz >urls.txt ::: `seq -w 00000 99999`

As I know that we have 99256 files I am going to remove 99256 and subsequent lines from the urls.txt as those files do not exist.

Downloading small .gz files in serial mode (without the Keep-Alive) and writing to disk

With the following command we are going to simulate a for loop that reads the urls from the urls.txt and launch the wget command. This is the worst possible case as we will not be using the Keep-Alive feature and we will be placed at the very beginning of the sawtooth bandwidth graph.

juan@mediacenter:~/elsotanillo$ time cat urls.txt | parallel -j1 wget -q {}
real 297m35,349s
user 7m44,317s
sys 12m11,234s

Which is 30.69 kB/s Again pretty bad :(. This is our worse case scenario and will be our baseline.

You might think NO way!!! I have done this before (the loop feeding a wget or curl command to download a bunch of files) and the performance wasn’t that bad, Probably you were working with files bigger than 500KB therefore you were working at the end of the sawtooth bandwith graph. As you can see,  in this case (file) size really matters 😛

Downloading small .gz files in serial mode (with the Keep-Alive) and writing to disk

Let’s do the same approach but using the Keep-Alive feature. Wget support reading a file with the urls. This way we will be reusing connections.

juan@mediacenter:~/elsotanillo$ time wget -q -i urls.txt
real 68m9,460s
user 0m18,868s
sys 0m30,370s

Which is 134 kB/s So reusing connections with the Keep-Alive shows a big improvement. the drawback with this reading the file approach is that it cannot parallelize the downloads.

Downloading small .gz files in parallel mode (with the Keep-Alive) and writing to disk

The parallel approach with Keep-Alive looks like this:  (using 100 jobs in parallel)

time cat urls.txt | parallel -j100 -m wget -q {}
real 3m2,536s
user 0m20,267s
sys 0m23,031s

Which is 3.002 MB/s Big improvement here :). Let’s do some math:

Parallel with Keep-alive vs serial without Keep-alive is (3.002 MB/s) / (30.69 kB/s) =  97.82 times faster.

Parallel with Keep-alive vs serial with Keep-alive is (3.002 MB/s) / (134 kB/s) =  22.4 times faster.

Counting lines without buff/cache vs using buff/cache

In the first post we have seen the performance booster buff/cache (vs read from disk) is if we cannot avoid writing to disk (that should be always your priority).

Counting lines without using the buff/cache and writing temp files to disk

First we have to clear the buff/cache (as root). We are clearing this to simulate a real case scenario. If you are downloading a lot of data into your disks before processing it, the odds that the buff/cache is overwritten with newer data is very high so you will be reading from disk (the slower component). Also you are gunziping the files and writing them to disk and read again from disk so another cleaning the buff/cache is in order in the middle

root@mediacenter:~# sync; echo 3 > /proc/sys/vm/drop_caches

Once buff/cache is cleaned, let's gunzip the files and write them to disk:

juan@mediacenter:~/elsotanillo$ time ls x*.gz | parallel -j1 'gunzip -c {} > {.}.txt'
real 19m52,731s
user 7m14,143s
sys 7m26,975s

Clear again the buff/cache:

root@mediacenter:~# sync; echo 3 > /proc/sys/vm/drop_caches

And finally read files and count lines :

juan@mediacenter:~/elsotanillo$ time cat x*.txt|wc -l
12109117
real 0m49,820s
user 0m1,076s
sys 0m6,544s

So the total time is 19m52,731s + 0m49,820s = 20m42,55s

I can think in an even worse use case: gunziping the files and writing them to disk in parallel and finally read them again in parallel mode before processing them. In this case the disk performance could be very bad as a lot of unnecessary read/write IO operations could be happening at the same time. We are not going to follow that dark path in this post.

Counting lines using buff/cache

Let’s fill the buff/cache (just in case) and count again:

juan@mediacenter:~/elsotanillo$ cat *.gz >/dev/null
juan@mediacenter:~/elsotanillo$ time zcat *.gz |wc -l
12109117
real 0m13,865s
user 0m10,526s
sys 0m2,423s

Big improvement here. Let’s do some math: (20m42,55s / 0m13,865s) = 89.62 times faster. It is really worthy group operations in time and use buff/cache

Downloading  and counting without writing to disk and using Keep-alive

Downloading and counting the gzipped big file in serial mode

If the data to be processed is provided in a big file, big enough that we can discard the sawtooth effect, the numbers look like this:

juan@mediacenter:~/elsotanillo$ time wget -q http://www.elsotanillo.net/wp-content/uploads/parallel/allCountries/allCountries.txt.gz -O - |zcat|wc -l
12109117
real 2m44,535s
user 0m14,521s
sys 0m7,067s

This could be an ideal case as we will be avoiding working at the very beginning of the sawtooth and we will not be writing to disk. That’s the reason the performance even in serial mode is so good. I see two drawbacks here:

  • We have an external dependency, that the data has to be provided in big files.
  • This task runs serially. For the downloading part doesn’t matter as the bandwidth should be equivalent to the parallel approach (see below). BUT  (a big but) once we have to process the data we will be using only one core :(. What a waste if we have many of them.

Downloading and counting small .gz files in parallel mode

If we are getting our data in small files we can use Gnu Parallel to avoid the two big problems with have seen:

juan@mediacenter:~/elsotanillo$ time cat urls.txt | parallel -j100 -m wget -q {} -O - |zcat|wc -l
12109117
real 2m40,392s
user 0m28,540s
sys 0m18,103

Summary

Let’s write down all the results in nice table:

Description

 

Download time

Counting time

 

Total

Improvement ratio

   

Small .gz files in serial mode (without the Keep-Alive) and writing to disk

 
297m35,349s
20m42,55s
 
318.3m
1(baseline)
   

small .gz files in serial mode (with the Keep-Alive) and writing to disk

 
68m9,460s
20m42,55s
 
98.16m
3.24
   

Small .gz files in parallel mode (with the Keep-Alive) and writing to disk

 
3m2,536s
20m42,55s
 
23.75m
13.40
   

 

               

Downloading and counting the gzipped big file in serial mode

 
2m44,535s
NA
 
2m44,535s
116.07
   
                 

Downloading and counting small .gz files in parallel mode

 
2m40,392s
NA
 
2m40,392s
119.07
   

Our big file serial approach and the small files parallel approach perform similar. In this particular case we are only working with ~100K files and a server with only 4 cores. The parallel approach should perform far better when the number of files growth to several millions and the server running the process has more cores.

Is there room for improvement?

I think so, now that we are streaming our data directly to our cores we can think in a proper architecture to our workflow. After some benchmarking we can discover if our workflow is more CPU or RAM demanding. Knowing this we can plan the kind of servers to be used.

We can ask our IT guys for specific RAM or CPU optimized VMs to be created for our workflow or we can even run it on external public clouds. DigitalOcean, OVH and other public cloud  providers allow us to customize the VMs to be spin up.

Another option I am thinking is using our own undercover night private cluster. Let’s dig in to this idea: Many companies have a lot of idle computers that remain switched on during nights or week-ends. There is a lot of wasted RAM and CPU on those computers. GNU Parallel support running our scripts in remote computers. Sounds good right? We can tell our boss not only that we don’t need those expensive SSD disks like in the first post, we can tell that we don’t even need a server at all for these heavy nightly tasks :P. We could have our own Seti@home-like cluster.

Conclusions:

  • Do not write to your disks ever :P. If this is not possible group in time the IO operations so you will be using buff/cache properly.
  • Small improvements multiplied per thousand or millions of files can show spectacular results. Using similar techniques I have improved existing workflows 300 times, from 5 hours to minutes.
  • Human beings are used to think serially. Gnu Parallel allows us to think serially but execute in parallel and across many computers.
  • In all my examples I have use one-liners. Gnu Parallel can launch Python, Java, etc programs with no or few modifications. Give your programs the power to use all your CPUs without the headache of writing specific multiprocessing programs.

Did you found these articles interesting?

Did you save some time/resources?

Did your boss save a lot of money?

Drop me a line in the comments 🙂

Share

Leave a Reply

Your email address will not be published. Required fields are marked *

 

This site uses Akismet to reduce spam. Learn how your comment data is processed.