Optimizing long batch processes or ETL by using buff/cache properly II (parallelizing network operations)

Posted on 2021-06-08 by Juan Sierra Pons

In the previous post I focused on avoiding disk IO as much as possible and, where that was not possible, on making the most of buff/cache by grouping IO operations in time. This approach can make our ETL processes run several times faster. In the two examples the numbers were:

Avoiding IO altogether was 11.3 times faster
Using buff/cache was almost 4 times faster

All the examples used a dataset already on disk, so no real network operations occurred. In this post I am going to focus on network operations, again using GNU Parallel.

Continue reading →

Spreading GNU Parallel by making a testimonial video

Posted on 2021-03-19 by Juan Sierra Pons

Several months ago I was asked to record a short video to spread the word about GNU Parallel. GNU Parallel is a fantastic tool, a Swiss army knife for process parallelization. With GNU Parallel you can:

Parallelize long boring pipelines with only a few extra lines of code.
Spread load across multiple servers.
Reduce IO in your systems by using buff/cache properly.
Remove IO altogether from your systems with fancy one-liners.
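
As a rough illustration of the first point, here is a minimal sketch of how a sequential per-file loop might be parallelized. It assumes GNU Parallel is installed; `total_lines` is a hypothetical helper name, the `*.log` pattern and the job count of 4 are arbitrary choices, and `wc -l` merely stands in for whatever slow step your pipeline runs.

```shell
#!/usr/bin/env bash
# Sketch only: total_lines is a hypothetical helper, not taken from the posts.
# wc -l stands in for any slow per-file step of a pipeline.

# Sequential version, one file at a time:
#   for f in "$dir"/*.log; do wc -l < "$f"; done

# Usage: total_lines DIR
total_lines() {
  # -print0 / -0 keep file names with spaces or newlines intact
  find "$1" -type f -name '*.log' -print0 |
    parallel -0 -j 4 'wc -l < {}' |   # one wc per file, 4 jobs at a time
    awk '{ sum += $1 } END { print sum + 0 }'
}
```

Calling `total_lines /var/log` would print the total number of lines in all `*.log` files under that directory. The only change with respect to the loop is that the per-file command is handed to `parallel` instead of being run inline.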

Until then I had already written 2 posts on my web page:

But making videos was a new world for me… Here you are:

Please do not apply any cat filter to it (vlc video.mp3|cat) 😛 😛 It is not funny! (April Fools’ Day is coming)

Optimizing long batch processes or ETL by using buff/cache properly

Posted on 2020-05-31 by Juan Sierra Pons

During COVID-19 I have invested some of the “free time” given by the lockdown in refreshing some old topics such as capacity planning and command line optimizations.

In 2011 I got my LPIC-3, and while studying for the earlier LPIC-2, two of the topics were Capacity Planning and Predict Future Resource Needs. To refresh this knowledge I recently took Matthew Pearson’s Linux Capacity Planning course from Linux Academy.

My interest in Data Science and Business Intelligence started with a course I took where the main tool was Pentaho, mostly PDI (aka Kettle) for ETL jobs and Report Designer for report automation. I then continued with the University of Waikato’s WEKA courses, and this path led me to read Jeroen Janssens’ Data Science at the Command Line, a book I have recently re-read. In his book, Jeroen uses Ole Tange’s GNU Parallel, a tool I have already written about in my A Quick and Neat 🙂 Orchestrator using GNU Parallel post.

You might wonder how Linux capacity planning, ETL, the command line and job parallelization are related. Let’s dig into it.

Continue reading →

A Quick and Neat :) Orchestrator using GNU Parallel

Posted on 2015-05-17 by Juan Sierra Pons

Sometimes you have to deal with servers that you don’t know anything about:

You are a short-term IT consultant with no previous knowledge of the environment.
The CMDB is out of order.
You are in a DR situation.
Or simply the main administrator is not there.

And you need to:

Run commands in parallel
Get info from many servers at a time
Troubleshoot DNS problems
Check how many servers are up and running

On my systems I use two orchestrators, MCollective and SaltStack (configured automatically using Puppet), that fulfill my needs. But let’s see a quick way to get an orchestrator up and running.
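
To give a flavour of the idea, here is a minimal sketch (not the code from the full post) that runs one check per host in parallel and lists the hosts that pass. `hosts_up` is a hypothetical helper name and the limit of 16 jobs is an arbitrary choice. Since on an unknown machine GNU Parallel may not be installed, the sketch falls back to `xargs -P`, which is a different tool swapped in only for that case.

```shell
#!/usr/bin/env bash
# Sketch only: hosts_up is a hypothetical helper, not taken from the post.

# Usage: hosts_up 'CHECK_CMD' < hosts.txt
# CHECK_CMD is run once per host, with the host name appended as last argument.
# Prints, sorted, the hosts for which CHECK_CMD exited with status 0.
hosts_up() {
  if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # GNU Parallel: {} is replaced by each input line, 16 jobs at a time
    parallel -j 16 "$1 {} >/dev/null 2>&1 && echo {}" || true
  else
    # Fallback (assumption: GNU Parallel is missing): plain xargs -P
    xargs -P 16 -I{} sh -c \
      'check=$1; host=$2; $check "$host" >/dev/null 2>&1 && echo "$host"' \
      _ "$1" {} || true
  fi | sort
}
```

With a file containing one host name per line, something like `hosts_up 'ping -c 1 -W 1' < hosts.txt | wc -l` would then count the servers that answer a ping; the exact `ping` flags vary between systems, so treat them as an assumption too.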

Continue reading →

El Sotanillo de Juan Sierra Pons

Linux, Open Source, Bash, Virtualization, Cloud, Puppet, DevOps, Blog, Travels, etc.

Tag Archives: Gnu Parallel

