In the previous post I focused on avoiding disk IO as much as possible and, where that was not possible, on making the most of buff/cache by grouping IO operations in time. This approach can make our ETL processes run several times faster. In the two examples the numbers were:
Avoiding IO altogether was 11.3 times faster
Using buff/cache was almost 4 times faster
All the examples used a dataset already on disk, so no real network operation occurred. In this post I am going to focus on network operations, again using GNU parallel.
Several months ago I was asked to record a small video to spread the word about GNU Parallel. GNU Parallel is a fantastic tool, a Swiss Army knife for process parallelization. With GNU Parallel you can:
Until then I had already written 2 posts on my web page:
But making videos was a new world for me… Here you are:
Please do not apply any cat filter to it (vlc video.mp3|cat) 😛 😛 It is not funny! (April Fools’ Day is coming)
During COVID-19 I invested some of the “free time” given by the lockdown in refreshing some old topics such as capacity planning and command line optimizations.
In 2011 I got my LPIC-3, and while studying for the previous LPIC-2 two of the topics were Capacity Planning and Predict Future Resource Needs. To refresh this knowledge I recently took Matthew Pearson’s Linux Capacity Planning course from LinuxAcademy.
My interest in Data Science and Business Intelligence started with a course I took where the main tool used was Pentaho, mostly PDI (aka Kettle) for ETL jobs and Report Designer for report automation. Then I continued with the University of Waikato’s WEKA courses, and this path led me to read Jeroen Janssens’ book Data Science at the Command Line, which I have recently re-read. In his book, Jeroen uses Ole Tange’s GNU parallel, a tool I have already written about in my A Quick and Neat 🙂 Orchestrator using GNU Parallel post.
How are Linux Capacity Planning, ETL, the command line and job parallelization related, you might wonder. Let’s dig into it.
Sometimes you have to deal with servers that you don’t know anything about:
You are a short-term IT consultant with no previous knowledge of the environment.
The CMDB is out of order.
You are in a DR situation.
Or simply the main administrator is not there.
And you need to:
Run commands in parallel
Get info from many servers at a time
Troubleshoot DNS problems
Check how many servers are up and running
On my systems I use two orchestrators, MCollective and SaltStack (configured automatically using Puppet), that fulfill my needs. But let’s see a quick way to set up an orchestrator.
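As a rough idea of what such a quick orchestrator could look like, here is a minimal sketch with GNU Parallel. It is not the method from the full post. The function names (summarise, sweep, run_everywhere) and the hosts file passed as an argument are hypothetical, the ping flags assume Linux iputils, and the SSH variant assumes key-based access to every host listed.

```shell
# Summarise "<host> up|down" lines read from stdin into a single count line.
summarise() {
  awk '{ n[$2]++ } END { printf "up=%d down=%d\n", n["up"], n["down"] }'
}

# Probe every host listed in the file $1 (one hostname per line, assumed format),
# 20 at a time, and report how many answered a single ping.
# Assumption: "ping -c 1 -W 1" means one packet, one-second timeout (Linux iputils).
sweep() {
  parallel -j 20 \
    'ping -c 1 -W 1 {} >/dev/null 2>&1 && echo {} up || echo {} down' \
    :::: "$1" | summarise
}

# Run the command $2 once on every host in the sshlogin file $1,
# prefixing each output line with the host that produced it.
# Assumption: passwordless SSH works for all hosts in the file.
run_everywhere() {
  parallel --tag --nonall --slf "$1" "$2"
}
```

With a hypothetical hosts.txt, `sweep hosts.txt` would answer "how many servers are up", and `run_everywhere hosts.txt uptime` would collect the same piece of information from all of them at once. Hosts that fail to resolve simply show up as down, which is also a cheap first hint when troubleshooting DNS.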