Optimizing long batch processes or ETL by using buff/cache properly III (full workflow)

In the two previous post we have seen how disk IO and network IO affects our ETLs. For both use cases we have seen several techniques that could be used to improve drastically performance and drive to an efficient resource usage:

  • Avoid IO disk at all.
  • Use buff/cache properly if IO disk couldn’t be avoided.
  • Optimize data download by choosing the right file format, use the Keep-Alive properly and parallelize network operations.

In this post we are going to put together network and processing operations to see the improvement in a complete workflow.

Continue reading

Share

Optimizing long batch processes or ETL by using buff/cache properly II (parallelizing network operations)

In the previous post I have focused in avoiding as much as possible IO on disk and if that was not possible using buff/cache as much as possible by grouping in time IO operations. This approach can make our ETL processes run X times faster. In the two examples the numbers where:

  • Avoiding IO at all was 11,3 times faster
  • Using buff/cache was almost 4 times faster

All the examples used a dataset already in the disk so no real network operation occurred. In this post I am going to focus on network operation using again GNU parallel.

Continue reading

Share

Spreading GNU Parallel by making a testimonial video

Several months ago I was asked to record a small video to Spread GNU Parallel. GNU Parallel is a fantastic tool, a Swiss army knife for process parallelization. With GNU Parallel you can:

Till that moment I had already written 2 post on my web page:

But making videos was a new world for me… Here you are:

Please do not apply any cat filter to it (vlc video.mp3|cat) 😛 😛 It is not funny! (April’s fool is coming)

Share

Optimizing long batch processes or ETL by using buff/cache properly

During the COVID-19 I have invested some of the “free time” given by the lock down to refresh some old topics like capacity planning and command line optimizations.

In 2011 I got my LPIC-3 and while studying for the previous LPIC-2 two of the topics were Capacity Planning and Predict Future Resource Needs. To refresh this knowledge I recently took Matthew Pearson’s Linux Capacity Planning course from the LinuxAcademy

My interest in Data Science and Business Intelligence started with a course I took where the main tool used was Pentaho mostly PDI (aka Kettle) for ETL jobs and Report Designer for reports automation. Then I continued with Waikato’s university WEKA courses and this path drove me to read Jeroen JanssensData Science at the Command Line book which I have recently re-read again. In his book, Jeroen uses Ole’s Tange GNU parallel a tool I have already written about in my A Quick and Neat 🙂 Orchestrator using GNU Parallel post

How are Linux Capacity Planning, ETL, command line and parallelization of jobs related you might wonder. Let’s dig into it

Continue reading

Share

A Quick and Neat :) Orchestrator using GNU Parallel

Sometimes you have to deal with servers that you don’t know anything about:

  • You are a short temp IT consultant with not previous knowledge on the environment.
  • The CMDB is out of order.
  • You are on a DR situation.
  • Or simply the main administrator is not there.

And you need:

  • Run commands in parallel
  • Get info from many servers at a time
  • Troubleshoot DNS problems
  • Check how many servers are up and running

On my systems I use two orchestrators: MCollective and SaltStack (configured automatically using puppet) that fulfill my needs. But let’s see a quick way to have an orchestrator in a rapid manner.

Continue reading

Share

How to configure the Comtrend’s HG532c ADSL router ARP table for (WOL) Wake On Lan from internet using expect

Several months ago I finally got the (WOL) Wake On Lan feature of my RTL8111/8168B NIC card working. The problem was that a new driver (other than the provided by Debian) and a special PCI configuration was needed.

The other problem I had to deal with was the ADSL Router (Comtrend HG532c, The one provided by the Spanish ISP Jazztel) configuration:

  • Open the required port: This was an easy one just opening the 7 a 9 port and forwarding them to the server we want to WOL from the internet
  • Make the router remember the server’s tuple MAC/IP address. That was easy too, but some manual work was needed as when router is restarted the ARP table is flushed.  🙁

In my current job I had to change recently some configuration and restart more than 600 IP phones. To perform such titanic task I created a quick and dirty script using expect. It worked like a charm and made me think about automatize the way I set the ARP table in my Comtrend HG532c ADSL router.

Continue reading

Share