Optimizing long batch processes or ETL by using buff/cache properly III (full workflow)

Posted on 2021-06-24 by Juan Sierra Pons

In the two previous posts we saw how disk IO and network IO affect our ETLs. For both use cases we saw several techniques that can drastically improve performance and lead to more efficient resource usage:

Avoid disk IO altogether.
Use buff/cache properly if disk IO cannot be avoided.
Optimize data downloads by choosing the right file format, using Keep-Alive properly and parallelizing network operations.
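
As a rough illustration of the first technique, avoiding disk IO, the sketch below decompresses and aggregates a gzip-compressed CSV entirely in memory instead of unpacking it to a temporary file first. It is written in Python purely for illustration; the sample data and the names `make_sample` and `sum_values` are hypothetical and are not taken from the posts.

```python
import gzip
import io


def make_sample(rows: int) -> bytes:
    """Build a small gzip-compressed CSV in memory (hypothetical sample data)."""
    text = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(rows))
    return gzip.compress(text.encode("utf-8"))


def sum_values(compressed: bytes) -> int:
    """Decompress as a stream and aggregate without creating a temporary file."""
    total = 0
    # Wrap the bytes in a file-like object so gzip reads from memory, not disk.
    with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as lines:
        next(lines)  # skip the header row
        for line in lines:
            _, value = line.rstrip("\n").split(",")
            total += int(value)
    return total


if __name__ == "__main__":
    print(sum_values(make_sample(1000)))
```

In a real ETL the compressed bytes would come from a download or a pipe rather than from `make_sample`; the point is only that no intermediate file is written between the download and the processing step.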

In this post we are going to put together network and processing operations to see the improvement in a complete workflow.

Continue reading →

Optimizing long batch processes or ETL by using buff/cache properly II (parallelizing network operations)

Posted on 2021-06-08 by Juan Sierra Pons

In the previous post I focused on avoiding disk IO as much as possible and, when that was not possible, on making the most of buff/cache by grouping IO operations together in time. This approach can make our ETL processes run several times faster. In the two examples the numbers were:

Avoiding IO altogether was 11.3 times faster
Using buff/cache was almost 4 times faster

All the examples used a dataset already on disk, so no real network operations occurred. In this post I am going to focus on network operations, again using GNU parallel.

Continue reading →

Optimizing long batch processes or ETL by using buff/cache properly

Posted on 2020-05-31 by Juan Sierra Pons

During the COVID-19 pandemic I invested some of the “free time” given by the lockdown in refreshing some old topics like capacity planning and command line optimizations.

In 2011 I got my LPIC-3, and while I was studying for the preceding LPIC-2, two of the topics were Capacity Planning and Predict Future Resource Needs. To refresh this knowledge I recently took Matthew Pearson’s Linux Capacity Planning course from LinuxAcademy.

My interest in Data Science and Business Intelligence started with a course I took where the main tool was Pentaho, mostly PDI (aka Kettle) for ETL jobs and Report Designer for report automation. Then I continued with the University of Waikato’s WEKA courses, and this path led me to read Jeroen Janssens’ book Data Science at the Command Line, which I have recently re-read. In his book, Jeroen uses Ole Tange’s GNU parallel, a tool I have already written about in my post A Quick and Neat :) Orchestrator using GNU Parallel.

How are Linux capacity planning, ETL, the command line and job parallelization related, you might wonder? Let’s dig into it.

Continue reading →

DevOps job interviews with old fashioned check list questions

Posted on 2015-09-15 by Juan Sierra Pons

During the last few weeks I have been interviewed for several DevOps positions. In two of them I had to fill in a skills check-list, and in the other one I was given an exercise to solve and send back by email. I think these check-list interviews are not good for DevOps positions, especially if the check-lists used are not kept up to date. Let’s see why…

Continue reading →

A Quick and Neat :) Orchestrator using GNU Parallel

Posted on 2015-05-17 by Juan Sierra Pons

Sometimes you have to deal with servers that you don’t know anything about:

You are a short-term IT consultant with no previous knowledge of the environment.
The CMDB is out of order.
You are in a disaster recovery (DR) situation.
Or simply the main administrator is not there.

And you need to:

Run commands in parallel
Get info from many servers at a time
Troubleshoot DNS problems
Check how many servers are up and running

On my systems I use two orchestrators, MCollective and SaltStack (configured automatically using Puppet), which fulfill my needs. But let’s see a quick way to get an orchestrator up and running.

Continue reading →

My first puppet module released juasiepo-knockd

Posted on 2013-11-19 by Juan Sierra Pons

Today I released my first Puppet module to the public:

juasiepo-knockd on GitHub
juasiepo-knockd on Puppet Labs’ Forge

It installs and configures knockd (port-knocking software).

Continue reading →

Creation of the Alicante Puppet Users Group

Posted on 2013-08-26 by Juan Sierra Pons

I had been toying for a while with the idea of setting up a Puppet users group in Alicante, though I don’t know whether there will be many of us…

Last week I sent an email to the Puppet users mailing list in case anyone was interested, and today I received an email from puppetlabs.com saying that if I had a Meetup group they would put a link to it on their website, so I decided to create a group on meetup.com.

So, officially, the Alicante Puppet Users Group has been created today.

So if you are interested in Puppet, DevOps, Data Center and Operations Automation and, basically, in doing things just once and letting the computers do the rest, this is your group.

I hope you sign up, and once there are a few of us we will hold the first meetup.

Greetings, Alicante puppeteers

El Sotanillo de Juan Sierra Pons

Linux, Open Source, Bash, Virtualization, Cloud, Puppet, DevOps, Blog, Travels, etc.

Category Archives: SysOps

