Optimizing long batch processes or ETL by using buff/cache properly III (full workflow)

In the two previous posts we saw how disk IO and network IO affect our ETLs. For both cases we covered several techniques that can drastically improve performance and lead to more efficient resource usage:

  • Avoid disk IO altogether.
  • Use buff/cache properly if disk IO cannot be avoided.
  • Optimize data downloads by choosing the right file format, using Keep-Alive properly and parallelizing network operations.
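As a rough illustration of the last point, here is a minimal sketch in Python. It is not the approach from these posts, which rely on GNU parallel: a standard-library thread pool is swapped in here. Each worker opens one HTTP connection and reuses it for all of its files (Keep-Alive), while several workers run in parallel. The names `split_paths`, `fetch_chunk` and `fetch_all`, and the plain-HTTP server they talk to, are assumptions made only for this example.

```python
import http.client
from concurrent.futures import ThreadPoolExecutor


def split_paths(paths, workers):
    """Distribute paths round-robin into at most `workers` non-empty chunks."""
    chunks = [paths[i::workers] for i in range(workers)]
    return [chunk for chunk in chunks if chunk]


def fetch_chunk(host, port, paths):
    """Download several paths over ONE persistent (Keep-Alive) connection."""
    conn = http.client.HTTPConnection(host, port, timeout=30)
    results = {}
    try:
        for path in paths:
            conn.request("GET", path)
            response = conn.getresponse()
            # The body must be read completely before the connection
            # can be reused for the next request.
            results[path] = (response.status, response.read())
    finally:
        conn.close()
    return results


def fetch_all(host, port, paths, workers=4):
    """Download all paths, one persistent connection per parallel worker."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = split_paths(paths, workers)
        for partial in pool.map(lambda c: fetch_chunk(host, port, c), chunks):
            results.update(partial)
    return results
```

With this shape, downloading 100 files with 4 workers opens 4 connections instead of 100, so the connection setup cost is paid once per worker rather than once per file.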

In this post we are going to put together network and processing operations to see the improvement in a complete workflow.

Optimizing long batch processes or ETL by using buff/cache properly II (parallelizing network operations)

In the previous post I focused on avoiding disk IO as much as possible and, where that was not possible, on making the most of buff/cache by grouping IO operations together in time. This approach can make our ETL processes run several times faster. In the two examples the numbers were:

  • Avoiding disk IO altogether was 11.3 times faster
  • Using buff/cache was almost 4 times faster

All the examples used a dataset that was already on disk, so no real network operations occurred. In this post I am going to focus on network operations, again using GNU parallel.

Optimizing long batch processes or ETL by using buff/cache properly

During the COVID-19 pandemic I have invested some of the “free time” given by the lockdown in refreshing some old topics like capacity planning and command line optimizations.

In 2011 I got my LPIC-3, and while studying for the previous LPIC-2, two of the topics were Capacity Planning and Predict Future Resource Needs. To refresh this knowledge I recently took Matthew Pearson’s Linux Capacity Planning course from Linux Academy.

My interest in Data Science and Business Intelligence started with a course I took where the main tool was Pentaho, mostly PDI (aka Kettle) for ETL jobs and Report Designer for report automation. I then continued with the University of Waikato’s WEKA courses, and this path led me to Jeroen Janssens’ book Data Science at the Command Line, which I have recently re-read. In his book, Jeroen uses Ole Tange’s GNU parallel, a tool I have already written about in my post A Quick and Neat 🙂 Orchestrator using GNU Parallel.

How are Linux capacity planning, ETL, the command line and job parallelization related, you might wonder? Let’s dig into it.

Debian Templates Disk Images Qemu/KVM for libvirt

A long time ago, in a galaxy far far away, when I started with OpenVZ, I followed this tutorial for Debian template creation. Now I am adapting it (using my own experience and this template-squeeze tutorial too) to Qemu/KVM disk images that can later be used directly or via libvirt.

This procedure tries to generalize the template. When working with cloned disk images, many elements need to be “generalized” before capturing a disk image and deploying it to multiple computers. Some of these elements include:

  1. ssh keys
  2. /etc/apt/sources.list

The more “generalized” a template is, the less manual work is needed after deploying it.
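As an illustration of the first element, the following is a minimal sketch in Python of removing the SSH host keys from a template, so that clones do not all share the same host identity. It is not the procedure from the tutorial: the function name `remove_ssh_host_keys` and the idea of working on a mounted image root are assumptions made for this example.

```python
from pathlib import Path


def remove_ssh_host_keys(image_root):
    """Delete etc/ssh/ssh_host_* under image_root; return the removed names.

    `image_root` is assumed to be the directory where the template's root
    filesystem is mounted (hypothetical, e.g. a mount point of the image).
    """
    ssh_dir = Path(image_root) / "etc" / "ssh"
    if not ssh_dir.is_dir():
        return []
    removed = []
    for key in sorted(ssh_dir.glob("ssh_host_*")):
        if key.is_file():
            key.unlink()
            removed.append(key.name)
    return removed
```

New host keys then have to be generated on each clone at first boot, for example with `ssh-keygen -A`; how that step is triggered is left out of this sketch.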

This method should also work with other virtualization systems (VMware, VirtualBox, etc.), since it is “virtualizer/hypervisor/emulator independent”: it focuses only on the disk image.

Installing Debian GNU/Linux with LVM+RAID1 in an emulated QEMU virtual machine

Have you ever wanted to experiment and practice with an operating system installed on software RAID + LVM (logical volumes) but had no spare machine to test on? Install a Debian LVM+RAID1 system in an emulated virtual machine using Qemu.

Installing Debian GNU/Linux with LVM+RAID1 from the installer (debian-installer)

With the old installation ISOs, we had to do a normal installation on one disk, add a second disk, configure software RAID1 and synchronize the two disks. Now we can do it all in one go. The new installer (debian-installer) lets us install a machine from scratch with software RAID1 and LVM logical volumes.

Qemu – i386 qcow image of Debian GNU/Linux 4.0 (etch) usable from qemu

I recently discovered the FreeOsZoo project. It is basically a repository of QEMU images of free operating systems. From the FreeOsZoo project page we can even try some of these QEMU images online before downloading lots of MB or even GB.

Since I was planning to write a couple of tutorials on applications, servers and networks using QEMU images and Debian, I decided to build a qcow image for qemu of my favorite Linux distribution, Debian, following the specifications on the FreeOsZoo project page.

Windows – Howto: Installing Qemu on Windows

Qemu is an open source operating system emulator that runs on both Windows and Linux. In this howto we will see how to install it on Windows. Emulating operating systems can be very useful for:

  • Running old programs that have no current equivalent. For example, specific programs written for MS-DOS environments that cannot be migrated, such as some accounting, inventory or stock applications.
  • Playing old games that do not run on our usual operating system: Windows 95 games, MS-DOS games, etc.
  • Testing operating systems and Live! distributions, learning, etc.
  • Whatever else we can think of ;P
