Optimizing long batch processes or ETL by using buff/cache properly III (full workflow)

In the two previous posts we saw how disk IO and network IO affect our ETLs. For both use cases we saw several techniques that can drastically improve performance and lead to efficient resource usage:

  • Avoid disk IO altogether.
  • Use buff/cache properly if disk IO cannot be avoided.
  • Optimize data download by choosing the right file format, using Keep-Alive properly and parallelizing network operations.

In this post we are going to put together network and processing operations to see the improvement in a complete workflow.

Optimizing long batch processes or ETL by using buff/cache properly II (parallelizing network operations)

In the previous post I focused on avoiding disk IO as much as possible and, where that was not possible, on making the most of buff/cache by grouping IO operations closely in time. This approach can make our ETL processes run several times faster. In the two examples the numbers were:

  • Avoiding IO altogether was 11.3 times faster
  • Using buff/cache was almost 4 times faster

All the examples used a dataset already on disk, so no real network operation occurred. In this post I am going to focus on network operations, again using GNU parallel.

Optimizing long batch processes or ETL by using buff/cache properly

During the COVID-19 lockdown I invested some of the “free time” it gave me in refreshing some old topics like capacity planning and command line optimizations.

In 2011 I got my LPIC-3, and while studying for the previous LPIC-2, two of the topics were Capacity Planning and Predict Future Resource Needs. To refresh this knowledge I recently took Matthew Pearson’s Linux Capacity Planning course from LinuxAcademy.

My interest in Data Science and Business Intelligence started with a course I took where the main tool was Pentaho, mostly PDI (aka Kettle) for ETL jobs and Report Designer for report automation. I then continued with the University of Waikato’s WEKA courses, and this path led me to Jeroen Janssens’ book Data Science at the Command Line, which I have recently re-read. In his book, Jeroen uses Ole Tange’s GNU parallel, a tool I have already written about in my post A Quick and Neat 🙂 Orchestrator using GNU Parallel.

How are Linux Capacity Planning, ETL, the command line and parallelization of jobs related, you might wonder? Let’s dig into it.

Backing up a cpanel hosting account

Since 2005 I have hosted this website with Bluehost, a cPanel-based hosting company, first on Joomla and more recently on WordPress.

Bluehost lets you download daily, weekly and monthly backups from the cPanel control panel, but manual intervention is needed:

  1. Log in to the control panel.
  2. Navigate to the backup page.
  3. Perform the backup.
  4. Download it to your local computer.

This is a manual, time-consuming task, and of course you must not forget to do it!

In this post I am going to show my automated method for backing up files and databases using:

  1. Crontab for automatic backups.
  2. Public/private keys for passwordless ssh connections.
  3. Rsync command for synchronizing directories between remote and local servers. This reduces bandwidth usage, because a file that has already been copied to the local server and has not changed is not transferred again.
  4. Mysqldump for dumping the MySQL databases to a local file.
  5. SpiderOak for data deduplication and remote backup.
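
As a rough illustration of how the first four pieces could fit together, here is a minimal sketch. It is not the script from the full post: the account, directory and database names are hypothetical placeholders, it assumes key-based ssh login is already set up, and it assumes the MySQL credentials live in ~/.my.cnf on the remote host. By default it runs in dry-run mode and only prints the commands.

```shell
#!/bin/sh
# Sketch only: every value below is a hypothetical placeholder.
REMOTE="user@example.com"                      # ssh account on the hosting server
REMOTE_DIR="public_html"                       # remote directory to mirror
DB_NAME="wordpress_db"                         # database to dump
LOCAL_DIR="${LOCAL_DIR:-$HOME/backups/site}"   # where the backup is stored locally
DRY_RUN="${DRY_RUN:-1}"                        # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

backup_site() {
    run mkdir -p "$LOCAL_DIR/files" "$LOCAL_DIR/db"
    # rsync over ssh: only new or changed files are transferred.
    run rsync -az -e ssh "$REMOTE:$REMOTE_DIR/" "$LOCAL_DIR/files/"
    # Dump the database on the remote side and store the output locally.
    # Assumes mysqldump can read its credentials from ~/.my.cnf there.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ ssh $REMOTE mysqldump --single-transaction $DB_NAME > $LOCAL_DIR/db/$DB_NAME.sql"
    else
        ssh "$REMOTE" "mysqldump --single-transaction $DB_NAME" > "$LOCAL_DIR/db/$DB_NAME.sql"
    fi
}

backup_site
```

Saved under a hypothetical path such as /home/user/bin/backup_site.sh, the script could be scheduled with a crontab entry like `30 3 * * * DRY_RUN=0 /home/user/bin/backup_site.sh`, which would run it every night at 03:30.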

Some prior knowledge is needed to understand how it works, but I have included some useful links to help. 🙂

How to train SpamAssassin on cPanel-based hosting

While reading the forums of my hosting provider, www.bluehost.com, I found a very interesting thread on how to train SpamAssassin, and after thinking it over for a while I wrote this script so that SpamAssassin learns from what users have marked as SPAM or NOT SPAM.

This means that a single script adds the “Mark as Spam” and “Not Spam” functionality offered by some of the best-known free webmail services (Gmail, Yahoo, etc.). And, of course, SpamAssassin learns from it for all the accounts of all the domains we host. That is, as long as the hosting company is based on cPanel.

How to find a server in a data center

If you often go down to the data center because you have to do something on a server (patching, swapping a tape, replacing a hard disk, etc.), at some point you will have been unable to find the server you were looking for.

With this little trick you will find it easily, to the rhythm of Axel Foley 😛

Linux – Accessing the RAE dictionary from the Linux shell console

You have probably found yourself without a Spanish dictionary at hand just when you needed one. With this simple script you can look up the Diccionario de la Real Academia Española from your favorite console.

Linux – Script to modify multiple files using a for loop and sed

Sometimes we find that we have to make the same change in multiple files.

For example, we have put a wrong path in all our .html files and have to make the same change in every one of them. That could mean editing 10 or 100 files.

Let's see how we can save ourselves all this work.
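
A minimal sketch of the idea, not necessarily the script from the full post: a for loop walks over the .html files of a directory and sed rewrites the wrong path in each one. The values of OLD_PATH and NEW_PATH and the function name fix_paths are hypothetical examples. Note that sed treats OLD_PATH as a regular expression, so special characters in it would need escaping.

```shell
#!/bin/sh
# Replace a wrong path in every .html file of a directory.
# OLD_PATH and NEW_PATH are hypothetical example values.
OLD_PATH="/old/path/"
NEW_PATH="/new/path/"

fix_paths() {
    dir="$1"
    for f in "$dir"/*.html; do
        [ -f "$f" ] || continue   # skip when the glob matches nothing
        # '|' is used as the sed delimiter because the paths contain '/'.
        # Writing to a temporary file avoids relying on the non-portable 'sed -i'.
        sed "s|$OLD_PATH|$NEW_PATH|g" "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done
}
```

Calling `fix_paths /var/www/site` (a hypothetical directory) would rewrite every .html file directly inside it and leave all other files untouched. Working on a copy of the directory first is a sensible precaution.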

Linux – Bash script to find duplicate files with different names in the same directory

Sometimes we have a directory full of duplicate files with different names.

For example, after some problem a maildir-style directory can end up with lots of duplicate files under different names: the same e-mail message several times.
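
One possible way to spot such duplicates, sketched here under the assumption that GNU coreutils and findutils are available (md5sum, `uniq -w` and `find -maxdepth` are GNU features), is to compare checksums instead of names. The function name find_duplicates is hypothetical and this is not necessarily the approach of the full post.

```shell
#!/bin/sh
# Print groups of files in one directory that have identical content.
find_duplicates() {
    dir="$1"
    # md5sum prints "<32-character hash>  <file name>" for each file.
    # sort brings equal hashes together; uniq compares only the first
    # 32 characters (the hash) and prints every repeated line, with a
    # blank line between groups of duplicates.
    find "$dir" -maxdepth 1 -type f -exec md5sum {} + \
        | sort \
        | uniq -w32 --all-repeated=separate
}
```

Running `find_duplicates ~/Maildir/cur` (a hypothetical path) would list each set of identical messages together, leaving the decision about which copies to delete to the user.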
