Sharpen your Data Engineering Toolkit

Set up your work environment.

Posted by Rahul Kumar on Sat 19 May 2018

Why talk about tools?

A lot of times in technology it's easy to lose sight of the end and get excited about the means. Don't get me wrong, some of the advances that have happened in technology have improved the state of the art like a step function.

However, the choice of tools is a function of the domain and the scope of the problem you are looking to solve. So it stands to reason that if you don't know what the available tools are capable of, you will end up using the wrong tool for the job.

The tools I describe below are the ones I use day to day. They may not be perfect for every data engineering problem, but you have to start somewhere; adapt the list to your own preferences.

For example, I use Linux as my primary OS, but you don't have to if you prefer another OS; most of the tools below are also available for Windows and macOS.

Must Have Tools

Linux (Ubuntu 18.04)

I prefer using Linux as my main OS for a few reasons. It's free to install and use, it's fast and has a smaller memory footprint compared to other operating systems, and most production systems run Linux, so it makes sense to develop on a Linux workstation. Another important factor is that a lot of DevOps tools are natively built for Linux. A good example is Docker, which I'll talk more about a little later.

Some add on tools I like using in Linux include:

  1. guake: this is a really good terminal program that's activated with the F12 key.
  2. terminator: this is another good terminal program that can be split into multiple horizontal and vertical tiles. Useful if you want to monitor multiple terminal windows simultaneously.

Python 3.6

Python 3.6 is installed by default on the latest Linux distributions, including Ubuntu 18.04. If you use Windows or macOS, you can install Python directly from Python.org.

We also need two additional packages:

  1. python3-venv: for creating Python virtual environments to manage project dependencies.
  2. python3-pip: the Python package manager that installs packages from PyPI.
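With both packages installed, a typical per-project setup looks something like the sketch below (the `venv` directory name is just a convention; installing your actual dependencies happens with `pip install <package>` once the environment is active):

```shell
# Create an isolated virtual environment in the ./venv directory
python3 -m venv venv

# Activate it for the current shell session
. venv/bin/activate

# Install project dependencies here with: pip install <package>

# Record the environment's pinned versions for reproducibility
pip freeze > requirements.txt

# Leave the environment when you're done
deactivate
```

Keeping a `requirements.txt` per project lets anyone recreate the same environment with `pip install -r requirements.txt`.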

PostgreSQL (commonly called Postgres)

I have to admit, I'm a big fan of Postgres. It's an open-source, full-featured, production-ready relational database. It has a small memory footprint, supports a wide variety of data types and objects, and can be extended through extensions. For example, Citus Data has created an open-source extension that turns Postgres into a distributed database.
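From Python, Postgres is usually accessed through a DB-API 2.0 driver such as psycopg2. The interface is the same across DB-API drivers, so to keep this sketch self-contained it uses the standard library's sqlite3 module; with psycopg2 you would swap the connect() call (and use %s placeholders) and the rest stays essentially the same. The table and column names here are made up for illustration.

```python
import sqlite3  # stand-in for psycopg2; both follow the DB-API 2.0 interface

# With psycopg2 this would be something like:
#   conn = psycopg2.connect("dbname=mydb user=me")
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a small table and insert rows using parameterized queries
cur.execute("CREATE TABLE events (id INTEGER, kind TEXT)")
cur.executemany(
    "INSERT INTO events (id, kind) VALUES (?, ?)",  # psycopg2 uses %s placeholders
    [(1, "click"), (2, "view"), (3, "click")],
)
conn.commit()

# Run an aggregate query, just as you would against Postgres
cur.execute("SELECT kind, COUNT(*) FROM events GROUP BY kind ORDER BY kind")
rows = cur.fetchall()
print(rows)  # [('click', 2), ('view', 1)]
conn.close()
```

Parameterized queries (the `?`/`%s` placeholders) are worth making a habit of early: they prevent SQL injection and let the driver handle type conversion.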

Apache Spark

Apache Spark is a distributed big data processing engine. It can run standalone or on top of cluster managers such as Hadoop YARN. Because Spark processes data in memory, it is typically much faster than Hadoop MapReduce. Spark ships with multiple libraries that make it easy to build big data pipelines in Python and Scala, and it can handle both batch and streaming workloads. All in all, it's a general-purpose big data processing engine and an essential tool in any Data Engineer's toolbelt.
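Spark's core API is built around transformations like flatMap, map, and reduceByKey over distributed collections. A real Spark job needs a Spark installation (at minimum a local one), but the programming model itself can be sketched in plain Python; the toy word count below mirrors what `rdd.flatMap(...).map(...).reduceByKey(...)` expresses in PySpark, with Spark distributing the same steps across a cluster:

```python
from collections import Counter
from itertools import chain

lines = [
    "spark processes data in memory",
    "spark supports batch and streaming",
]

# flatMap: split each line into words, flattening into one stream
words = chain.from_iterable(line.split() for line in lines)

# map + reduceByKey: count occurrences per word
counts = Counter(words)

print(counts["spark"])  # 2
```

The difference, of course, is that Spark evaluates these transformations lazily and partitions the data across machines, so the same shape of code scales far beyond what fits in one process's memory.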

Scala

Scala is a relatively new language that compiles to JVM bytecode. Apache Spark is written in Scala and provides a Scala API for creating data pipeline jobs. In addition to Python, Scala is a language worth investing your time in.

Editors/IDEs

  1. VS Code: I have used various IDEs for Python (Atom, PyCharm, etc.), but none compare to VS Code in popularity. It is intuitive to use and has lots of extensions available.
  2. vim/nano: For small code changes, it's not always necessary to open a full editor, and you might have to ssh into a remote server with no IDE installed. In such cases, it's useful to know a terminal-based editor like nano or vim. Vim has a steep learning curve but is popular with a lot of developers.

Good to Have Tools

In addition to the must-have tools above, there is one good-to-have tool for your development workflow. It's not strictly necessary, but it can streamline your work.

Docker

Docker containers separate your application from the infrastructure it runs on. Containers abstract away the operating system and package only your application and its dependencies. Once you dockerize your application into an image, you can create a container based on that image and run it anywhere: your laptop, your workstation, or a remote server, without having to install anything manually.

You can learn more about Docker by installing it for your OS and going through the official tutorial. It might take a while to become competent with Docker and adapt it to your development workflow, but it's well worth it in the long run.
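To give a taste of what "dockerizing" looks like, here is a minimal, hypothetical Dockerfile for a Python application (the file names `requirements.txt` and `app.py` are placeholders for your own project files):

```dockerfile
# Start from an official Python base image
FROM python:3.6-slim

WORKDIR /app

# Install dependencies first so Docker can cache this layer
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy the application code and define how to run it
COPY . .
CMD ["python", "app.py"]
```

You would then build the image with `docker build -t myapp .` and run it with `docker run myapp`, on any machine with Docker installed.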

And that's it. These are the basic tools I use. This list is a work in progress, and I will continue to add more tools as my explorations go deeper. Once I add a new tool, I will update this post to reflect the change.

tags: tools