We have built a complete ETL pipeline and data warehouse using AWS Glue and AWS S3 services for EdCast. In this blog, we will share the key learnings from that experience.
Glue Development Environment
Glue ETL scripts can be developed and tested in several ways. The more prominent options are:
Using a development endpoint and notebook (AWS hosted)
Using a development endpoint and a Zeppelin notebook server in the local environment
Local development using the Glue ETL library
Using a development endpoint and notebooks (remote)
An AWS Glue development endpoint is a managed (paid) Glue environment for developing and testing ETL scripts. It includes Apache Spark and the Glue libraries, along with network configuration that lets you access the environment securely from a Jupyter notebook.
AWS supports launching an EC2 machine with a Jupyter Notebook server. The notebook can be used to interactively author and test the ETL scripts that will later run as Glue jobs.
For more information on development endpoints, see the AWS Glue documentation.
Pros
Easy to launch and use
Since the development endpoint is similar to the actual AWS Glue environment, it's easy to develop and test in a production-like environment.
Cons
Expensive: about $1500 per month for the dev endpoint (as of 03/22/2020), plus the cost of the EC2 machine that hosts the notebook server
Needs an internet connection to the notebook server for development
No easy way to write unit test cases
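To see where a figure of roughly $1500 per month can come from, here is a minimal sketch of the arithmetic. It assumes the endpoint runs around the clock with 5 DPUs at $0.44 per DPU-hour; both numbers are assumptions for illustration and should be checked against current AWS Glue pricing for your region. The function name is hypothetical.

```python
def monthly_endpoint_cost(dpus: int,
                          rate_per_dpu_hour: float,
                          hours: float = 24 * 30) -> float:
    """Estimate the cost of keeping a dev endpoint running.

    dpus and rate_per_dpu_hour are assumptions supplied by the caller,
    not values read from AWS. hours defaults to a 30-day month.
    """
    return round(dpus * rate_per_dpu_hour * hours, 2)


if __name__ == "__main__":
    # Assumed values: 5 DPUs at $0.44 per DPU-hour, running 24x7.
    print(monthly_endpoint_cost(dpus=5, rate_per_dpu_hour=0.44))
```

Under these assumptions the estimate comes to about $1584 for a 30-day month, which is in line with the figure quoted above. The EC2 cost of the notebook server is not included.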
Using a development endpoint and a Zeppelin notebook in the local environment
This option is the same as the one above, except that the Zeppelin notebook server runs in the local environment and connects to the development endpoint through an SSH tunnel. To use this option, the development endpoint must be updated with user-specific SSH public keys (RSA). This can be done through the "Rotate SSH Keys" option on the dev endpoint home page.
The Docker version of Zeppelin 0.8.2 doesn't work with the Glue development endpoint due to some bugs, so we installed the Zeppelin binary distribution locally and used that instead.
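The SSH tunnel mentioned above can be sketched as a small helper that assembles the `ssh` port-forwarding command. This is an illustration, not our exact setup: the helper name, the key path and the endpoint host are hypothetical placeholders, and the forwarded address `169.254.76.1:9007` and the `glue` user follow the AWS documentation for Zeppelin as we recall it, so verify them before use.

```python
import shlex
from typing import List


def build_tunnel_command(private_key_path: str,
                         endpoint_dns: str,
                         local_port: int = 9007,
                         remote: str = "169.254.76.1:9007") -> List[str]:
    """Build an ssh command that forwards a local port to the dev endpoint.

    The default port and remote address are assumptions taken from the
    AWS Zeppelin setup instructions; adjust them if your setup differs.
    """
    return [
        "ssh",
        "-i", private_key_path,            # private half of the rotated RSA key
        "-NTL", f"{local_port}:{remote}",  # no remote command, no TTY, forward port
        f"glue@{endpoint_dns}",
    ]


if __name__ == "__main__":
    # Placeholder values; substitute your own key file and endpoint address.
    command = build_tunnel_command("/path/to/private_key",
                                   "dev-endpoint-public-dns")
    print(shlex.join(command))
```

The printed command can be run in a terminal and left open while the local Zeppelin server is in use. Keeping the command as a list also makes it easy to hand to `subprocess.run` if you prefer to start the tunnel from a script.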
Pros
No need to have a separate notebook server
Easy to use on a local machine, although we observed a flaky UI when the network connection was poor
Cons
Still expensive: the dev endpoint costs about $1500 per month
Slower development, as the dev endpoint runs remotely
No easy way to write unit test cases
Local development using the Glue ETL library
AWS has recently released the AWS Glue libraries, which can be used to set up a local development environment. This helps integrate Glue ETL jobs with the Maven build system for building and testing.
ETL development can be done using a Zeppelin server, or even PyCharm (Professional 2019.3) or Visual Studio Code. We write our ETL scripts in PySpark, so we use PyCharm to develop them. PyCharm lets us run and debug the job scripts locally, and also debug issues remotely.
Pros
Easy to use and faster development and testing
Cheaper
Unit testing can be done
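As an illustration of the unit testing point, one approach is to keep record-level transformation logic in plain Python functions that can be tested without a Spark session. The sketch below is hypothetical: the function, the field names and the cleaning rules are invented for the example and are not taken from our pipeline.

```python
from typing import Any, Dict


def clean_user_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a cleaned copy of a raw record (hypothetical schema)."""
    cleaned = dict(record)  # copy so the caller's input is not mutated
    # Trim and lower-case the email; a missing value becomes an empty string.
    cleaned["email"] = (record.get("email") or "").strip().lower()
    # Treat a missing or empty country as unknown.
    cleaned["country"] = (record.get("country") or "unknown").upper()
    return cleaned


def test_clean_user_record_normalizes_fields() -> None:
    raw = {"id": 1, "email": "  Jane@Example.COM ", "country": None}
    result = clean_user_record(raw)
    assert result == {"id": 1,
                      "email": "jane@example.com",
                      "country": "UNKNOWN"}
    # The input record must be left untouched.
    assert raw["email"] == "  Jane@Example.COM "
```

A test runner such as pytest can pick up the `test_` function as part of a local build. In a Glue job, a function like this could then be applied to each record, for example through a map over the data, while the Spark-specific wiring stays thin.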
For more information on the steps to set up and run Glue jobs locally, see the AWS documentation on the Glue ETL library.
In the next blog, we will explain the steps required to set up PyCharm with the Glue ETL library for local debugging.