AWS Glue Tutorial for Beginners [NEW 2024 - FULL COURSE]

Johnny Chivers

Overview

This tutorial provides a comprehensive introduction to AWS Glue, a fully managed ETL (Extract, Transform, Load) service. It covers the core components of Glue, including the Data Catalog for metadata management, and demonstrates how to catalog data with crawlers and build ETL jobs with the visual interface. The video also touches on data quality checks, scheduling mechanisms such as triggers and workflows, and introduces AWS Glue DataBrew for visual data preparation. It is designed for beginners who want to understand and use AWS Glue for data integration and transformation tasks.

Chapters

AWS Glue is a fully managed, serverless ETL service that handles infrastructure management, allowing users to focus on data transformation.
It supports both Apache Spark and Python for ETL processing.
A key component is the Glue Data Catalog, a metadata repository for storing information about data sources and targets.
Glue jobs can be scheduled using its flexible scheduler or external event triggers.

Understanding what AWS Glue is and its core value proposition helps learners decide when to use it for their data integration needs, especially in a serverless environment.

The service extracts data from a source, transforms it via a script running on a serverless engine, and loads it into a target, with metadata managed by the Glue Data Catalog.

The tutorial requires downloading data and setup scripts from a provided GitHub repository.
CloudFormation scripts are used to automate the setup of necessary AWS resources, including S3 buckets and IAM roles.
Specific folder structures (raw data, processed data, script location, temp directory, Athena) need to be created within the S3 bucket.
Sample CSV data for customers and orders must be uploaded into the designated 'raw data' folder in S3.

Proper setup ensures that all necessary permissions and data are in place, allowing for a smooth and effective hands-on experience with AWS Glue services.

Using a CloudFormation template (setup-code.yaml) to create an S3 bucket named 'AWS glue course-Johnny chivers' and an IAM role named 'AWS glue course'.
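
The folder layout described above could also be scripted instead of created by hand in the console. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured. The prefix names are guesses modeled on the folders listed above, and the bucket name is whatever the CloudFormation stack actually created.

```python
# Hypothetical prefix names modeled on the folders listed above;
# adjust them to match the names used in the course repository.
FOLDERS = ["raw_data", "processed_data", "script_location", "temp_dir", "athena"]


def folder_keys(folders=FOLDERS):
    """Return the zero-byte marker keys that the S3 console shows as folders."""
    return [f"{name.strip('/')}/" for name in folders]


def upload_key(folder, table, filename):
    """Build the object key for a CSV stored under <folder>/<table>/."""
    return f"{folder.strip('/')}/{table}/{filename}"


def create_folders(bucket):
    """Create the folder markers in an existing bucket (makes real AWS calls)."""
    import boto3  # imported lazily so the helpers above work without boto3

    s3 = boto3.client("s3")
    for key in folder_keys():
        s3.put_object(Bucket=bucket, Key=key)


print(folder_keys())
print(upload_key("raw_data", "customers", "customers.csv"))
```

Keeping each table's files under its own sub-prefix (for example one for customers and one for orders) lets a crawler or a manual table definition point at exactly one dataset.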

The Glue Data Catalog is a persistent metastore that stores metadata (location, schema, data types, classification) about data.
It does not move or store the actual data; it only stores references and information on how to access it.
There is one Glue Data Catalog per AWS account in each region, and access can be controlled using IAM policies.
Databases in Glue are logical groupings of associated table definitions.

Grasping the concept of the Data Catalog as a metadata repository is crucial, as it underpins how Glue locates and understands your data without physically moving it.

Registering a MySQL database in the Glue Data Catalog stores its location and schema, but the data remains in the original MySQL database.
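
Controlling catalog access with IAM comes down to attaching a policy that names Glue actions and catalog resources. The sketch below builds a read-only policy document for a single database. The region, account ID, and database name are placeholders, not values from the tutorial.

```python
import json


def read_only_catalog_policy(region, account_id, database):
    """Build an IAM policy document granting read access to one Glue database."""
    base = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:GetPartitions",
                ],
                # Glue checks the catalog, the database, and the tables,
                # so all three resource levels are listed.
                "Resource": [
                    f"{base}:catalog",
                    f"{base}:database/{database}",
                    f"{base}:table/{database}/*",
                ],
            }
        ],
    }


# Placeholder values for illustration only.
policy = read_only_catalog_policy("eu-west-1", "123456789012", "example_db")
print(json.dumps(policy, indent=2))
```

Because the policy only covers metadata, a reader of the underlying data would still need separate permissions on the data store itself, such as S3 read access.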

Tables in Glue are metadata definitions pointing to data residing in its original store.
AWS Glue Crawlers are programs that automatically discover data and populate the Data Catalog, reducing manual effort.
Alternatively, tables can be added manually by defining their schema, data location, and format.
Crawlers are efficient for discovering many tables, while manual creation is useful for specific definitions or learning purposes.

Understanding both manual table creation and the use of crawlers provides flexibility in how you catalog your data sources, catering to different scenarios and data volumes.

Manually creating a 'customers_raw' table by defining its schema and S3 path, and then using a crawler to automatically create an 'orders' table by pointing it to the S3 'orders' folder.
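
Both routes shown in the example above can also be driven through the Glue API. This sketch builds the request for a manually defined CSV table and the request for a crawler, then sends them with boto3. The column names, database name, role, and S3 paths are hypothetical, and the table parameters assume a comma-delimited file with a single header row.

```python
def csv_table_input(name, s3_location, columns):
    """Build the TableInput dict that glue.create_table expects for a CSV table."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        # Assumes one header row; drop skip.header.line.count if there is none.
        "Parameters": {"classification": "csv", "skip.header.line.count": "1"},
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    }


def crawler_request(name, role, database, s3_path):
    """Build glue.create_crawler arguments for a single S3 target."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def register(database, table_input, crawler):
    """Create the manual table, then create and start the crawler (real AWS calls)."""
    import boto3  # imported lazily so the builders above work without boto3

    glue = boto3.client("glue")
    glue.create_table(DatabaseName=database, TableInput=table_input)
    glue.create_crawler(**crawler)
    glue.start_crawler(Name=crawler["Name"])


# Hypothetical columns and paths for illustration.
customers = csv_table_input(
    "customers_raw",
    "s3://example-bucket/raw_data/customers/",
    [("customer_id", "bigint"), ("first_name", "string"), ("last_name", "string")],
)
orders_crawler = crawler_request(
    "orders-crawler", "example-glue-role", "example_db", "s3://example-bucket/raw_data/orders/"
)
```

The manual route makes every schema decision explicit, while the crawler infers column names and types by sampling the files under the target path.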

Partitions are logical folders on S3, mapped from table columns, used to speed up query performance by filtering data based on partition values (e.g., year, month, day).
Glue Connections store connection properties (like connection strings, credentials) for accessing various data stores securely.
The AWS Glue ETL engine is based on Apache Spark, designed for big data workloads, and also supports Python.
DPUs (Data Processing Units) are the unit of compute in Glue; sizing them correctly is vital for job performance and cost-efficiency.
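
To make the cost side of DPU sizing concrete: a standard DPU provides 4 vCPUs and 16 GB of memory, and jobs are charged by DPU-hours consumed. The arithmetic can be sketched as below. The price passed in is an illustrative figure, not a quoted rate, and the sketch ignores any minimum billing duration.

```python
def dpu_hours(dpus, runtime_seconds):
    """DPU-hours consumed by a job: capacity multiplied by runtime in hours."""
    return dpus * runtime_seconds / 3600


def job_cost(dpus, runtime_seconds, price_per_dpu_hour):
    """Estimated charge for one run at the given price per DPU-hour."""
    return dpu_hours(dpus, runtime_seconds) * price_per_dpu_hour


# 10 DPUs for 30 minutes versus 20 DPUs for 15 minutes: same DPU-hours,
# so doubling capacity is cost-neutral only if the runtime really halves.
ILLUSTRATIVE_PRICE = 0.44  # assumption: check current regional pricing
small = job_cost(10, 30 * 60, ILLUSTRATIVE_PRICE)
large = job_cost(20, 15 * 60, ILLUSTRATIVE_PRICE)
print(round(small, 2), round(large, 2))
```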

These concepts are fundamental to optimizing ETL job performance, managing data access, and understanding the underlying processing power and cost of AWS Glue.

Partitioning sales data by year, month, and day creates corresponding folder-like key prefixes in S3, allowing Glue to quickly locate data for a specific date range without scanning the entire dataset.
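
The year/month/day layout is usually written in the Hive-style `key=value` form, which crawlers recognize as partition columns. The sketch below builds such prefixes and shows how a filter on partition values narrows the set of locations to read. The bucket and dataset names are placeholders.

```python
from datetime import date


def partition_prefix(base, day):
    """Build a Hive-style partition prefix such as .../year=2024/month=01/day=15/."""
    return f"{base.rstrip('/')}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"


def partition_values(prefix):
    """Parse the key=value path segments of a prefix back into a dict."""
    return dict(seg.split("=", 1) for seg in prefix.split("/") if "=" in seg)


def prune(prefixes, **wanted):
    """Keep only prefixes whose partition values match every requested value."""
    return [
        p for p in prefixes
        if all(partition_values(p).get(k) == v for k, v in wanted.items())
    ]


base = "s3://example-bucket/processed_data/sales"  # placeholder location
prefixes = [
    partition_prefix(base, date(2023, 12, 31)),
    partition_prefix(base, date(2024, 1, 15)),
    partition_prefix(base, date(2024, 2, 1)),
]
january = prune(prefixes, year="2024", month="01")
print(january)
```

A query engine applies the same idea: a filter on `year` and `month` rules out whole prefixes before any file is opened.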

Visual ETL jobs in AWS Glue allow users to build data transformation pipelines using a drag-and-drop interface without writing extensive code.
Jobs can extract data from sources like the Glue Data Catalog, apply transformations (e.g., adding a timestamp), and load data into targets like S3.
Data can be transformed into formats like Parquet and partitioned for efficient querying.
The visual job creates a script in the background, offering transparency into the generated code.

Visual ETL empowers users, especially those less comfortable with coding, to create powerful data processing pipelines, making ETL more accessible.

Creating a 'processed customers' job that reads the 'customers_raw' table, adds a 'processed_timestamp' column, and saves the output as partitioned Parquet files in the 'processed_data' S3 location, registering a new table 'customers_process' in the Data Catalog.
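
The script behind a visual job like this one is ordinary PySpark built on the Glue libraries. What follows is a hand-written approximation of such a job, not the exact code the visual editor emits. The database name, output path, and partition column are hypothetical, and the `awsglue` modules are only available inside the Glue runtime, so the job body runs only when Glue supplies the `--JOB_NAME` argument.

```python
import sys

# Hypothetical names: substitute the database, path, and partition column you use.
DATABASE = "example_db"
OUTPUT_PATH = "s3://example-bucket/processed_data/customers/"
PARTITION_KEYS = ["country"]


def sink_kwargs(path, partition_keys):
    """Arguments for GlueContext.getSink that write to S3 and update the catalog."""
    return {
        "connection_type": "s3",
        "path": path,
        "enableUpdateCatalog": True,
        "updateBehavior": "UPDATE_IN_DATABASE",
        "partitionKeys": list(partition_keys),
    }


def main():
    # These modules exist only inside the Glue Spark runtime.
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from pyspark.sql.functions import current_timestamp

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Extract: read the table definition registered in the Data Catalog.
    source = glue_context.create_dynamic_frame.from_catalog(
        database=DATABASE, table_name="customers_raw"
    )

    # Transform: add the processing timestamp column.
    stamped = source.toDF().withColumn("processed_timestamp", current_timestamp())
    frame = DynamicFrame.fromDF(stamped, glue_context, "stamped")

    # Load: write partitioned Parquet and register the output table.
    sink = glue_context.getSink(**sink_kwargs(OUTPUT_PATH, PARTITION_KEYS))
    sink.setFormat("glueparquet")
    sink.setCatalogInfo(catalogDatabase=DATABASE, catalogTableName="customers_process")
    sink.writeFrame(frame)
    job.commit()


# Glue passes --JOB_NAME on the command line; elsewhere the job body is skipped.
if "--JOB_NAME" in sys.argv:
    main()
```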

AWS Glue Data Quality, a newer feature, helps monitor data quality by defining rules using the Data Quality Definition Language (DQDL).
Recommended rules can be generated by Glue based on data analysis, which can then be customized.
AWS Glue Triggers initiate ETL jobs or crawlers based on schedules, events, or on-demand execution.
AWS Glue Workflows allow for the orchestration of multiple crawlers and ETL jobs, defining dependencies and creating complex data processing sequences.

Implementing data quality checks and using scheduling/workflow tools ensures data reliability and automates complex data pipelines, crucial for production environments.

Using recommended rules to check if 'sales_order_id' in the 'orders' table is not null and then setting up a trigger to run an ETL job every day at 2 AM.
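
The example above maps onto two small pieces of configuration: a DQDL ruleset and a scheduled trigger. In this sketch, `IsComplete` is the DQDL rule type that checks a column for nulls, and Glue schedules use a six-field cron expression evaluated in UTC. The ruleset, trigger, and job names are hypothetical.

```python
def dqdl_ruleset(rules):
    """Join individual DQDL rules into a ruleset string."""
    return "Rules = [\n    " + ",\n    ".join(rules) + "\n]"


def daily_trigger_request(name, job_name, hour_utc):
    """Build glue.create_trigger arguments for a once-a-day schedule (UTC)."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        # Fields: minutes, hours, day-of-month, month, day-of-week, year.
        "Schedule": f"cron(0 {hour_utc} * * ? *)",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


# Hypothetical names for illustration.
ruleset = dqdl_ruleset(['IsComplete "sales_order_id"'])
trigger = daily_trigger_request("daily-2am", "example-etl-job", 2)


def apply(database, table):
    """Register the ruleset against a catalog table and create the trigger."""
    import boto3  # imported lazily so the builders above work without boto3

    glue = boto3.client("glue")
    glue.create_data_quality_ruleset(
        Name="orders-checks",
        Ruleset=ruleset,
        TargetTable={"TableName": table, "DatabaseName": database},
    )
    glue.create_trigger(**trigger)


print(ruleset)
print(trigger["Schedule"])
```

Because the schedule is in UTC, a 2 AM run in local time needs the hour adjusted for the time zone offset.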

AWS Glue DataBrew is a separate service for visual data preparation, designed for non-technical users and data analysts.
It offers an Excel-like interface to clean, normalize, and transform data without coding.
DataBrew stores transformations as 'recipes' that can be applied to datasets repeatedly.
While it is useful for exploration and for non-technical users, the speaker cautions against using it for production data engineering pipelines, citing CI/CD best practices.

DataBrew provides an accessible entry point for data preparation, enabling a wider range of users to clean and shape data before it enters more formal ETL processes.

Using DataBrew to filter a dataset to include only rows where a 'resolution' column contains '66', and then removing duplicate values from the 'assembly_session_id' column, saving this as a reusable recipe.
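
The idea of a recipe, an ordered list of steps that can be replayed on any dataset with the same columns, can be illustrated in plain Python. This is an analogue of the two steps in the example above, not DataBrew's own recipe format or API, and the sample rows are invented for illustration.

```python
def keep_where_contains(column, text):
    """Recipe step: keep rows whose column value contains the given text."""
    def step(rows):
        return [r for r in rows if text in str(r.get(column, ""))]
    return step


def drop_duplicates_on(column):
    """Recipe step: keep only the first row seen for each value of the column."""
    def step(rows):
        seen, kept = set(), []
        for r in rows:
            if r[column] not in seen:
                seen.add(r[column])
                kept.append(r)
        return kept
    return step


def apply_recipe(recipe, rows):
    """Run every step in order, feeding each step's output to the next."""
    for step in recipe:
        rows = step(rows)
    return rows


recipe = [
    keep_where_contains("resolution", "66"),
    drop_duplicates_on("assembly_session_id"),
]

# Invented sample rows; only the column names come from the example above.
sample = [
    {"assembly_session_id": 1, "resolution": "R/66/1"},
    {"assembly_session_id": 1, "resolution": "R/66/2"},
    {"assembly_session_id": 2, "resolution": "R/65/9"},
    {"assembly_session_id": 3, "resolution": "R/66/4"},
]
cleaned = apply_recipe(recipe, sample)
print(cleaned)
```

Since the recipe is just data plus order, the same list can be applied to a new extract without redoing the steps by hand, which is the property the summary attributes to DataBrew recipes.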

Key takeaways

1. AWS Glue is a serverless ETL service that simplifies infrastructure management for data processing.
2. The Glue Data Catalog acts as a central metadata repository, crucial for understanding and accessing data without moving it.
3. Crawlers automate the process of populating the Data Catalog, significantly reducing manual effort for data discovery.
4. Data partitioning in S3 is a key technique for optimizing query performance in Glue ETL jobs.
5. Visual ETL jobs provide a no-code or low-code approach to building data transformation pipelines.
6. Scheduling with triggers and orchestrating complex processes with workflows are essential for automating data pipelines.
7. AWS Glue DataBrew offers a user-friendly interface for visual data preparation, suitable for exploration and non-technical users.

Key terms

AWS Glue, ETL, Glue Data Catalog, Metadata, Crawler, Database (Glue), Table (Glue), Partition, Connection (Glue), DPU (Data Processing Unit), Visual ETL, Trigger, Workflow, DataBrew, Data Quality

Test your understanding

1. What is the primary benefit of using AWS Glue as a serverless ETL service?
2. How does the AWS Glue Data Catalog differ from a traditional database in terms of data storage?
3. Explain the role of an AWS Glue Crawler in the ETL process.
4. What is data partitioning in the context of AWS Glue and S3, and why is it important?
5. How can you build an ETL job in AWS Glue without writing code?
6. What is the purpose of AWS Glue Triggers and Workflows?