Data Engineering Essentials using SQL, Python, and PySpark

As part of this course, you will learn all the Data Engineering Essentials related to building Data Pipelines using SQL and Python, along with Spark SQL on Hadoop and Hive, as well as PySpark Data Frame APIs. You will also understand the development and deployment lifecycle of Python applications using Docker as well as PySpark on multi-node clusters. You will also gain basic knowledge of reviewing Spark Jobs using the Spark UI.

About Data Engineering

Data Engineering is, at its core, processing data to meet downstream needs. As part of Data Engineering, we build different kinds of pipelines, such as Batch Pipelines and Streaming Pipelines. All roles related to Data Processing are consolidated under Data Engineering; conventionally, they are known as ETL Development, Data Warehouse Development, etc.

Here are some of the challenges learners face while acquiring key Data Engineering Skills such as Python, SQL, and PySpark.

  • Having an appropriate environment with Apache Hadoop, Apache Spark, Apache Hive, etc working together.

  • Good quality content with proper support.

  • Enough tasks and exercises for practice

This course is designed to address these key challenges for professionals at all levels to acquire the required Data Engineering Skills (Python, SQL, and Apache Spark).

To make sure you spend time learning rather than struggling with technical challenges, here is what we have done.

  • Training using an interactive environment. You will get 2 weeks of lab access to begin with. If you like the environment and acknowledge it by providing ratings and feedback, the lab access will be extended by an additional 6 weeks (8 weeks in total). Feel free to send an email to [email protected] to get complimentary lab access. Also, if your employer provides a multi-node environment, we will help you set up the material for practice as part of a live session. On top of Q&A Support, we also provide the required support via live sessions.

  • Make sure you have a system with the right configuration and quickly set up a lab using Docker with all the required Python, SQL, PySpark, and Spark SQL material. It will address a lot of pain points related to networking, database integration, etc. Feel free to reach out to us via Udemy Q&A in case you get stuck while setting up the environment.

  • You will start with foundational skills such as Python as well as SQL using a Jupyter-based environment. Most of the lectures include quite a few tasks, and at the end of each module there are enough exercises or practice tests to evaluate the skills taught.

  • Once you are comfortable with programming using Python and SQL, you will learn how to quickly set up and access a Single Node Hadoop and Spark Cluster.

  • The content is streamlined in such a way that you use learner-friendly interfaces such as Jupyter Lab to practice it.

If you end up signing up for the course, do not forget to rate us 5* if you like the content. If not, feel free to reach out to us and we will address your concerns.

Highlights of this course

Here are some of the highlights of this Data Engineering course using technologies such as Python, SQL, Hadoop, Spark, etc.

  • The course is designed by a veteran (Durga Gadiraju) with 20+ years of experience, most of it around data. He has more than a decade of Data Engineering as well as Big Data experience with several certifications. He has a history of training hundreds of thousands of IT professionals in Data Engineering as well as Big Data.

  • Simplified setup of all the key tools to learn Data Engineering or Big Data such as Hadoop, Spark, Hive, etc.

  • Dedicated support, with 100% of questions answered over the past few months.

  • Tons of material with real-world experiences and Data Sets. The material is made available both in the Git repository and in the lab which you are going to set up.

  • Complimentary Lab Access for 2 Weeks, which can be extended to 8 Weeks.

  • 30-Day Money-Back Guarantee.

Content Details

As part of this course, you will be learning Data Engineering Essentials such as SQL and Programming using Python and Apache Spark. Here is the detailed agenda for the course.

  • Data Engineering Labs - Python and SQL

You will start by setting up self-supported Data Engineering Labs on Cloud9 or on your Mac or PC so that you can learn the key skills related to Data Engineering with a lot of practice, leveraging the tasks and exercises provided by us. As you pass the sections related to SQL and Python, you will also be guided to set up the Hadoop and Spark Lab.

  1. Provision AWS Cloud9 Instance (in case your Mac or PC does not have enough capacity)

  2. Set up Docker Compose to start the containers to learn Python and SQL (using Postgres)

  3. Access the material via Jupyter Lab environment setup using Docker and learn via hands-on practice.

Once the environment is set up, the material will be directly accessible.

  • Database Essentials - SQL using Postgres

It is important to be proficient with SQL to build data engineering pipelines. SQL is used for understanding the data, performing ad-hoc analysis, and also in building data engineering pipelines.

  1. Getting Started with Postgres

  2. Basic Database Operations (CRUD: Insert, Select, Update, and Delete)

  3. Writing Basic SQL Queries (Filtering, Joins, and Aggregations)

  4. Creating Tables and Indexes using Postgres DDL Commands

  5. Partitioning Tables and Indexes using Postgres DDL Commands

  6. Predefined Functions using SQL (String Manipulation, Date Manipulation, and other functions)

  7. Writing Advanced SQL Queries using Postgresql
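
For a taste of the kind of queries covered in this section, here is a minimal sketch that runs a join and an aggregation against a local Postgres instance from Python. The connection details and the orders and order_items tables are hypothetical placeholders; use whatever database and data sets you load as part of the lab.

    import psycopg2

    # Hypothetical connection details for the Docker-based Postgres lab
    conn = psycopg2.connect(
        host="localhost",
        port=5432,
        dbname="retail_db",
        user="retail_user",
        password="retail_password"
    )

    query = """
        SELECT o.order_status,
               count(*) AS order_count,
               round(sum(oi.order_item_subtotal)::numeric, 2) AS revenue
        FROM orders AS o
            JOIN order_items AS oi
                ON o.order_id = oi.order_item_order_id
        GROUP BY o.order_status
        ORDER BY revenue DESC
    """

    # Run the query and print the aggregated results
    with conn, conn.cursor() as cur:
        cur.execute(query)
        for row in cur.fetchall():
            print(row)

    conn.close()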

  • Programming Essentials using Python

Python is the most preferred programming language to develop data engineering applications. As part of several sections related to Python, you will be learning most of the important aspects of Python to build data engineering applications effectively.

  1. Perform Database Operations

  2. Getting Started with Python

  3. Basic Programming Constructs in Python (for loops, if conditions)

  4. Predefined Functions in Python (string manipulation, date manipulation, and other standard functions)

  5. Overview of Collections such as list and set in Python

  6. Overview of Collections such as dict and tuple in Python

  7. Manipulating Collections using loops in Python. This is primarily designed to get enough practice with Python Programming around Python Collections.

  8. Understanding Map Reduce Libraries in Python. You will learn functions such as map, filter, etc. You will also understand details about itertools.

  9. Overview of Python Pandas Libraries. You will be learning how to read from files, process the data in Pandas Data Frames by applying Standard Transformations such as filtering, joins, and sorting, and write the data back to files.

  10. Database Programming using Python - CRUD Operations

  11. Database Programming using Python - Batch Operations. There will be enough emphasis on best practices to load data into Databases in bulk or batches.
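
To illustrate the style of exercises in this section, here is a minimal sketch combining collection manipulation using loops, map, and filter with the equivalent aggregation using a Pandas Data Frame. The orders data below is made up purely for illustration.

    import pandas as pd

    # Hypothetical sample data as a Python collection (list of dicts)
    orders = [
        {"order_id": 1, "order_status": "COMPLETE", "order_amount": 249.99},
        {"order_id": 2, "order_status": "PENDING", "order_amount": 129.50},
        {"order_id": 3, "order_status": "COMPLETE", "order_amount": 59.99},
        {"order_id": 4, "order_status": "CLOSED", "order_amount": 499.00},
    ]

    # Manipulating collections using loops, filter, and map
    complete_orders = [o for o in orders if o["order_status"] == "COMPLETE"]
    amounts = list(map(lambda o: o["order_amount"], complete_orders))
    print(sum(amounts))

    # The same aggregation using a Pandas Data Frame
    df = pd.DataFrame(orders)
    revenue_by_status = (
        df.groupby("order_status", as_index=False)["order_amount"]
          .sum()
          .sort_values("order_amount", ascending=False)
    )
    print(revenue_by_status)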

  • Setting up Single Node Data Engineering Cluster for Practice

The most common approach to building data engineering applications at scale is by using Apache Spark integrated with HDFS and YARN. Before getting into data engineering using Apache Spark and Hadoop, we need to set up an environment to practice data engineering using Apache Spark. As part of this section, we will primarily focus on setting up a single node cluster to learn key skills related to data engineering using distributed frameworks such as Apache Spark and Apache Hadoop.

We have simplified the complex tasks of setting up Apache Hadoop, Apache Hive, and Apache Spark leveraging Docker. You will be able to set up the cluster within an hour, without running into too many technical issues. However, if you run into any issues, feel free to reach out to us and we will help you overcome the challenges.

  • Master required Hadoop Skills to build Data Engineering Applications

As part of this section, you will primarily focus on HDFS commands so that you can copy files into HDFS. The data copied into HDFS will be used as part of building data engineering pipelines using Spark and Hadoop with Python as the Programming Language.

  1. Overview of HDFS Commands

  2. Copy Files into HDFS using the put or copyFromLocal command

  3. Validate whether the files are copied properly to HDFS using HDFS Commands.

  4. Get the size of the files using HDFS commands such as du, df, etc.

  5. Some fundamental concepts related to HDFS such as block size, replication factor, etc.
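
Here is a minimal sketch of these HDFS commands being invoked from Python using subprocess, assuming the single node Hadoop lab is up and running. The local and HDFS paths are hypothetical; replace them with the data sets provided as part of the lab.

    import subprocess

    # Copy a local data set into HDFS and validate the copy (paths are hypothetical)
    commands = [
        ["hdfs", "dfs", "-mkdir", "-p", "/user/your_user/retail_db"],
        ["hdfs", "dfs", "-put", "/data/retail_db/orders", "/user/your_user/retail_db"],
        ["hdfs", "dfs", "-ls", "/user/your_user/retail_db"],
        ["hdfs", "dfs", "-du", "-s", "-h", "/user/your_user/retail_db"],
    ]

    for command in commands:
        subprocess.run(command, check=True)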

  • Data Engineering using Spark SQL

Let us deep-dive into Spark SQL to understand how it can be used to build Data Engineering Pipelines. Spark with SQL provides us the ability to leverage the distributed computing capabilities of Spark coupled with an easy-to-use, developer-friendly SQL-style syntax.

  1. Getting Started with Spark SQL

  2. Basic Transformations using Spark SQL

  3. Managing Tables - Basic DDL and DML in Spark SQL

  4. Managing Tables - DML and Create Partitioned Tables using Spark SQL

  5. Overview of Spark SQL Functions to manipulate strings, dates, null values, etc

  6. Windowing Functions using Spark SQL for ranking, advanced aggregations, etc.
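
Here is a minimal sketch of Spark SQL in action from PySpark, assuming a working Spark setup. The orders table and its data are made up for illustration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("Spark SQL Sketch").getOrCreate()

    # Register a small, made-up Data Frame as a temporary view
    orders = spark.createDataFrame(
        [(1, "2024-01-01", "COMPLETE", 249.99),
         (2, "2024-01-01", "PENDING", 129.50),
         (3, "2024-01-02", "COMPLETE", 59.99)],
        ["order_id", "order_date", "order_status", "order_amount"]
    )
    orders.createOrReplaceTempView("orders")

    # Basic filtering plus a windowing function using Spark SQL
    spark.sql("""
        SELECT order_date,
               order_status,
               order_amount,
               rank() OVER (PARTITION BY order_date ORDER BY order_amount DESC) AS rnk
        FROM orders
        WHERE order_status = 'COMPLETE'
    """).show()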

  • Data Engineering using Spark Data Frame APIs

Spark Data Frame APIs are an alternative way of building Data Engineering applications at scale leveraging distributed computing capabilities of Apache Spark. Data Engineers from application development backgrounds might prefer Data Frame APIs over Spark SQL to build Data Engineering applications.

  1. Data Processing Overview using Spark or Pyspark Data Frame APIs.

  2. Projecting or Selecting data from Spark Data Frames, renaming columns, providing aliases, dropping columns from Data Frames, etc using Pyspark Data Frame APIs.

  3. Processing Column Data using Spark or Pyspark Data Frame APIs - You will be learning functions to manipulate strings, dates, null values, etc.

  4. Basic Transformations on Spark Data Frames using Pyspark Data Frame APIs such as Filtering, Aggregations, and Sorting using functions such as filter/where, groupBy with agg, sort or orderBy, etc.

  5. Joining Data Sets on Spark Data Frames using Pyspark Data Frame APIs such as join. You will learn inner joins, outer joins, etc using the right examples.

  6. Windowing Functions on Spark Data Frames using Pyspark Data Frame APIs to perform advanced Aggregations, Ranking, and Analytic Functions

  7. Spark Metastore Databases and Tables and integration between Spark SQL and Data Frame APIs
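
Here is a minimal sketch of the same style of processing using PySpark Data Frame APIs, assuming a working Spark setup. The data and column names are made up for illustration.

    from pyspark.sql import SparkSession, Window
    from pyspark.sql.functions import col, sum as sum_, rank

    spark = SparkSession.builder.appName("Data Frame API Sketch").getOrCreate()

    # Made-up orders and order_items Data Frames
    orders = spark.createDataFrame(
        [(1, "2024-01-01", "COMPLETE"),
         (2, "2024-01-01", "CLOSED"),
         (3, "2024-01-02", "COMPLETE")],
        ["order_id", "order_date", "order_status"]
    )
    order_items = spark.createDataFrame(
        [(1, 1, 299.98), (2, 2, 199.99), (3, 3, 50.00), (4, 1, 129.99)],
        ["order_item_id", "order_item_order_id", "order_item_subtotal"]
    )

    # Filtering, joining, and aggregating using Data Frame APIs
    daily_revenue = (
        orders.filter(col("order_status") == "COMPLETE")
              .join(order_items, orders.order_id == order_items.order_item_order_id)
              .groupBy("order_date")
              .agg(sum_("order_item_subtotal").alias("revenue"))
    )

    # Ranking days by revenue using a windowing function
    spec = Window.orderBy(col("revenue").desc())
    daily_revenue.withColumn("rnk", rank().over(spec)).show()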

  • Development, Deployment as well as Execution Life Cycle of Spark Applications

Once you go through the content related to Apache Spark using a Jupyter-based environment, we will also walk you through the details of how Spark applications are typically developed using Python, deployed, and reviewed.

  1. Setup Python Virtual Environment and Project for Spark Application Development using Pycharm

  2. Understand complete Spark Application Development Lifecycle using Pycharm and Python

  3. Build a zip file for the Spark Application, copy it to the environment where it is supposed to run, and run it.

  4. Understand how to review the Spark Application Execution Life Cycle.
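
Here is a minimal sketch of what such a stand-alone Spark application might look like, along with a hypothetical spark-submit command. The file names, paths, and application logic are placeholders, not the actual project built in this section.

    # app.py - a minimal stand-alone Spark application
    import sys

    from pyspark.sql import SparkSession


    def main(input_path, output_path):
        spark = SparkSession.builder.appName("Daily Order Counts").getOrCreate()
        orders = spark.read.csv(input_path, header=True, inferSchema=True)
        orders.groupBy("order_date").count() \
              .write.mode("overwrite").parquet(output_path)
        spark.stop()


    if __name__ == "__main__":
        main(sys.argv[1], sys.argv[2])

    # Example of submitting the application to the cluster (paths are hypothetical):
    #   spark-submit --master yarn --deploy-mode client app.py /input/orders /output/daily_counts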

Desired Audience for this Data Engineering Essentials course

People from different backgrounds can aim to become Data Engineers. We cover most of the Data Engineering essentials for the aspirants who want to get into the IT field as Data Engineers as well as professionals who want to propel their career toward Data Engineering from legacy technologies.

  • College students and entry-level professionals to get hands-on expertise with respect to Data Engineering. This course will provide enough skills to face interviews for entry-level data engineers.

  • Experienced application developers to gain expertise related to Data Engineering.

  • Conventional Data Warehouse Developers, ETL Developers, Database Developers, and PL/SQL Developers to gain enough skills to transition to being successful Data Engineers.

  • Testers to improve their testing capabilities related to Data Engineering applications.

  • Other hands-on IT Professionals who want to gain knowledge of Data Engineering with Hands-On Practice.

Prerequisites to practice Data Engineering Skills

Here are the prerequisites for someone who wants to be a Data Engineer.

  • Logistics

    • Computer with a decent configuration (at least 4 GB RAM; however, 8 GB is highly desired). If your computer does not have enough capacity, we will walk you through cheaper options (such as AWS Cloud9) to set up the environment and practice.

    • Dual Core is required and Quad-Core is highly desired

    • Chrome Browser

    • High-Speed Internet

  • Desired Background

    • Engineering or Science Degree

    • Ability to use a computer

    • Knowledge or working experience with databases and any programming language is highly desired

Training Approach for learning required Data Engineering Skills

Here are the details related to the training approach for you to master all the key Data Engineering Skills to propel your career toward Data Engineering.

  • It is self-paced with reference material, code snippets, and videos provided as part of Udemy.

  • One can either use the environment provided by us or set up their own environment using Docker on AWS or GCP or the platform of their choice.

  • We would recommend completing 2 modules every week by spending 4 to 5 hours per week.

  • It is highly recommended to take care of the exercises at the end to ensure that you are able to meet all the key objectives for each module.

  • Support will be provided through Udemy Q&A.

The course is designed in such a way that one can self-evaluate through the course and confirm whether the skills are acquired.

  • Here is the approach we recommend for taking this course.

    • The course is hands-on with thousands of tasks; you should practice as you go through the course.

    • You should also spend time understanding the concepts. If you do not understand a concept, we would recommend moving on and coming back to the topic later.

    • Go through the consolidated exercises and see if you are able to solve the problems or not.

    • Make sure to follow the order we have defined as part of the course.

    • After each and every section or module, make sure to solve the exercises. We have provided enough information to validate the output.

  • By the end of the course, you should be able to confirm that you have mastered the essential skills related to SQL, Python, and Apache Spark.

Learn key Data Engineering Skills such as SQL, Python, Apache Spark (Spark SQL and Pyspark) with Exercises and Projects


What you will learn
  • Setup Development Environment to learn building Data Engineering Applications on GCP
  • Database Essentials for Data Engineering using Postgres such as creating tables, indexes, running SQL Queries, using important pre-defined functions, etc.
  • Data Engineering Programming Essentials using Python such as basic programming constructs, collections, Pandas, Database Programming, etc.

Rating: 4.22442

Level: Intermediate Level

Duration: 66 hours

Instructor: Durga Viswanatha Raju Gadiraju

