Apache Spark & Scala Certification Training Gurgaon, Delhi - Gyansetu

Apache Spark is an open-source big data processing framework built around speed, ease of use, and sophisticated analytics. It enables applications in Hadoop clusters to run up to 100 times faster in memory, and up to 10 times faster on disk, than Hadoop MapReduce. This has made it an essential technology, and everyone who wants to stay in big data engineering is keen to become an expert in Apache Spark.

Instructor Led Training  |  Free Course Repeat  |  Placement Assistance  |  Job Focused Projects  |  Interview Preparation Sessions

Curriculum

The course will also help you understand the Spark ecosystem and its related APIs, such as Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX, as well as Spark Core concepts.

    Overview of Hadoop & Apache PySpark 

    This will cover an introduction to the Hadoop ecosystem with a little insight into Apache Hive.


    Introduction to Big Data and Data Science

    • Learn about big data and see examples of how data science can leverage big data
    • Performing Data Science and Preparing Data – explore data science definitions and topics, and the process of preparing data

    1) Introduction

    2) PySpark Installation

    3) PySpark & its Features

        > Speed

        > Reusability

        > Advanced Analytics

        > In Memory Computation

        > Real-Time Stream Processing

        > Lazy Evaluation

        > Dynamic in Nature

        > Immutability

        > Partitioning

    4) PySpark with Hadoop

    5) PySpark Components

    6) PySpark Architecture
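
    The "Lazy Evaluation" and "Immutability" features listed above can be illustrated without a cluster. The following is a plain-Python sketch, not PySpark itself: the `LazyDataset` class and the `square` helper are hypothetical names invented for this illustration. Transformations only record work, each one returns a new object, and nothing runs until an action (`collect`) is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical class, not part of PySpark)."""

    def __init__(self, source, steps=()):
        self._source = list(source)
        self._steps = tuple(steps)  # recorded transformations, not yet run

    def map(self, fn):
        # Transformation: returns a NEW dataset; nothing is computed here.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also lazy, also leaves the original untouched.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: only now is the recorded pipeline executed.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []  # records every time `square` actually runs


def square(x):
    calls.append(x)
    return x * x


numbers = LazyDataset(range(1, 6))
evens_squared = numbers.map(square).filter(lambda v: v % 2 == 0)
ran_before_action = list(calls)   # still empty: no work has happened yet
result = evens_squared.collect()  # the action triggers the computation
```

    In this sketch `numbers` is unchanged after the pipeline is built, which mirrors how an RDD stays immutable while transformations produce new RDDs.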

    1) RDD Overview

    2) RDD Types (Ways to Create RDD):

    • Parallelized RDDs (pair RDDs) > RDD from collection objects. Example: use map() & flatMap() to build key-value pairs so that we are able to perform groupByKey() & other operations.
    • RDD from External Datasets (CSV,JSON,XML.....)
      • rowRDD
      • schemaRDD (Adding schema in RDD particularly Dataframes)

    3) RDD Operations:

    • Basic Operations like map() & flatMap() to convert into pairRDD.
    • Actions
    • Transformations : 

    Narrow Transformation      Wide Transformation
    map()                      intersection()
    flatMap()                  distinct()
    mapPartitions()            reduceByKey()
    filter()                   groupByKey()
    sample()                   join()
    union()                    cartesian()
                               repartition()
                               coalesce()
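
    The difference between narrow and wide transformations can be sketched in plain Python, with partitions modelled as lists. This is an illustration only, not the Spark API: `narrow_map` and `wide_reduce_by_key` are hypothetical helper names. The narrow step works on each partition independently, while the wide step must first move records with the same key into the same partition (the shuffle).

```python
from collections import defaultdict


def narrow_map(partitions, fn):
    # Narrow: each output partition depends on exactly one input partition.
    return [[fn(x) for x in part] for part in partitions]


def wide_reduce_by_key(partitions, fn, num_partitions):
    # Wide: records with the same key must be brought together (a shuffle).
    shuffled = [defaultdict(list) for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            shuffled[hash(key) % num_partitions][key].append(value)

    # After the shuffle, each partition can reduce its own keys locally.
    out = []
    for bucket in shuffled:
        reduced = []
        for key, values in bucket.items():
            acc = values[0]
            for v in values[1:]:
                acc = fn(acc, v)
            reduced.append((key, acc))
        out.append(reduced)
    return out


words = [["a", "b", "a"], ["b", "c"]]              # two input partitions
pairs = narrow_map(words, lambda w: (w, 1))        # no data movement
counts = wide_reduce_by_key(pairs, lambda x, y: x + y, num_partitions=2)
word_counts = dict(kv for part in counts for kv in part)
```

    Which partition a key lands in depends on Python's string hashing, so only the merged `word_counts` is stable across runs.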


    Passing Functions to PySpark

    Working with Key-Value Pairs (pairRDDs)

    ShuffledRDD : Shuffle Operations Background & Performance Impact

    RDD Persistence & Unpersist

    PySpark Deployment :

        1) Local

        2) Cluster Modes:

            a) Client mode

            b) Cluster mode

    Shared Variables:

        1) Broadcast Variables

        2) Accumulators

    Launching PySpark Jobs from Java/Scala

    Unit Test Cases

    PySpark API :

        1) PySpark Core Module

        2) PySpark SQL Module

        3) PySpark Streaming Module

        4) PySpark ML Module

    Integrating with various Datasources (SQL+ NoSQL + Cloud)

    Dataframes: 

    1) Overview

    • An introduction to PySpark framework in cluster computing space.
    • Reading data from different PySpark sources.
    • PySpark Dataframes Actions & Transformations.
    • PySpark SQL & Dataframes
    • Basic Hadoop functionality
    • Windowing functions in Hive
    • PySpark Architecture and Components
    • PySpark SQL and Dataframes
    • PySpark Data frames
    • PySpark SQL
    • Running PySpark on cluster
    • PySpark Performance Tips


    2) Creating Dataframes using SparkSession

    3) Untyped Transformations

    4) Running SQL Queries Programmatically

    5) Global Temporary View

    6) Interoperating with RDDs:

    1. Inferring Schema using Reflection
    2. Programmatically specifying the Schema


    7) Aggregations:

    1. Untyped User-Defined Aggregate Functions
    2. Type-Safe User-Defined Aggregate Functions

    1) Generic Load/Save Functions

    • Manually Specifying Options
    • Run SQL on files directly (avro/parquet files)
    • Saving to Persistent Tables ( persistent tables into HiveMetastore)
    • Bucketing, Sorting & Partitioning (repartition() & coalesce())

     2) Parquet Files

    1.  Loading Data Programmatically
    2.  Partition Discovery
    3.  Schema Merging
    4.  Hive Metastore Parquet Table Conversion
    • Hive Parquet/Schema Reconciliation
    • Metadata Refreshing

    5. Configuration


    3) ORC Files

    4) JSON Datasets

    5) Hive Tables:

    • Specifying storage format for Hive Tables
    • Interacting with various versions of Hive Metastore.

    a) JDBC to other databases (SQL+ NoSQL databases)

    b) Troubleshooting

    c) PySpark SQL Optimizer

    d) Transforming Complex Datatypes

    e) Handling BadRecords & Files 

    f) Task Preemption for High Concurrency

    (Optional: only for Databricks Cloud)


    g) Handling Large Queries in Interactive Flows

    • Only for Databricks Cloud (using Watchdog)


    h) Skew Join Optimization

    i) PySpark UDF Scala 

    j) PySpark UDF Python

    k) PySpark UDAF Scala

    l) Performance Tuning/Optimization of PySpark Jobs:

    • Caching Data in Memory
    • Other Configurations like:

        > maxPartitionBytes
        > openCostInBytes
        > broadcastTimeout
        > autoBroadcastJoinThreshold
        > partitions

    • BHJ (BroadcastHashJoin)
    • Serializers


    m) Distributed SQL Engine: (Accessing PySpark SQL using JDBC/ODBC using Thrift API)

    1.  Running the Thrift JDBC/ODBC server
    2.  Running the PySpark SQL CLI (Accessing PySpark SQL using SQL shell)


    n) PySpark for Pandas using Apache Arrow:

    1. Apache Arrow in PySpark (To convert PySpark dataframes into Python dataframes)
    2. Conversion to/from Pandas
    3. Pandas UDFs


    o) PySpark SQL compatibility with Apache Hive

    p) Working with DataFrames Python & Scala

    q) Connectivity with various DataSources

    a) Introduction to DataSets

    b) DataSet API - DataSets Operators

    c) DataSet API - Typed Transformations

    d) Datasets API - Basic Actions

    e) DataSet API - Structured Query with Encoder

    f) Windowing & Aggregation DataSet API

    g) DataSet Checking & Persistence

    h) DataSet Checkpointing

    i) Datasets vs Dataframes vs RDDs

    a) Overview

    b) Example to demonstrate DStreams

    c) PySpark Streaming Basic Concepts:

    1. Linking
    2. Initialize Streaming Context
    3. Discretized Streams (DStreams)
    4. Input DStreams & Receivers
    5. Transformations on DStreams
    6. Output Operations on DStreams : print(), saveAsTextFiles(), saveAsObjectFiles(), saveAsHadoopFiles(), foreachRDD()
    7. Dataframe & SQL Operations on DStreams (Converting DStreams into Dataframe)
    8. Streaming Machine Learning on streaming data
    9. Caching/Persistence
    10. Checkpointing
    11. Accumulators, Broadcast Variables, and Checkpoints
    12. Deploying PySpark Streaming Applications.
    13. Monitoring PySpark Streaming Applications


    d) Performance Tuning PySpark Streaming Applications:

    1. Reducing the Batch Processing Times
    2. Setting the Right Batch Interval
    3. Memory Tuning


    e) Fault Tolerance Semantics

    f) Integration:

    1. Kafka Integration
    2. Kinesis Integration
    3. Flume Integration


    g) Custom Receivers : creating a client/server application with PySpark Streaming.


    a) Overview

    b) Example to demonstrate GraphX

    c) PropertyGraph : Example Property Graph

    d) Graph Operators:

    1. Summary List of Operators
    2. Property Operators
    3. Structural Operators
    4. Join Operators
    5. Neighbourhood Aggregation:

        > Aggregate Messages (aggregateMessages)
        > Map Reduce Triplets Transition Guide (Legacy)
        > Computing Degree Information
        > Collecting Neighbours

    6. Caching & Uncaching


    e) Pregel API

    f) Graph Builders

    g) Vertex & Edge RDDs:

    1. VertexRDDs
    2. EdgeRDDs


    h) Optimized Representation

    i) Graph Algorithms

    1. PageRank
    2. Connected Components
    3. Triangle Counting


    j) Examples

    k) GraphFrames & GraphX

    • How to manage & Monitor Apache Spark on Kubernetes
    • Spark Submit Vs Kubernetes Operator
    • How Spark Submit works with Kubernetes
    • How Kubernetes Operator for Spark Works.
    • Setting up of Hadoop Cluster on Docker
    • Deploying MR, Sqoop & Hive Jobs inside Hadoop Dockerized environment.
    • Building & running applications using PySpark API.
    • PySpark SQL with MySQL (JDBC) source:
    • Now that we have PySpark SQL experience with CSV and JSON, connecting to and using a MySQL database will be easy. So, let’s cover how to use PySpark SQL with Python and a MySQL database as the input data source.
    • Overview
    • We’re going to load some NYC Uber data into a database. Then, we’re going to start PySpark with a command-line argument that specifies the JDBC driver needed to connect to the JDBC data source. We’ll make sure we can authenticate and then start running some queries.

    ** We will also go through some related concepts, such as caching and UDFs.
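
    The MySQL (JDBC) steps described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the course's own code: the helper names `mysql_jdbc_options` and `load_table`, the host, the database `uber`, the table `trips`, and the credentials are all hypothetical placeholders. It assumes MySQL Connector/J 8.x, whose driver class is `com.mysql.cj.jdbc.Driver`, and an already created SparkSession.

```python
def mysql_jdbc_options(host, port, database, table, user, password):
    """Build the option map for Spark's JDBC data source (MySQL)."""
    return {
        "url": f"jdbc:mysql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "com.mysql.cj.jdbc.Driver",  # assumes Connector/J 8.x
    }


def load_table(spark, options):
    # `spark` is an existing SparkSession. The connector JAR must be on the
    # classpath, for example by starting the shell with:
    #   pyspark --jars /path/to/mysql-connector-j.jar   (placeholder path)
    return spark.read.format("jdbc").options(**options).load()


# Hypothetical connection details, for illustration only.
opts = mysql_jdbc_options("localhost", 3306, "uber", "trips", "spark_user", "secret")
```

    Inside a PySpark shell, `load_table(spark, opts)` would return a DataFrame that can be registered as a temporary view and queried with SQL.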

    A real-world business case built on top of PySpark API. 

    1) Web Server Log Analysis with Apache PySpark – use PySpark to explore a NASA Apache web server log.

    2) Introduction to Machine Learning with Apache PySpark – use PySpark’s MLlib machine learning library to perform collaborative filtering on a movie dataset

    3) PySpark SQL with New York City Uber Trips CSV Source: PySpark SQL uses a type of Resilient Distributed Dataset called DataFrames which are composed of Row objects accompanied by a schema. The schema describes the data types of each column. A DataFrame may be considered similar to a table in a traditional relational database.

    Methodology

    We’re going to use the Uber dataset and the spark-csv package available from Spark Packages to make our lives easier. The spark-csv package is described as a “library for parsing and querying CSV data with Apache Spark, for Spark SQL and DataFrames”. This library is compatible with Spark 1.3 and above.

     PySpark SQL with New York City Uber Trips CSV Source

    Software Required :

    • VMware Workstation
    • Ubuntu ISO image set up in a virtual environment (VMware)
    • Cloudera Quickstart VM (version : 5.3.0)
    • PuTTY client
    • WinSCP
    • Hadoop software version 2.6.6
    • PySpark 2.x

     

Course Description

    The demand for Apache Spark is gaining momentum in the current job market. There is a huge shortage of analysts with hands-on experience in Apache Spark, which is also the main reason they draw handsome salaries.

    This course will help you understand how Spark executes in-memory data processing and why a Spark job runs faster than a Hadoop MapReduce job. It will also help you understand the Spark ecosystem and its related APIs, such as Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX, as well as Spark Core concepts. Finally, it will help you understand how data analytics and machine learning algorithms are applied to various datasets to process and analyze large amounts of data.

    Spark Ecosystem:

    • Understanding Scala in its implementation & in Apache Spark
    • OOPs concepts in Scala Programming language.
    • How to build Scala & Spark using SBT.
    • Scala & Spark installation.
    • Spark operations on Spark Shell.
    • How to submit Scala in Spark environment.
    • Spark Driver & its related Worker Nodes.
    • Spark + Flume Integration.
    • Setting up Data Pipeline using Apache Flume, Apache Kafka & Spark Streaming
    • Spark RDDs.
    • Spark RDDs Actions & Transformations.
    • Spark SQL : Connectivity with various relational sources & converting their data into DataFrames using Spark SQL.
    • Spark Streaming
    • Understanding role of RDD.
    • Spark MLlib : Creating Classifiers & Recommendation systems using MLlib.
    • Spark Core concepts : Creation of RDDs: Parallel RDDs, MappedRDD, HadoopRDD, JdbcRDD.
    • Spark Architecture & Components.
    • Spark Structured Streaming.
    • Spark SQL experience with CSV , XML & JSON.
    • Reading data from different Spark sources.
    • Spark SQL & Dataframes.
    • Implementing Accumulators & Broadcast variables for Performance tuning.
    • AWS Cloud
    • Deploying BIG Data application in Production environment using Docker & Kubernetes

    Our Big Data experts have realized that learning Hadoop alone doesn’t qualify candidates to clear the interview process. Interviewers’ demands and expectations of candidates are higher nowadays. They expect proficiency in advanced concepts like:

    • Expertise in PySpark/ Scala-Spark
    • Real Time Data Storage inside Data Lake (Real Time ETL)
    • Big Data Related Services on AWS Cloud
    • Deploying BIG Data application in Production environment using Docker & Kubernetes
    • Experience in Real Time Big Data Projects


    All the advanced level topics will be covered at Gyansetu in a classroom/online Instructor led mode with recordings.

    Knowledge of Python/Scala and SQL is good to have before starting Spark Training in Gurgaon. However, Gyansetu offers a complimentary instructor-led course on Python/Scala and SQL before you start the Spark course.

    Gyansetu provides a complimentary placement service to all students. The Gyansetu placement team consistently works on industry collaborations and associations, which help our students find their dream job right after the completion of training.

    • Our placement team will add Big Data Hadoop, PySpark skills & projects in your CV and update your profile on Job search engines like Naukri, Indeed, Monster, etc. This will increase your profile visibility in top recruiter search and ultimately increase interview calls by 5x.
    • Our faculty offers extended support to students by clearing doubts faced during the interview and preparing them for the upcoming interviews.
    • Gyansetu’s Students are currently working in Companies like Sapient, Capgemini, TCS, Sopra, HCL, Birlasoft, Wipro, Accenture, Zomato, Ola Cabs, Oyo Rooms, etc.

    • Gyansetu’s trainers are well known in the industry, highly qualified, and currently working in top MNCs.
    • We provide interaction with faculty before the course starts.
    • Our experts help students learn the technology from the basics. Even if you are not good at basic programming, don’t worry! We will help you.
    • Faculties will help you in preparing project reports & presentations.
    • Students will be provided Mentoring sessions by Experts.

Certification

Apache Spark Certification

Structure your learning and get a certificate to prove it.

Projects

    Environment: Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Scala, Kafka, Hive

    Tools & Techniques used :  Hadoop+HBase+Spark+Flink+Beam+ML stack, Docker & KUBERNETES, Kafka, MongoDB, AVRO, Parquet

    Description : The aim is to create a Batch/Streaming/ML/WebApp stack where you can test your jobs locally or submit them to the YARN resource manager. We use Docker to build the environment and Docker Compose to provision it with the required components (the next step is to use Kubernetes). Along with the infrastructure, we check that it works with 4 projects that simply probe that everything is working as expected. The boilerplate is based on a sample flight-search web application.

    Environment: Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Scala, Kafka, Hive, Amazon AWS, Elastic Search, Zookeeper

    Tools & Techniques used :  PySpark MLlib, Spark Streaming, Python (Jupyter Notebook, Anaconda), Machine Learning packages: NumPy, Pandas, Matplotlib, Seaborn, scikit-learn, Random Forest and Gradient Boosting, confusion matrix, Tableau

    Description : Build a predictive model that predicts fraudulent transactions on PLCC & DC cards on a daily basis. This includes data extraction, then data cleaning, followed by data pre-processing.

     • Pre-processing includes standard scaling (normalizing the data), followed by cross-validation techniques to check the compatibility of the data.

     • In data modeling, we use Decision Tree with Random Forest and Gradient Boost, with hyperparameter tuning techniques to tune our model.

     • In the end, we evaluate the model by measuring the confusion matrix, reaching an accuracy of 98%, and produce a trained model that shows all the fraudulent transactions on PLCC & DC cards on a Tableau dashboard.
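
    Two of the steps named above, standard scaling and reading accuracy off a confusion matrix, can be sketched in plain Python. This is an illustration with made-up numbers, not the project's code or data: the function names and the sample values are hypothetical, and the labels use 1 for fraud and 0 for a normal transaction.

```python
from statistics import mean, pstdev


def standard_scale(values):
    # Standard scaling: subtract the mean, divide by the standard deviation.
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def confusion_matrix(actual, predicted):
    # Counts for a binary problem where 1 = fraud and 0 = not fraud.
    cm = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            cm["tp"] += 1
        elif a == 0 and p == 1:
            cm["fp"] += 1
        elif a == 0 and p == 0:
            cm["tn"] += 1
        else:
            cm["fn"] += 1
    return cm


def accuracy(cm):
    # Accuracy = correct predictions / all predictions.
    return (cm["tp"] + cm["tn"]) / sum(cm.values())


amounts = [10.0, 20.0, 30.0]      # made-up transaction amounts
scaled = standard_scale(amounts)  # mean 0, standard deviation 1

actual = [1, 0, 0, 1, 0]          # made-up labels
predicted = [1, 0, 1, 1, 0]       # made-up model output
cm = confusion_matrix(actual, predicted)
acc = accuracy(cm)                # 4 of 5 correct, so 0.8
```

    Because fraud is usually rare, accuracy alone can look high even when many fraudulent transactions are missed, so the individual cells of the confusion matrix are worth inspecting as well.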

FAQs

    We have seen that getting a relevant interview call is not a big challenge in your case. Our placement team consistently works on industry collaborations and associations, which help our students find their dream job right after the completion of training. We help you prepare your CV by adding relevant projects and skills once 80% of the course is completed. Our placement team will update your profile on job portals, which increases relevant interview calls by 5x.

    Interview selection depends on your knowledge and learning. As per the past trend, the initial 5 interviews are a learning experience in:

    • What type of technical questions are asked in interviews?
    • What are their expectations?
    • How should you prepare?


    Our faculty team will constantly support you during interviews. Usually, students get a job after appearing in 6-7 interviews.

    We have seen that getting a technical interview call is a challenge at times. Most of the time you receive sales, backend, or BPO job calls. No worries! Our placement team will prepare your CV in such a way that you will get a good number of technical interview calls. We will provide interview preparation sessions and make you job ready. Our placement team consistently works on industry collaborations and associations, which help our students find their dream job right after the completion of training. Our placement team will update your profile on job portals, which increases relevant interview calls by 3x.

    Interview selection depends on your knowledge and learning. As per the past trend, the initial 8 interviews are a learning experience in:

    • What type of technical questions are asked in interviews?
    • What are their expectations?
    • How should you prepare?


    Our faculty team will constantly support you during interviews. Usually, students get a job after appearing in 6-7 interviews.

    We have seen that getting a technical interview call is hardly possible. Gyansetu provides internship opportunities to non-working students so that they have some industry exposure before they appear in interviews. Internship experience adds a lot of value to your CV, and our placement team will prepare your CV in such a way that you will get a good number of interview calls. We will provide interview preparation sessions and make you job ready. Our placement team consistently works on industry collaborations and associations, which help our students find their dream job right after the completion of training, and we will update your profile on job portals, which increases relevant interview calls by 3x.

    Interview selection depends on your knowledge and learning. As per the past trend, the initial 8 interviews are a learning experience in:

    • What type of technical questions are asked in interviews?
    • What are their expectations?
    • How should you prepare?


    Our faculty team will constantly support you during interviews. Usually, students get a job after appearing in 6-7 interviews.

    Yes, a one-to-one faculty discussion and demo session will be provided before admission. We understand the importance of trust between you and the trainer. We will be happy if you clear all your queries before you start classes with us.


    We understand the importance of every session. Session recordings will be shared with you, and in case of any query, the faculty will give you extra time to resolve it.

    Yes, we understand that self-learning is most crucial and for the same we provide students with PPTs, PDFs, class recordings, lab sessions, etc, so that a student can get a good handle of these topics.

    We provide an option to retake the course within 3 months from the completion of your course, so that you get more time to learn the concepts and do the best in your interviews.


    We believe that having fewer students is the best way to pay attention to each student individually, so our batch size varies between 5 and 10 people.


    Yes, we have batches available on weekends. We understand many students are in jobs and it's difficult to take time for training on weekdays. Batch timings need to be checked with our counsellors.

    Yes, we have batches available on weekdays, but in limited time slots. Since most of our trainers are working professionals, the batches are available either in the morning or in the evening hours. You need to contact our counsellors to know more about this.

    Total duration of the course is 80 hours (40 Hours of live instructor led training and 40 hours of self paced learning).

    You don’t need to pay anyone for software installation. Our faculty will provide you with all the required software and will assist you through the complete installation process.


    Our faculties will help you in resolving your queries during and after the course.