Online course detail
Curriculum
- Learn about big data and see examples of how data science can leverage big data
- Performing Data Science and Preparing Data – explore data science definitions and topics, and the process of preparing data
Overview of Hadoop & Apache PySpark
This module covers an introduction to the Hadoop ecosystem with a brief look at Apache Hive.
Introduction to Big Data and Data Science
1) Introduction
2) PySpark Installation
3) PySpark & its Features
> Speed
> Reusability
> Advanced Analytics
> In-Memory Computation
> Real-Time Stream Processing
> Lazy Evaluation
> Dynamic in Nature
> Immutability
> Partitioning
4) PySpark with Hadoop
5) PySpark Components
6) PySpark Architecture
1) RDD Overview
2) RDD Types (Ways to Create RDDs):
- Parallelized RDDs (PairRDDs) > RDDs from collection objects. Example: map() & flatMap() functions, so that we are able to apply groupByKey() & other pair operations.
- RDDs from External Datasets (CSV, JSON, XML, ...)
- rowRDD
- schemaRDD (adding a schema to an RDD, particularly DataFrames)
3) RDD Operations:
- Basic operations like map() & flatMap() to convert into pairRDDs
- Actions
- Transformations (see the code sketch after this RDD topic list):
Narrow Transformation | Wide Transformation |
map() | intersection() |
flatMap() | distinct() |
mapPartitions() | reduceByKey() |
filter() | groupByKey() |
sample() | join() |
union() | cartesian() |
repartition() | |
coalesce() | |
Passing Functions to PySpark
Working with Key-Value Pairs (pairRDDs)
ShuffledRDD : Shuffle Operations Background & Performance Impact
RDD Persistence & Unpersist
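A minimal, hedged sketch tying together the RDD topics above — creating an RDD from a collection, a pair RDD, a narrow and a wide transformation, an action, and persistence. The app name and sample data are illustrative.

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "RDDBasics")                 # illustrative local setup

lines = sc.parallelize(["spark makes big data simple",
                        "big data with spark"])            # RDD from a collection

words = lines.flatMap(lambda line: line.split())           # narrow transformation
pairs = words.map(lambda word: (word, 1))                  # pairRDD (key-value pairs)
counts = pairs.reduceByKey(lambda a, b: a + b)             # wide transformation (shuffle)

counts.persist(StorageLevel.MEMORY_ONLY)                   # RDD persistence
print(counts.collect())                                    # action triggers evaluation
counts.unpersist()
sc.stop()
```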
PySpark Deployment :
1) Local
2) Cluster Modes:
a) Client mode
b) Cluster mode
Shared Variables:
1) Broadcast Variables
2) Accumulators
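A minimal sketch of the shared-variable topics above, assuming a simple lookup-table use case: a broadcast variable distributes read-only data to the executors, while an accumulator aggregates a counter back to the driver.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "SharedVariablesDemo")

lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})    # broadcast variable (read-only on executors)
missing = sc.accumulator(0)                         # accumulator (add-only, read on the driver)

def translate(key):
    if key not in lookup.value:
        missing.add(1)                              # count keys absent from the lookup table
        return 0
    return lookup.value[key]

print(sc.parallelize(["a", "b", "x", "c"]).map(translate).collect())   # [1, 2, 0, 3]
print("missing keys:", missing.value)                                   # 1
sc.stop()
```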
Launching PySpark Jobs from Java/Scala
Unit Test Cases
PySpark API :
1) PySpark Core Module
2) PySpark SQL Module
3) PySpark Streaming Module
4) PySpark ML Module
Integrating with various Datasources (SQL+ NoSQL + Cloud)
- An introduction to PySpark framework in cluster computing space.
- Reading data from different PySpark sources.
- PySpark Dataframes Actions & Transformations.
- PySpark SQL & Dataframes
- Basic Hadoop functionality
- Windowing functions in Hive
- PySpark Architecture and Components
- Running PySpark on cluster
- PySpark Performance Tips
- Inferring Schema using Reflection
- Programmatically Specifying the Schema
- Untyped User-Defined Aggregate Functions
- Type-Safe User-Defined Aggregate Functions
Dataframes:
1) Overview
2) Creating DataFrames using SparkSession
3) Untyped Transformations
4) Running SQL Queries Programmatically
5) Global Temporary View
6) Interoperating with RDDs:
7) Aggregations:
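A minimal sketch of the DataFrame topics above — creating a DataFrame with SparkSession, an untyped transformation, running SQL programmatically against a temporary view, and an aggregation. The sample data is illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DataFrameBasics").getOrCreate()

df = spark.createDataFrame([("Alice", 34), ("Bob", 45), ("Alice", 29)],
                           ["name", "age"])

df.select(df["name"], df["age"] + 1).show()             # untyped transformation

df.createOrReplaceTempView("people")                     # use createGlobalTempView for a global view
spark.sql("SELECT name, COUNT(*) AS cnt FROM people GROUP BY name").show()

df.groupBy("name").agg(F.avg("age").alias("avg_age")).show()   # aggregation
```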
- Manually Specifying Options
- Run SQL on files directly (avro/parquet files)
- Saving to Persistent Tables (persistent tables in the Hive metastore)
- Bucketing, Sorting & Partitioning ( repartition() & coalesce() )
- Loading Data Programmatically
- Partition Discovery
- Schema Merging
- Hive Metastore Parquet Table Conversion
- Hive Parquet/Schema Reconciliation
- Metadata Refreshing
- Specifying storage format for Hive Tables
- Interacting with various versions of Hive Metastore.
- Only for Databricks Cloud (using Watchdog)
- Caching Data in Memory
- Other Configurations like:
- BHJ (BroadcastHashJoin)
- Serializers
- Running the Thrift JDBC/ODBC server
- Running the PySpark SQL CLI (Accessing PySpark SQL using SQL shell)
- Apache Arrow in PySpark (to convert between PySpark DataFrames and pandas DataFrames)
- Conversion to/from Pandas
- Pandas UDFs
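A minimal sketch of the Arrow and pandas UDF items above, assuming PySpark 3.x with pyarrow installed; the column name and UDF are hypothetical.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("ArrowDemo").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")    # Arrow-based conversion

df = spark.createDataFrame(pd.DataFrame({"x": [1.0, 2.0, 3.0]}))       # pandas -> Spark
pdf = df.toPandas()                                                     # Spark -> pandas

@pandas_udf("double")
def plus_one(col: pd.Series) -> pd.Series:                              # scalar pandas UDF
    return col + 1

df.select(plus_one("x").alias("x_plus_one")).show()
```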
1) Generic Load/Save Functions
2) Parquet Files (incl. Configuration)
3) ORC Files
4) JSON Datasets
5) Hive Tables:
a) JDBC to other databases (SQL+ NoSQL databases)
b) Troubleshooting
c) PySpark SQL Optimizer
d) Transforming Complex Datatypes
e) Handling BadRecords & Files
f) Task Preemption for High Concurrency
Optional >> Only for Databricks Cloud
g) Handling Large Queries in Interactive Flows
h) Skew Join Optimization
i) PySpark UDF Scala
j) PySpark UDF Python
k) PySpark UDAF Scala
l) Performance Tuning/Optimization of PySpark Jobs:
>maxPartitionBytes
>openCostInBytes
>broadcastTimeout
>autoBroadcastJoinThreshold
>partitions
m) Distributed SQL Engine: (Accessing PySpark SQL using JDBC/ODBC using Thrift API)
n) PySpark for Pandas using Apache Arrow:
o) PySpark SQL compatibility with Apache Hive
p) Working with DataFrames Python & Scala
q) Connectivity with various DataSources
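A minimal sketch of the generic load/save functions and the tuning options listed under item l) above; file paths, the partition column, and the values are illustrative, and each option maps to the spark.sql.* setting shown in the comments.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataSourcesAndTuning").getOrCreate()

# Tuning options from item l) above (values shown are illustrative)
spark.conf.set("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024)   # maxPartitionBytes
spark.conf.set("spark.sql.files.openCostInBytes", 4 * 1024 * 1024)       # openCostInBytes
spark.conf.set("spark.sql.broadcastTimeout", 300)                         # broadcastTimeout
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)  # autoBroadcastJoinThreshold
spark.conf.set("spark.sql.shuffle.partitions", 200)                       # partitions

# Generic load/save with manually specified options (paths and column are hypothetical)
df = spark.read.format("json").load("data/people.json")
(df.write.format("parquet")
   .mode("overwrite")
   .partitionBy("country")              # partitioning on write
   .save("output/people.parquet"))
```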
a) Introduction to DataSets
b) DataSet API - DataSets Operators
c) DataSet API - Typed Transformations
d) Datasets API - Basic Actions
e) DataSet API - Structured Query with Encoder
f) Windowing & Aggregation DataSet API
g) DataSet Caching & Persistence
h) DataSet Checkpointing
i) Datasets vs DataFrames vs RDDs
- Linking
- Initialize Streaming Context
- Discretized Streams (DStreams)
- Input DStreams & Receivers
- Transformations on DStreams
- Output Operations on DStreams : print(), saveAsTextFile(), saveAsObjectFiles(), saveAsHadoopFiles(), foreachRDD()
- Dataframe & SQL Operations on DStreams (Converting DStreams into Dataframe)
- Streaming Machine Learning (applying ML to streaming data)
- Caching/Persistence
- Checkpointing
- Accumulators, Broadcast Variables, and Checkpoints
- Deploying PySpark Streaming Applications.
- Monitoring PySpark Streaming Applications
- Reducing the Batch Processing Times
- Setting the Right Batch Interval
- Memory Tuning
- Kafka Integration
- Kinesis Integration
- Flume Integration
a) Overview
b) Example to demonstrate DStreams
c) PySpark Streaming Basic Concepts:
d) Performance Tuning PySpark Streaming Applications:
e) Fault Tolerance Semantics
f) Integration:
g) Custom Receivers : creating a client/server application with PySpark Streaming.
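A minimal DStream word-count sketch for the streaming topics above, assuming PySpark 2.x as listed in the software requirements (the DStream API is deprecated in recent Spark releases) and a text source on localhost:9999 (e.g. started with `nc -lk 9999`); the checkpoint path is hypothetical.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, batchDuration=5)             # 5-second batch interval
ssc.checkpoint("/tmp/streaming-checkpoint")             # checkpointing (hypothetical path)

lines = ssc.socketTextStream("localhost", 9999)         # input DStream + receiver
counts = (lines.flatMap(lambda line: line.split())      # transformations on the DStream
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                          # output operation

ssc.start()
ssc.awaitTermination()
```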
- Summary List of Operators
- Property Operators
- Structural Operators
- Join Operators
- Neighbourhood Aggregation:
- VertexRDDs
- EdgeRDDs
- PageRank
- Connected Components
- Triangle Counting
a) Overview
b) Example to demonstrate GraphX
c) PropertyGraph : Example Property Graph
d) Graph Operators:
>Aggregation Messages (aggregateMessages)
>Map Reduce Triplets Transition Guide (Legacy)
>Computing Degree Information
>Collecting Neighbours
>Caching & Uncaching
e) Pregel API
f) Graph Builders
g) Vertex & Edge RDDs:
h) Optimized Representation
i) Graph Algorithms
j) Examples
k) GraphFrames & GraphX
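GraphX itself is a JVM (Scala/Java) API; from Python the usual route is the external GraphFrames package mentioned in item k). A minimal hedged sketch, assuming the graphframes package has been added to the classpath (e.g. via spark-submit --packages) and using a made-up three-vertex graph:

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame      # external package, must be on the classpath

spark = SparkSession.builder.appName("GraphFramesDemo").getOrCreate()

vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob"), ("c", "Carol")],
                                 ["id", "name"])
edges = spark.createDataFrame([("a", "b", "follows"),
                               ("b", "c", "follows"),
                               ("c", "a", "follows")],
                              ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)          # property graph: vertex & edge DataFrames
g.inDegrees.show()                       # degree information
g.pageRank(resetProbability=0.15, maxIter=10).vertices.show()   # PageRank
```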
- How to manage & Monitor Apache Spark on Kubernetes
- Spark Submit Vs Kubernetes Operator
- How Spark Submit works with Kubernetes
- How Kubernetes Operator for Spark Works.
- Setting up of Hadoop Cluster on Docker
- Deploying MR, Sqoop & Hive jobs inside a Dockerized Hadoop environment.
- Building & running applications using PySpark API.
- PySpark SQL with mySQL (JDBC) source :
- Now that we have PySpark SQL experience with CSV and JSON, connecting to and using a MySQL database will be easy. So, let’s cover how to use PySpark SQL with Python and a MySQL database as the input data source.
- Overview
- We’re going to load some NYC Uber data into a database. Then, we’re going to fire up PySpark with a command-line argument to specify the JDBC driver needed to connect to the JDBC data source. We’ll make sure we can authenticate and then start running some queries.
- We will also go through some related concepts like caching and UDFs.
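A minimal sketch of the flow described above, assuming PySpark was launched with the MySQL JDBC driver on the classpath (e.g. via --packages or --jars); the database, table, columns, and credentials are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("UberMySQL").getOrCreate()

uber = (spark.read.format("jdbc")
        .option("url", "jdbc:mysql://localhost:3306/uber")    # hypothetical database
        .option("dbtable", "trips")                           # hypothetical table
        .option("user", "uber_user")
        .option("password", "uber_pass")
        .load())

uber.cache()                                    # caching, as mentioned above
uber.createOrReplaceTempView("trips")
spark.sql("SELECT base, SUM(trips) AS total_trips FROM trips GROUP BY base").show()
```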
A real-world business case built on top of PySpark API.
1) Web Server Log Analysis with Apache PySpark –
use PySpark to explore a NASA Apache web server log.
2) Introduction to Machine Learning with Apache PySpark – use PySpark’s MLlib machine learning library to perform collaborative filtering on a movie dataset
3) PySpark SQL with New York City Uber Trips CSV Source: PySpark SQL uses a type of Resilient Distributed Dataset called DataFrames, which are composed of Row objects accompanied by a schema. The schema describes the data types of each column. A DataFrame may be considered similar to a table in a traditional relational database.
Methodology
We’re going to use the Uber dataset and the PySpark-CSV package available from PySpark Packages to make our lives easier. The PySpark-CSV package is described as a “library for parsing and querying CSV data with Apache PySpark, for PySpark SQL and DataFrames”. This library is compatible with PySpark 1.3 and above.
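A minimal sketch of the methodology above. On Spark 1.x the external CSV package was required; on current Spark versions the built-in CSV reader shown below is the equivalent. The file path and the query are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("UberCSV").getOrCreate()

uber = (spark.read
        .option("header", "true")        # first line contains column names
        .option("inferSchema", "true")   # let Spark infer column data types
        .csv("data/uber_trips.csv"))     # hypothetical path to the Uber CSV

uber.printSchema()                       # the schema: data types of each column
uber.createOrReplaceTempView("uber")
spark.sql("SELECT COUNT(*) AS trips FROM uber").show()
```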
PySpark SQL with New York City Uber Trips CSV Source
Software Required :
- VMWare Workstation
- Ubuntu ISO image set up in a virtual environment (VMWare)
- Cloudera QuickStart VM (version 5.3.0)
- PuTTY client
- WinSCP
- Hadoop software version 2.6.6
- PySpark 2.x
Course Description
Apache Spark is an open-source big data processing framework built around speed, ease of use, and sophisticated analytics. It enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster even when running on disk. This makes it an essential technology, and everyone who wants to stay in big data engineering is keen to become an expert in Apache Spark.
The demand for Apache Spark is gaining momentum in the current job market. There is a huge shortage of analysts with hands-on experience in Apache Spark, which is also the main reason they draw handsome salaries.
This course will help you understand how Spark executes in-memory data processing and how a Spark job runs faster than a Hadoop MapReduce job. It will also help you understand the Spark ecosystem and its related APIs, such as Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX, and the Spark Core concepts. In addition, it will help you understand data analytics and machine learning algorithms applied to various datasets to process and analyze large amounts of data.
- Understanding Scala in its implementation & in Apache Spark
- OOPs concepts in Scala Programming language.
- How to build Scala & Spark using SBT.
- Scala & Spark installation.
- Spark operations on the Spark shell.
- How to submit Scala jobs in a Spark environment.
- Spark Driver & its related Worker Nodes.
- Spark + Flume Integration.
- Setting up Data Pipeline using Apache Flume, Apache Kafka & Spark Streaming
- Spark RDDs.
- Spark RDDs Actions & Transformations.
- Spark SQL : Connectivity with various relational sources & converting them into DataFrames using Spark SQL.
- Spark Streaming
- Understanding role of RDD.
- Spark MLlib : Creating classifiers & recommendation systems using MLlib.
- Spark Core concepts : Creating RDDs: Parallelized RDDs, MappedRDD, HadoopRDD, JdbcRDD.
- Spark Architecture & Components.
- Spark Structure Streaming.
- Spark SQL experience with CSV , XML & JSON.
- Reading data from different Spark sources.
- Spark SQL & Dataframes.
- Implementing Accumulators & Broadcast variables for Performance tuning.
- AWS Cloud
- Deploying BIG Data application in Production environment using Docker & Kubernetes
Our Big Data experts have realized that learning Hadoop alone doesn’t qualify candidates to clear the interview process. Interviewers’ demands and expectations from candidates are higher nowadays. They expect proficiency in advanced concepts like:
Spark Ecosystem:
- Expertise in PySpark / Scala-Spark
- Real-time data storage inside a Data Lake (real-time ETL)
- Big Data related services on AWS Cloud
- Deploying Big Data applications in a production environment using Docker & Kubernetes
- Experience in real-time Big Data projects
All the advanced-level topics will be covered at Gyansetu in a classroom/online instructor-led mode with recordings.
Knowledge of Python/Scala and SQL is good to have before starting Spark training in Gurgaon. However, Gyansetu offers a complimentary instructor-led course on Python/Scala and SQL before you start the Spark course.
- Our placement team will add Big Data Hadoop, PySpark skills & projects in your CV and update your profile on Job search engines like Naukri, Indeed, Monster, etc. This will increase your profile visibility in top recruiter search and ultimately increase interview calls by 5x.
- Our faculty offers extended support to students by clearing doubts faced during the interview and preparing them for the upcoming interviews.
- Gyansetu’s Students are currently working in Companies like Sapient, Capgemini, TCS, Sopra, HCL, Birlasoft, Wipro, Accenture, Zomato, Ola Cabs, Oyo Rooms, etc.
Gyansetu provides a complimentary placement service to all students. Gyansetu’s placement team consistently works on industry collaborations and associations, which help our students find their dream job right after the completion of training.
- Gyansetu’s trainers are well known in the industry; they are highly qualified and currently working in top MNCs.
- We provide interaction with faculty before the course starts.
- Our experts help students in learning Technology from basics, even if you are not good in basic programming skills, don’t worry! We will help you.
- Faculties will help you in preparing project reports & presentations.
- Students will be provided Mentoring sessions by Experts.
Certification
Apache Spark Certification
Reviews
Placement
Pawan
Placed In:
Myntra
Placed On – February 06, 2019
Review:
Great value for money and learning experience for me. But I liked the faculty here so I joined Big Data course.
Sumit
Placed In:
Big Basket
Placed On – March 31, 2018
Review:
I have lot of friends who studied from here and got placement. Course got completed in 4 months.
Vikram
Placed In:
Tech Mahindra
Placed On – June 16, 2017
Review:
Classes were good and I got placement as well after completion of this Course. Thanks
Enroll Now
Structure your learning and get a certificate to prove it.
Projects
Environment: Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Scala, Kafka, Hive
Tools & Techniques used : Hadoop+HBase+Spark+Flink+Beam+ML stack, Docker & KUBERNETES, Kafka, MongoDB, AVRO, Parquet
Description : The aim is to create a Batch/Streaming/ML/WebApp stack where you can test your jobs locally or submit them to the YARN resource manager. We use Docker to build the environment and Docker-Compose to provision it with the required components (next step: using Kubernetes). Along with the infrastructure, we check that it works with 4 projects that simply probe that everything is working as expected. The boilerplate is based on a sample flight-search web application.
Environment: Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Scala, Kafka, Hive, Amazon AWS, Elastic Search, Zookeeper
Tools & Techniques used : PySpark MLlib, Spark Streaming, Python (Jupyter Notebook, Anaconda), Machine Learning packages: NumPy, Pandas, Matplotlib, Seaborn, Sklearn, Random Forest and Gradient Boost, confusion matrix, Tableau
Description : Build a predictive model which will predict fraud transactions on PLCC & DC cards on a daily basis. This includes data extraction, then data cleaning, followed by data preprocessing.
• Preprocessing includes standard scaling, i.e. normalizing the data, followed by cross-validation techniques to check the compatibility of the data.
• In data modeling, we use Decision Trees with Random Forest and Gradient Boost, applying hyperparameter-tuning techniques to tune the model.
• In the end, we evaluate the model by measuring the confusion matrix, achieving an accuracy of 98%, and the trained model surfaces the fraud transactions on PLCC & DC cards on a Tableau dashboard.
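A minimal hedged sketch of the modeling steps described above using scikit-learn; the input file and column names are hypothetical, and Gradient Boost can be swapped in for the Random Forest in the same way.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

df = pd.read_csv("transactions.csv")                      # hypothetical cleaned dataset
X = StandardScaler().fit_transform(df.drop("is_fraud", axis=1))   # standard scaling
y = df["is_fraud"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
print("CV accuracy:", cross_val_score(model, X_train, y_train, cv=5).mean())   # cross-validation

model.fit(X_train, y_train)
pred = model.predict(X_test)
print(confusion_matrix(y_test, pred))                      # confusion matrix
print("Test accuracy:", accuracy_score(y_test, pred))
```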
Apache Spark & Scala Certification Training Gurgaon, Delhi – Gyansetu
Features
FAQs
- What type of technical questions are asked in interviews?
- What are their expectations?
- How should you prepare?
We have seen getting a relevant interview call is not a big challenge in your case. Our placement team consistently works on industry collaboration and associations which help our students to find their dream job right after the completion of training. We help you prepare your CV by adding relevant projects and skills once 80% of the course is completed. Our placement team will update your profile on Job Portals, this increases relevant interview calls by 5x.
Interview selection depends on your knowledge and learning. As per the past trend, the initial 5 interviews are a learning experience.
Our faculty team will constantly support you during interviews. Usually, students get job after appearing in 6-7 interviews.
- What type of technical questions are asked in interviews?
- What are their expectations?
- How should you prepare?
We have seen that getting a technical interview call is a challenge at times. Most of the time you receive sales job calls/backend job calls/BPO job calls. No worries!! Our placement team will prepare your CV in such a way that you will have a good number of technical interview calls. We will provide you with interview preparation sessions and make you job ready. Our placement team consistently works on industry collaborations and associations, which help our students find their dream job right after the completion of training. Our placement team will update your profile on job portals; this increases relevant interview calls by 3x.
Interview selection depends on your knowledge and learning. As per the past trend, the initial 8 interviews are a learning experience.
Our faculty team will constantly support you during interviews. Usually, students get job after appearing in 6-7 interviews.
- What type of technical questions are asked in interviews?
- What are their expectations?
- How should you prepare?
We have seen that getting a technical interview call is hardly possible without experience. Gyansetu provides internship opportunities to non-working students so they have some industry exposure before they appear in interviews. Internship experience adds a lot of value to your CV, and our placement team will prepare your CV in such a way that you will have a good number of interview calls. We will provide you with interview preparation sessions and make you job ready. Our placement team consistently works on industry collaborations and associations, which help our students find their dream job right after the completion of training, and we will update your profile on job portals; this increases relevant interview calls by 3x.
Interview selection depends on your knowledge and learning. As per the past trend, the initial 8 interviews are a learning experience.
Our faculty team will constantly support you during interviews. Usually, students get job after appearing in 6-7 interviews.
Yes, a one-to-one faculty discussion and demo session will be provided before admission. We understand the importance of trust between you and the trainer. We will be happy if you clear all your queries before you start classes with us.
We understand the importance of every session. Session recordings will be shared with you, and in case of any query, the faculty will give you extra time to answer your questions.
Yes, we understand that self-learning is crucial, and for the same we provide students with PPTs, PDFs, class recordings, lab sessions, etc., so that a student can get a good handle on these topics.
We provide an option to retake the course within 3 months from the completion of your course, so that you get more time to learn the concepts and do the best in your interviews.
We believe in the concept that having less students is the best way to pay attention to each student individually and for the same our batch size varies between 5-10 people.
Yes, we have batches available on weekends. We understand many students are in jobs and it's difficult to take time for training on weekdays. Batch timings need to be checked with our counsellors.
Yes, we have batches available on weekdays but in limited time slots. Since most of our trainers are working, so either the batches are available in morning hours or in the evening hours. You need to contact our counsellors to know more on this.
Total duration of the course is 80 hours (40 Hours of live instructor led training and 40 hours of self paced learning).
You don’t need to pay anyone for software installation; our faculty will provide you with all the required software and will assist you in the complete installation process.
Our faculties will help you in resolving your queries during and after the course.