
Data Analytics Course in Gurgaon, Delhi - Job Oriented

Advance your career in Data Analytics with job-focused skills in Machine Learning, Python, Big Data, and Data Visualization using Power BI. Studies estimate that over 2.5 quintillion bytes of unstructured or semi-structured data are generated every single day. Effective algorithms are required to process such humongous volumes of data: Machine Learning algorithms and statistical and mathematical techniques analyse these data sets to detect patterns and make predictions.

Instructor Led Training  |  Free Course Repeat  |  Placement Assistance  |  Job Focused Projects  |  Interview Preparation Sessions


Curriculum

Topics to be covered - Machine Learning, Big Data, Power BI

    • Statistical Learning vs. Machine Learning
    • Major Classes of Learning Algorithms - Supervised vs. Unsupervised Learning
    • Different Phases of Predictive Modelling (Data Pre-processing, Sampling, Model Building, Validation)
    • Concept of Overfitting and Underfitting (Bias-Variance Trade-off) & Performance Metrics
    • Types of Cross-Validation (Train & Test, Bootstrapping, K-Fold Validation, etc.)
    • Iteration and Model Evaluation
    • Python NumPy (Data Manipulation)
    • Python Pandas (Data Extraction & Cleansing)
    • Python Matplotlib (Data Visualization)
    • Python Scikit-Learn (Data Modelling)
    • EDA – Quantitative Technique
    • Data Exploration Techniques
    • Seaborn | Matplotlib
    • Correlation Analysis
    • Data Wrangling
    • Outlier Values in a Dataset
    • Data Manipulation
    • Missing & Categorical Data
    • Splitting the Data into Training Set & Test Set
    • Feature Scaling
    • Concept of Overfitting and Underfitting (Bias-Variance Trade-off) & Performance Metrics
    • Types of Cross-Validation (Train & Test, Bootstrapping, K-Fold Validation, etc.)
    • Basic Data Structures & Data Types in Python.
    • Working with data frames and data-handling packages.
    • Importing data from various file sources like CSV, TXT, Excel, HDFS and other file types.
    • Reading and analysis of data and various operations for data analysis.
    • Exporting files into different formats.
    • Data Visualization and concept of tidy data.
    • Handling Missing Information.
    • Calls Data Capstone Project
    • Finance Project: Perform EDA of stock prices. We will focus on bank stocks (JPMorgan, Bank of America, Goldman Sachs, Morgan Stanley, Wells Fargo) and see how they progressed through the financial crisis all the way to early 2016.
    • Fundamentals of descriptive statistics and hypothesis testing (t-test, z-test).
    • Probability distributions and analysis of variance.
    • Correlation and Regression.
    • Linear Modeling.
    • Advanced Analytics.
    • Poisson and Logistic Regression
    • Feature Selection
    • Principal Component Analysis (PCA)
    • Linear Discriminant Analysis (LDA)
    • Kernel PCA
    • Feature Reduction
    • Simple Linear Regression
    • Multiple Linear Regression
    • Perceptron Algorithm
    • Regularization
    • Recursive Partitioning (Decision Trees)
    • Ensemble Models (Random Forest, Bagging & Boosting (AdaBoost, GBM))
    • Ensemble Learning Methods
    • Working of AdaBoost
    • AdaBoost Algorithm & Flowchart
    • Gradient Boosting
    • XGBoost
    • Polynomial Regression
    • Support Vector Regression (SVR)
    • Decision Tree Regression
    • Evaluating Regression Models Performance
    • Logistic Regression
    • K-Nearest Neighbours (K-NN)
    • Support Vector Machine (SVM)
    • Kernel SVM
    • Naive Bayes
    • Decision Tree Classification
    • Random Forest Classification
    • Evaluating Classification Models Performance
    • K-Means Clustering
    • Challenges of Unsupervised Learning and beyond K-Means
    • Hierarchical Clustering
    • Purpose of Recommender Systems
    • Collaborative Filtering
    • Market Basket Analysis
    • Content-Based Recommendation Engine
    • Popularity Based Recommendation Engine
    • Anomaly Detection and Time Series Analysis
    • Upper Confidence Bound (UCB)
    • Thompson Sampling
    • Spacy Basics
    • Tokenization
    • Stemming
    • Lemmatization
    • Stop-Words
    • Vocabulary-and-Matching
    • NLP-Basics Assessment
    • TF-IDF
    • Understanding Word Vectors
    • Training the Word2Vec Model
    • Exploring Pre-trained Models
    • POS-Basics
    • Visualizing POS
    • NER-Named-Entity-Recognition
    • Visualizing NER
    • Sentence Segmentation
    • Power BI Installation
    • Power BI Desktop
    • Data Imports into Power BI
    • Views in Power BI
    • Building a Dashboard
    • Publishing Reports
    • Creating Dashboard in Power BI
    • Power Query
    • Power Pivot
    • Power View
    • Power Map
    • Power BI Service
    • Power BI Q&A
    • Data Management Gateway
    • Data Catalog
    • Connect to DataSources
    • Clean and Transform using Query Editor
    • Advanced Datasources and transformations
    • Cleaning irregularly formatted data.
    • Bins
    • Change Datatype of column
    • Combine Multiple tables
    • Clusters
    • Format dates
    • Groups
    • Hierarchies
    • Joins
    • Pivot table
    • Query Groups
    • Split Columns
    • Unpivot table
    • How Data Model Looks Like
    • Database Normalization
    • Data Tables vs Lookup Tables
    • Primary Key vs Foreign Key
    • Relationships vs Merged Table
    • Creating Table Relationships
    • Creating Snowflake Schemas
    • Managing & Editing Relationships
    • Active vs Inactive Relationships
    • Relationship Cardinality
      • Many to Many
      • One to One
    • Connecting Multiple Data Tables
    • Filter Flow
    • Two Way Filter
    • Two Way Filters Conditions
    • Hiding Fields from Report View
    • Area Chart
    • Bar chart
    • Card
    • Clustered Bar chart
    • Clustered Column chart
    • Donut chart
    • Funnel chart
    • Heat Map
    • Line Chart
    • Clustered Column and Line chart
    • Line and Stacked Column chart
    • Matrix
    • Multi-Row Card
    • Pie chart
    • Ribbon chart
    • Stacked Area chart
    • Scatter Chart
    • Stacked Bar chart
    • Waterfall chart
    • Map
    • Filled Map
    • Slicer Basic
    • Filters
    • Advanced Filters
    • Top N Filters
    • Filters on Measures
    • Page-Level Filters
    • Report Level Filters
    • Drill through Filters
    • Visuals in PowerBI
    • Create and Customize simple visualizations
    • Combination charts
    • Slicers
    • Map visualizations
    • Matrixes and tables
    • Scatter charts
    • Waterfall and funnel charts
    • Gauges and single-number cards
    • Modify colors in charts and visuals
    • Z-Order
    • Duplicate a report page
    • Power BI Service
    • Quick Insights in Power BI
    • Create and configure a dashboard
    • Share dashboards with organizations
    • Install and configure the personal gateway
    • Using Excel Data in PowerBI
    • Upload Excel Data to PowerBI
    • Import Power View and Power Pivot to PowerBI
    • Connect OneDrive for Business to PowerBI
    • Introduction to content packs, security, and groups
    • Publish PowerBI Desktop reports
    • Print and export dashboards and reports
    • Manually republish and refresh your data
    • Introduction to Power BI Mobile
    • Creating groups in PowerBI
    • Build content packs
    • Use content packs
    • Update content packs
    • Integrate OneDrive for Business with PowerBI
    • Publish to web

     


    • Introduction to DAX
    • DAX calculation types
    • DAX functions:

                           a) Aggregate Functions

                           b) Date Functions

                           c) Logical Functions

                           d) String Functions

                           e) Trigonometric Functions


    • Using variables in DAX expressions
    • Table relationships and DAX
    • DAX tables and filtering




    • Power BI SQL Server Integration
    • Power BI MySQL Integration
    • Power BI Excel Integration
    • R Integration with Power BI Desktop
    • Objects & Charts
    • Formatting Charts
    • Report Interactions
    • Bookmarks
    • Managing Roles
    • Custom Visuals
    • Desktop vs Phone Layout

    This module will help you understand how to configure a Hadoop Cluster on the AWS Cloud (a short cluster-creation sketch in Python follows the topic list):

    1. Introduction to Amazon Elastic MapReduce
    2. AWS EMR Cluster
    3. AWS EC2 Instance: Multi-Node Cluster Configuration
    4. AWS EMR Architecture
    5. Web Interfaces on Amazon EMR
    6. Amazon S3
    7. Executing MapReduce Job on EC2 & EMR
    8. Apache Spark on AWS, EC2 & EMR
    9. Submitting Spark Job on AWS
    10. Hive on EMR
    11. Available Storage types: S3, RDS & DynamoDB
    12. Apache Pig on AWS EMR
    13. Processing NY Taxi Data using Spark on Amazon EMR
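    As a taste of what cluster provisioning looks like in practice, here is a minimal sketch of launching an EMR cluster with Spark and Hive from Python using boto3. The cluster name, region, release label, instance types and S3 log bucket are illustrative placeholders, not values prescribed by the course.

```python
import boto3

# Assumes AWS credentials are configured locally; all names below are placeholders.
emr = boto3.client("emr", region_name="ap-south-1")

response = emr.run_job_flow(
    Name="demo-emr-cluster",
    ReleaseLabel="emr-6.10.0",                    # illustrative EMR release
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    LogUri="s3://my-demo-bucket/emr-logs/",       # hypothetical S3 bucket
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,                       # 1 master + 2 core nodes
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster started:", response["JobFlowId"])
```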

    This module will help you understand Big Data:

    1. Common Hadoop ecosystem components
    2. Hadoop Architecture
    3. HDFS Architecture
    4. Anatomy of File Write and Read
    5. How MapReduce Framework works
    6. Hadoop high-level Architecture
    7. MR2 Architecture
    8. Hadoop YARN
    9. Hadoop 2.x core components
    10. Hadoop Distributions
    11. Hadoop Cluster Formation

    This module will help you to understand Hadoop & HDFS Cluster Architecture:

    1. Configuration files in Hadoop Cluster (FSimage & edit log file)
    2. Setting up of Single & Multi-node Hadoop Cluster
    3. HDFS File permissions
    4. HDFS Installation & Shell Commands
    5. Daemons of HDFS
      1. Node Manager
      2. Resource Manager
      3. NameNode
      4. DataNode
      5. Secondary NameNode
      6. YARN Daemons
    6. HDFS Read & Write Commands
    7. NameNode & DataNode Architecture
    8. HDFS Operations
    9. Hadoop MapReduce Job
    10. Executing MapReduce Job

    This module will help you to understand the Hadoop MapReduce framework (a minimal Hadoop Streaming word-count sketch in Python follows the topic list):

    1. How MapReduce works on HDFS data sets
    2. MapReduce Algorithm
    3. MapReduce Hadoop Implementation
    4. Hadoop 2.x MapReduce Architecture
    5. MapReduce Components
    6. YARN Workflow
    7. MapReduce Combiners
    8. MapReduce Partitioners
    9. MapReduce Hadoop Administration
    10. MapReduce APIs
    11. Input Split & String Tokenizer in MapReduce
    12. MapReduce Use Cases on Data sets
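    To make the MapReduce flow concrete, below is a minimal Hadoop Streaming word-count sketch in Python: the mapper emits (word, 1) pairs and the reducer sums counts per key. File names and HDFS paths are placeholders.

```python
#!/usr/bin/env python3
# mapper.py - Hadoop Streaming mapper: emits "<word>\t1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py - Hadoop Streaming reducer: input arrives sorted by key, so counts
# for the same word are adjacent and can be summed in a single pass
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

# Submitted with Hadoop Streaming, for example (paths are placeholders):
# hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#   -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py \
#   -input /data/input -output /data/wordcount-output
```
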
    1. Hive Installation
    2. Hive Data types
    3. Hive Architecture & Components
    4. Hive Meta Store
    5. Hive Tables(Managed Tables and External Tables)
    6. Hive Partitioning & Bucketing
    7. Hive Joins & Sub Query
    8. Running Hive Scripts
    9. Hive Indexing & View
    10. Hive Queries (HQL); Order By, Group By, Distribute By, Cluster By, Examples
    11. Hive Functions: Built-in & UDF (User Defined Functions)
    12. Hive ETL: Loading JSON, XML, Text Data Examples
    13. Hive Querying Data
    14. Hive Tables (Managed & External Tables)
    15. Hive Use Cases
    16. Hive Optimization Techniques
      1. Partitioning (Static & Dynamic Partitions) & Bucketing
      2. Hive Joins: Map + BucketMap + SMB (Sorted Bucket Map) + Skew
      3. Hive File Formats (ORC + SEQUENCE + TEXT + AVRO + PARQUET)
      4. CBO
      5. Vectorization
      6. Indexing (Compact + BitMap)
      7. Integration with TEZ & Spark
    17. Hive SerDe (Custom + Built-in)
    18. Hive integration NoSQL (HBase + MongoDB + Cassandra)
    19. Thrift API (Thrift Server)
    20. Hive LATERAL VIEW
    21. Incremental Updates & Import in Hive; Hive Functions:
      1. LATERAL VIEW EXPLODE
      2. LATERAL VIEW JSON_TUPLE, and others
    22. Hive SCD Strategies: 1) Type-1  2) Type-2  3) Type-3
    23. UDF, UDTF & UDAF
    24. Hive Multiple Delimiters
    25. XML & JSON Data Loading HIVE.
    26. Aggregation & Windowing Functions in Hive
    27. Hive integration NoSQL(HBase + MongoDB + Cassandra)
    28. Hive Connect with Tableau
    1. Sqoop Installation
    2. Loading Data from RDBMS using Sqoop
    3. Fundamentals & Architecture of Apache Sqoop
    4. Sqoop Tools
      1. Sqoop Import & Import-All-Table
      2. Sqoop Job
      3. Sqoop Codegen
      4. Sqoop Incremental Import & Incremental Export
      5. Sqoop  Merge
      6. Sqoop : Hive Import
      7. Sqoop Metastore
      8. Sqoop Export
    5. Import Data from MySQL to Hive using Sqoop
    6. Sqoop: Hive Import
    7. Sqoop Metastore
    8. Sqoop Use Cases
    9. Sqoop- HCatalog Integration
    10. Sqoop Script
    11. Sqoop Connectors
    12. Batch Processing in Sqoop
    13. SQOOP Incremental Import
    14. Boundary Queries in Sqoop
    15. Controlling Parallelism in Sqoop
    16. Import Join Tables from SQL databases to Warehouse using Sqoop
    17. Sqoop Hive/HBase/HDFS integration
    1. Pig Architecture
    2. Pig Installation
    3. Pig Grunt shell
    4. Pig Running Modes
    5. Pig Latin Basics
    6. Pig LOAD & STORE Operators
    7. Diagnostic Operators
      1. DESCRIBE Operator
      2. EXPLAIN Operator
      3. ILLUSTRATE Operator
      4. DUMP Operator
    8. Grouping & Joining
      1. GROUP Operator
      2. COGROUP Operator
      3. JOIN Operator
      4. CROSS Operator
    9. Combining & Splitting
      1. UNION Operator
      2. SPLIT Operator
    10. Filtering
      1. FILTER Operator
      2. DISTINCT Operator
      3. FOREACH Operator
    11. Sorting
      1. ORDER BY Operator
      2. LIMIT Operator
    12. Built-in Functions
      1. EVAL Functions
      2. LOAD & STORE Functions
      3. Bag & Tuple Functions
      4. String Functions
      5. Date-Time Functions
      6. MATH Functions
    13. Pig UDFs (User Defined Functions)
    14. Pig Scripts in Local Mode
    15. Pig Scripts in MapReduce Mode
    16. Analysing XML Data using Pig
    17. Pig Use Cases (Data Analysis on Social Media sites, Banking, Stock Market & Others)
    18. Analysing JSON data using Pig
    19. Testing Pig Scripts
    1. KAFKA Confluent HUB
    2. KAFKA Confluent Cloud
    3. KStream APIs
    4. Difference between Apache Kafka & Confluent Kafka
    5. KSQL (SQL Engine for Kafka)
    6. Developing Real-time application using KStream APIs
    7. KSQL (SQL Engine for Kafka)
    8. Kafka Connectors
    9. Kafka REST Proxy
    10. Kafka Offsets
    1. Oozie Introduction
    2. Oozie Workflow Specification
    3. Oozie Coordinator Functional Specification
    4. Oozie H-catalog Integration
    5. Oozie Bundle Jobs
    6. Oozie CLI Extensions
    7. Automate MapReduce, Pig, Hive, Sqoop Jobs using Oozie
    8. Packaging & Deploying an Oozie Workflow Application
    1. Apache Airflow Installation
    2. Work Flow Design using Airflow
    3. Airflow DAG
    4. Module Import in Airflow
    5. Airflow Applications
    6. Docker Airflow
    7. Airflow Pipelines
    8. Airflow KUBERNETES Integration
    9. Automating Batch & Real Time Jobs using Airflow
    10. Data Profiling using Airflow
    11. Airflow Integration:
      1. AWS EMR
      2. AWS S3
      3. AWS Redshift
      4. AWS DynamoDB
      5. AWS Lambda
      6. AWS Kinesis
    12. Scheduling of PySpark Jobs using Airflow
    13. Airflow Orchestration
    14. Airflow Schedulers & Triggers
    15. Gantt Chart in Apache Airflow
    16. Executors in Apache Airflow
    17. Airflow Metrics
    1. Spark RDDs: Actions & Transformations.
    2. Spark SQL: Connectivity with various relational sources & converting them into Data Frames using Spark SQL.
    3. Spark Streaming
    4. Understanding the role of RDDs
    5. Spark Core concepts: Creating RDDs - Parallelized RDDs, MappedRDD, HadoopRDD, JdbcRDD.
    6. Spark Architecture & Components.


      • AWS Lambda:
        • AWS Lambda Introduction
        • Creating Data Pipelines using AWS Lambda & Kinesis
        • AWS Lambda Functions
        • AWS Lambda Deployment


      • AWS GLUE :
        • GLUE Context
        • AWS Data Catalog
        • AWS Athena
        • AWS QuickSight


      • AWS Kinesis
      • AWS S3
      • AWS Redshift
      • AWS EMR & EC2
      • AWS ECR & AWS Kubernetes
    1. How to manage & Monitor Apache Spark on Kubernetes
    2. Spark Submit Vs Kubernetes Operator
    3. How Spark Submit works with Kubernetes
    4. How Kubernetes Operator for Spark Works.
    5. Setting up of Hadoop Cluster on Docker
    6. Deploying MR, Sqoop & Hive Jobs inside a Dockerized Hadoop environment.

    1) Docker Installation

    2) Docker Hub

    3) Docker Images

    4) Docker Containers & Shells

    5) Working with Docker Containers

    6) Docker Architecture

    7) Docker Push & Pull containers

    8) Docker Container & Hosts

    9) Docker Configuration

    10) Docker Files (DockerFile)

    11) Docker Building Files

    12) Docker Public Repositories

    13) Docker Private Registries

    14) Building WebServer using DockerFile

    15) Docker Commands

    16) Docker Container Linking

    17) Docker Storage

    18) Docker Networking

    19) Docker Cloud

    20) Docker Logging

    21) Docker Compose

    22) Docker Continuous Integration

    23) Docker Kubernetes Integration

    24) Docker Working of Kubernetes

    25) Docker on AWS

    1) Overview

    2) Learn Kubernetes Basics

    3) Kubernetes Installation

    4) Kubernetes Architecture

    5) Kubernetes Master Server Components

    a) etcd

    b) kube-apiserver

    c) kube-controller-manager

    d) kube-scheduler

    e) cloud-controller-manager


    6) Kubernetes Node Server Components

    a) A container runtime

    b) kubelet

    c) kube-proxy


    7) Kubernetes Objects & Workloads

    a) Kubernetes Pods

    b) Kubernetes Replication Controller & Replica Sets


    8) Kubernetes Images

    9) Kubernetes Labels & Selectors

    10) Kubernetes Namespace

    11) Kubernetes Service

    12) Kubernetes Deployments

    a) Stateful Sets

    b) Daemon Sets

    c) Jobs & Cron Jobs


    13) Other Kubernetes Components:

    a) Services

    b) Volume & Persistent Volumes

    c) Labels, Selectors & Annotations

    d) Kubernetes Secrets

    e) Kubernetes Network Policy


    14) Kubernetes API

    15) Kubernetes Kubectl

    16) Kubernetes Kubectl Commands

    17) Kubernetes Creating an App

    18) Kubernetes App Deployment

    19) Kubernetes Autoscaling

    20) Kubernetes Dashboard Setup

    21) Kubernetes Monitoring

    22) Federation using kubefed

    • CREATING DATA PIPELINES USING CLOUD & ON-PREMISE INFRASTRUCTURES
    • HYBRID INFRASTRUCTURES
    • Automation Test Cases (TDD & BDD Test Cases for Spark Applications)
    • Data Security
    • Data Governance
    • Deployment Automation using CI/CD Pipelines with Docker & KUBERNETES.
    • DESIGNING BATCH & REAL-TIME APPLICATION
    • LATENCY & OPTIMIZATION in REAL-TIME APPLICATIONS
    • Resolving Latency & Optimization issues using other Streaming Platforms like :

    Overview of Hadoop & Apache PySpark 

    This will cover the introduction to the Hadoop ecosystem with a little insight into Apache Hive.


    Introduction to Big Data and Data Science

    • Learn about big data and see examples of how data science can leverage big data
    • Performing Data Science and Preparing Data – explore data science definitions and topics, and the process of preparing data

    1) Introduction

    2) PySpark Installation

    3) PySpark & its Features

        > Speed

        > Reusability

        > Advanced Analytics

        > In-Memory Computation

        > Real-Time Stream Processing

        > Lazy Evaluation

        > Dynamic in Nature

        > Immutability

        > Partitioning

    4) PySpark with Hadoop

    5) PySpark Components

    6) PySpark Architecture
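    A minimal sketch of what several of these features look like in code, assuming a local `pip install pyspark`: a SparkSession is created, a transformation is recorded lazily, and an action triggers the in-memory computation.

```python
from pyspark.sql import SparkSession

# Entry point for a local PySpark session (names are illustrative)
spark = SparkSession.builder.appName("pyspark-intro").master("local[*]").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize(range(1, 11))          # an RDD, partitioned across local cores
squares = nums.map(lambda x: x * x)          # transformation: lazily recorded, nothing runs yet
print(squares.reduce(lambda a, b: a + b))    # action: triggers the in-memory computation

spark.stop()
```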

    1) RDD Overview

    2) RDD Types (Ways to Create RDD):

    • Parallelized RDDs (Pair RDDs) - RDDs created from collection objects; for example, using map() & flatMap() so that we are able to perform groupByKey() & other pair operations.
    • RDDs from External Datasets (CSV, JSON, XML, ...)
      • rowRDD
      • schemaRDD (adding a schema to an RDD, in particular DataFrames)

    3) RDD Operations:

    • Basic operations like map() & flatMap() to convert an RDD into a pair RDD.
    • Actions
    • Transformations - narrow vs. wide (a short PySpark sketch follows the two lists below):

    Narrow Transformations (no shuffle; each output partition depends on a single input partition): map(), flatMap(), mapPartitions(), filter(), sample(), union(), coalesce()

    Wide Transformations (require a shuffle across partitions): intersection(), distinct(), reduceByKey(), groupByKey(), join(), cartesian(), repartition()
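    A short PySpark sketch of the distinction, assuming a local SparkSession: building the pair RDD uses only narrow transformations, while reduceByKey() and repartition() trigger shuffles.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-transformations").master("local[*]").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["a b", "b c", "a c"])

# Narrow transformations: each output partition depends on one input partition
pairs = lines.flatMap(lambda line: line.split()).map(lambda word: (word, 1))

# Wide transformation: reduceByKey shuffles data between partitions
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.collect())            # action

wide = counts.repartition(4)       # wide: full shuffle into 4 partitions
narrow = wide.coalesce(2)          # narrow: merges partitions without a full shuffle
print(narrow.getNumPartitions())
```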


    Passing Functions to PySpark

    Working with Key-Value Pairs (pairRDDs)

    ShuffledRDD: Shuffle Operations - Background & Performance Impact

    RDD Persistence & Unpersist

    PySpark Deployment :

        1) Local

        2) Cluster Modes:

            a) Client mode

            b) Cluster mode

    Shared Variables:

        1) Broadcast Variables

        2) Accumulators

    Launching PySpark Jobs from Java/Scala

    Unit Test Cases

    PySpark API :

        1) PySpark Core Module

        2) PySpark SQL Module

        3) PySpark Streaming Module

        4) PySpark ML Module

    Integrating with various Datasources (SQL+ NoSQL + Cloud)

    Dataframes: 

    1) Overview

    • An introduction to the PySpark framework in cluster computing space.
    • Reading data from different PySpark sources.
    • PySpark Dataframes Actions & Transformations.
    • PySpark SQL & Dataframes
    • Basic Hadoop functionality
    • Windowing functions in Hive
    • PySpark Architecture and Components
    • PySpark SQL and Dataframes
    • PySpark Data frames
    • PySpark SQL
    • Running PySpark on cluster
    • PySpark Performance Tips


    2) Creating DataFrames using SparkSession

    3) Untyped Transformations

    4) Running SQL Queries Programmatically

    5) Global Temporary View

    6) Interoperating with RDDs:

    1. Inferring the Schema using Reflection
    2. Programmatically specifying the Schema (both approaches sketched below)
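    A minimal sketch of the two approaches, assuming a local SparkSession; names and values are illustrative only.

```python
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("df-schema").getOrCreate()

# 1) Inferring the schema using reflection: build Row objects and let Spark infer types
rows = spark.sparkContext.parallelize([Row(name="Asha", age=31), Row(name="Ravi", age=28)])
df_reflect = spark.createDataFrame(rows)

# 2) Programmatically specifying the schema with StructType/StructField
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df_explicit = spark.createDataFrame([("Asha", 31), ("Ravi", 28)], schema)

df_explicit.createOrReplaceTempView("people")      # running SQL queries programmatically
spark.sql("SELECT name FROM people WHERE age > 30").show()
```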


    7) Aggregations:

    1. Untyped User-Defined Aggregate Functions
    2. Type-Safe User-Defined Aggregate Functions

    1) Generic LOAD/Save Functions

    • Manually Specifying Options
    • Run SQL on files directly (Avro/Parquet files)
    • Saving to Persistent Tables (persistent tables in the Hive Metastore)
    • Bucketing, Sorting & Partitioning (repartition() & coalesce()) - a short sketch follows
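    A minimal sketch of these generic load/save patterns. The file names, table names and columns (`id`, `country`) are hypothetical, and bucketBy()/saveAsTable() assume Hive support is enabled.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-save").enableHiveSupport().getOrCreate()

df = spark.read.format("json").load("users.json")                  # manually specifying the format
df.write.format("parquet").mode("overwrite").save("users_parquet")

spark.sql("SELECT * FROM parquet.`users_parquet`").show()          # run SQL on files directly

# Saving to a persistent table in the Hive Metastore, with bucketing & sorting
df.write.bucketBy(4, "id").sortBy("id").mode("overwrite").saveAsTable("users_bucketed")

# Partitioning on disk; repartition()/coalesce() control in-memory partitions before the write
df.repartition(8).write.partitionBy("country").mode("overwrite").parquet("users_by_country")
```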

     2) Parquet Files

    1. Loading Data Programmatically
    2. Partition Discovery
    3. Schema Merging
    4. Hive Metastore Parquet Table Conversion
    • Hive/Parquet Schema Reconciliation
    • Metadata Refreshing
    • Metadata Refreshing

    5. Configuration


    3) ORC Files

    4) JSON Datasets

    5) Hive Tables:

    • Specifying storage format for Hive Tables
    • Interacting with various versions of Hive Metastore.

    a) JDBC to other databases (SQL+ NoSQL databases)

    b) Troubleshooting

    c) PySpark SQL Optimizer

    d) Transforming Complex Datatypes

    e) Handling Bad Records & Files

    f) Task Preemption for High Concurrency (optional; only for Databricks Cloud)


    g) Handling Large Queries in Interactive Flows

    • Only for Databricks Cloud (using Query Watchdog)


    h) Skew Join Optimization

    i) PySpark UDF Scala 

    j) PySpark UDF Python

    k) PySpark UDAF Scala

    l) Performance Tuning/Optimization of PySpark Jobs (a short sketch follows this list):

    • Caching Data in Memory
    • Other Configurations like:
      > maxPartitionBytes
      > openCostInBytes
      > broadcastTimeout
      > autoBroadcastJoinThreshold
      > partitions
    • BHJ (BroadcastHashJoin)
    • Serializers
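    A short tuning sketch tying a few of these knobs together: adjusting shuffle/broadcast configuration (the items above correspond to `spark.sql.*` settings), caching a DataFrame in memory, and hinting a BroadcastHashJoin. The values shown are illustrative, not recommendations.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("tuning").getOrCreate()

# Illustrative settings only
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))

big = spark.range(1_000_000).withColumnRenamed("id", "key")
small = spark.range(100).withColumnRenamed("id", "key")

big.cache()                                 # caching data in memory
joined = big.join(broadcast(small), "key")  # hint a BroadcastHashJoin (BHJ)
joined.explain()                            # inspect the physical plan
```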


    m) Distributed SQL Engine: (Accessing PySpark SQL using JDBC/ODBC using Thrift API)

    1.  Running the Thrift JDBC/ODBC server
    2.  Running the PySpark SQL CLI (Accessing PySpark SQL using SQL shell)


    n) PySpark for Pandas using Apache Arrow:

    1. Apache Arrow in PySpark (to convert PySpark DataFrames into pandas DataFrames)
    2. Conversion to/from Pandas
    3. Pandas UDFs
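    A minimal sketch, assuming Spark 3.x (on Spark 2.x the Arrow flag is `spark.sql.execution.arrow.enabled`): Arrow speeds up the toPandas()/createDataFrame() conversions, and pandas_udf() defines a vectorised UDF. Data and column names are illustrative.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("arrow-demo").getOrCreate()

# Enable Arrow-based columnar transfer between Spark and pandas
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

sdf = spark.createDataFrame([("Asha", 31), ("Ravi", 28)], ["name", "age"])

pdf = sdf.toPandas()                   # Spark DataFrame -> pandas DataFrame
sdf2 = spark.createDataFrame(pdf)      # pandas DataFrame -> Spark DataFrame

@pandas_udf("long")                    # vectorised (pandas) UDF operating on pd.Series
def add_one(age: pd.Series) -> pd.Series:
    return age + 1

sdf.select("name", add_one("age").alias("age_plus_one")).show()
```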


    o) PySpark SQL compatibility with Apache Hive

    p) Working with DataFrames Python & Scala

    q) Connectivity with various DataSources

    a) Introduction to DataSets

    b) DataSet API - DataSets Operators

    c) DataSet API - Typed Transformations

    d) Datasets API - Basic Actions

    e) DataSet API - Structured Query with Encoder

    f) Windowing & Aggregation DataSet API

    g) DataSet Checking & Persistence

    h) DataSet Checkpointing

    i) Datasets vs. DataFrames vs. RDDs

    a) Overview

    b) Example to demonstrate DStreams

    c) PySpark Streaming Basic Concepts:

    1. Linking
    2. Initialize Streaming Context
    3. Discretized Streams (DStreams)
    4. Input DStreams & Receivers
    5. Transformations on DStreams
    6. Output Operations on DStreams: pprint() (print() in Scala/Java), saveAsTextFiles(), saveAsObjectFiles(), saveAsHadoopFiles(), foreachRDD()
    7. Dataframe & SQL Operations on DStreams (Converting DStreams into Dataframe)
    8. Streaming Machine Learning on streaming data
    9. Caching/Persistence
    10. Checkpointing
    11. Accumulators, Broadcast Variables, and Checkpoints
    12. Deploying PySpark Streaming Applications.
    13. Monitoring PySpark Streaming Applications
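    A classic DStream word-count sketch covering several of the concepts above (StreamingContext, a socket input DStream, transformations, an output operation and checkpointing). The host, port and checkpoint directory are placeholders; a test source can be started with `nc -lk 9999`.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, batchDuration=5)        # 5-second micro-batches
ssc.checkpoint("checkpoint_dir")                   # hypothetical checkpoint path

lines = ssc.socketTextStream("localhost", 9999)    # input DStream from a socket source
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                    # output operation

ssc.start()
ssc.awaitTermination()
```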


    d) Performance Tuning PySpark Streaming Applications:

    1. Reducing the Batch Processing Times
    2. Setting the Right Batch Interval
    3. Memory Tuning


    e) Fault Tolerance Semantics

    f) Integration:

    1. Kafka Integration
    2. Kinesis Integration
    3. Flume Integration


    g) Custom Receivers: creating a client/server application with PySpark Streaming.


    a) Overview

    b) Example to demonstrate GraphX

    c) PropertyGraph: Example Property Graph

    d) Graph Operators:

    1. Summary List of Operators
    2. Property Operators
    3. Structural Operators
    4. Join Operators
    5. Neighborhood Aggregation:
      > Aggregating Messages (aggregateMessages)
      > Map Reduce Triplets Transition Guide (Legacy)
      > Computing Degree Information
      > Collecting Neighbours

    6. Caching & Uncaching


    e) Pregel API

    f) Graph Builders

    g) Vertex & Edge RDDs:

    1. VertexRDDs
    2. EdgeRDDs


    h) Optimized Representation

    i) Graph Algorithms

    1. PageRank
    2. Connected Components
    3. Triangle Counting


    j) Examples

    k) GraphFrames & GraphX
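    Since GraphX itself is Scala-only, the Python-facing equivalent is GraphFrames. The sketch below assumes the external `graphframes` package is available (for example supplied via `--packages` when launching PySpark); the graph data and checkpoint directory are illustrative.

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame     # requires the graphframes package on the classpath

spark = SparkSession.builder.appName("graphframes-demo").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/gf-checkpoints")   # needed by connectedComponents()

vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows"), ("c", "a", "follows")],
    ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)

g.pageRank(resetProbability=0.15, maxIter=10).vertices.show()   # PageRank
g.triangleCount().show()                                        # Triangle Counting
g.connectedComponents().show()                                  # Connected Components
```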

    • How to manage & Monitor Apache Spark on Kubernetes
    • Spark Submit Vs Kubernetes Operator
    • How Spark Submit works with Kubernetes
    • How Kubernetes Operator for Spark Works.
    • Setting up of Hadoop Cluster on Docker
    • Deploying MR, Sqoop & Hive Jobs inside a Dockerized Hadoop environment.
    • Building & running applications using the PySpark API.
    • PySpark SQL with a MySQL (JDBC) source:
    • Now that we have PySpark SQL experience with CSV and JSON, connecting and using a MySQL database will be easy. So, let's cover how to use PySpark SQL with Python and a MySQL database as the input data source.
    • Overview
    • We’re going to load some NYC Uber data into a database. Then, we’re going to fire up PySpark with a command-line argument to specify the JDBC driver needed to connect to the JDBC data source. We’ll make sure we can authenticate and then start running some queries.
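    A minimal sketch of the JDBC read described above; the database URL, table, credentials and driver class are placeholders, and the MySQL connector jar must be supplied (e.g. via `--jars`) when launching PySpark.

```python
from pyspark.sql import SparkSession

# Launch with the MySQL JDBC driver on the classpath, e.g.
#   pyspark --jars mysql-connector-j.jar
spark = SparkSession.builder.appName("uber-jdbc").getOrCreate()

trips = (spark.read.format("jdbc")
         .option("url", "jdbc:mysql://localhost:3306/uber")   # hypothetical database
         .option("dbtable", "trips")                          # hypothetical table
         .option("user", "analyst")
         .option("password", "secret")
         .option("driver", "com.mysql.cj.jdbc.Driver")
         .load())

trips.cache()                                # caching, as mentioned above
trips.createOrReplaceTempView("trips")
spark.sql("SELECT COUNT(*) AS n FROM trips").show()
```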

    We will also go through some concepts like caching and UDFs.

    A real-world business case built on top of PySpark API. 

    1) Web Server Log Analysis with Apache PySpark – use PySpark to explore a NASA Apache web server log.

    2) Introduction to Machine Learning with Apache PySpark – use PySpark’s MLlib Machine Learning library to perform collaborative filtering on a movie dataset

    3) PySpark SQL with New York City Uber Trips CSV Source: PySpark SQL uses a type of Resilient Distributed Dataset called DataFrames which are composed of Row objects accompanied by a schema. The schema describes the data types of each column. A DataFrame may be considered similar to a table in a traditional relational database.

    Methodology

    We’re going to use the Uber dataset and the PySpark-CSV package available from PySpark Packages to make our lives easier. The PySpark-CSV package is described as a “library for parsing and querying CSV data with Apache PySpark, for PySpark SQL and DataFrames”. This library is compatible with PySpark 1.3 and above (a short read-and-query sketch follows).
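    Note that on Spark 2.x and later the CSV reader is built in, so the external CSV package is only needed on very old versions. A minimal read-and-query sketch, with a hypothetical file name and column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("uber-csv").getOrCreate()

uber = (spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("uber_trips.csv"))             # hypothetical file name

uber.printSchema()                          # the schema describes each column's data type
uber.createOrReplaceTempView("uber")

# "base" is a hypothetical column name used only for illustration
spark.sql("SELECT base, COUNT(*) AS trips FROM uber GROUP BY base ORDER BY trips DESC").show()
```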

     PySpark SQL with New York City Uber Trips CSV Source

    Software Required :

    ·         VMWare Workstation

    ·         Ubuntu ISO Image setup on Virtual Environment(VMWare)

    ·         Cloudera Quickstart VM (version: 5.3.0)

    ·         Putty Client

    ·         WinSCP

    ·         Hadoop software version 2.6.6

    ·         PySpark 2.x

     

Course Description

    Data Analytics Certification is designed to build your expertise in Data Science concepts like Machine Learning, Python, Big Data Analytics using Spark, and reporting tools like Power BI and Tableau. The course is designed for professionals looking for a 360-degree grounding in Data Engineering, Data Analytics and Data Visualization.

    After the completion of Data Analytics Course, you will be able to:

    • Understand Scala & Apache Spark implementation
    • Spark operations on Spark Shell
    • Spark Driver & its related Worker Nodes
    • Spark + Flume Integration
    • Setting up Data Pipeline using Apache Flume, Apache Kafka & Spark Streaming
    • Spark RDDs and Spark Streaming
    • Spark MLlib: Creating Classifiers & Recommendation systems using MLlib
    • Spark Core concepts: Creating RDDs - Parallelized RDDs, MappedRDD, HadoopRDD, JdbcRDD.
    • Spark Architecture & Components
    • Spark SQL experience with CSV , XML & JSON
    • Reading data from different Spark sources
    • Spark SQL & Dataframes
    • Develop and Implement various Machine Learning Algorithms in daily practices & Live Environment
    • Building Recommendation systems and Classifiers
    • Perform various type of Analysis (Prediction & Regression)
    • Implement plotting & graphs using various Machine Learning Libraries
    • Import data from HDFS & Implement various Machine Learning Models
    • Building different Neural networks using NumPy and TensorFlow
    • Power BI Visualization
    • Power BI Components
    • Power BI Transformations
    • DAX functions
    • Data Exploration and Mapping
    • Designing Dashboards
    • Time Series, Aggregation & Filters

    We at Gyansetu understand that teaching any course is not difficult, but making someone job-ready is the essential task. That's why we have prepared capstone projects which will drive your learning through real-time industry scenarios and help you clear interviews.

    All the advanced level topics will be covered at Gyansetu in a classroom/online Instructor-led mode with recordings.

    No prerequisites. This course is for beginners.

    Gyansetu is providing complimentary placement service to all students. Gyansetu Placement Team consistently works on industry collaboration and associations which help our students to find their dream job right after the completion of training.

    • Our placement team will add Big Data skills & projects to your CV and update your profile on Job search engines like Naukri, Indeed, Monster, etc. This will increase your profile visibility in top recruiter search and ultimately increase interview calls by 5x.
    • Our faculty offers extended support to students by clearing doubts faced during the interview and preparing them for the upcoming interviews.
    • Gyansetu’s Students are currently working in Companies like Sapient, Capgemini, TCS, Sopra, HCL, Birlasoft, Wipro, Accenture, Zomato, Ola Cabs, Oyo Rooms, etc.


    • Gyansetu trainers are well known in Industry; who are highly qualified and currently working in top MNCs.
    • We provide interaction with faculty before the course starts.
    • Our experts help students in learning Technology from basics, even if you are not good at basic programming skills, don’t worry! We will help you.
    • Faculties will help you in preparing project reports & presentations.
    • Students will be provided Mentoring sessions by Experts.

Certification

Data Analytics Certification


Structure your learning and get a certificate to prove it.

Projects

    Environment: Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Scala, Kafka, Hive, Amazon AWS, Elastic Search, Zookeeper

    Tools & Techniques used: PySpark MLlib, Spark Streaming, Python (Jupyter Notebook, Anaconda), Machine Learning packages (NumPy, Pandas, Matplotlib, Seaborn, scikit-learn), Random Forest and Gradient Boost, Confusion Matrix, Tableau

    Description: Build a predictive model which will predict fraudulent transactions on PLCC & DC cards on a daily basis. This includes data extraction, then data cleaning, followed by data pre-processing.

    • Pre-processing includes standard scaling, i.e. normalizing the data, followed by cross-validation techniques to check the consistency of the data.
    • In data modeling, we use Decision Trees with Random Forest and Gradient Boost, applying hyperparameter tuning techniques to tune the model.
    • In the end, we evaluate the model by measuring the confusion matrix, reaching an accuracy of 98%; the trained model surfaces the fraudulent transactions on PLCC & DC cards on a Tableau dashboard (a minimal sketch of this pipeline follows).
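    A minimal scikit-learn sketch of the pipeline described above (standard scaling, cross-validation, a Random Forest model and confusion-matrix evaluation). The file name and label column are hypothetical, and GradientBoostingClassifier can be swapped in and tuned the same way.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical extract of card transactions with an "is_fraud" label column
df = pd.read_csv("card_transactions.csv")
X, y = df.drop(columns=["is_fraud"]), df["is_fraud"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Standard scaling (normalising the features)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Cross-validation to check how stable the model is on the training data
rf = RandomForestClassifier(n_estimators=200, random_state=42)
print("CV accuracy:", cross_val_score(rf, X_train_s, y_train, cv=5).mean())

# Fit, predict and evaluate with a confusion matrix
rf.fit(X_train_s, y_train)
pred = rf.predict(X_test_s)
print(confusion_matrix(y_test, pred))
print("Test accuracy:", accuracy_score(y_test, pred))
```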

    Environment: Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Scala, Kafka, Hive

    Tools & Techniques used: Hadoop + HBase + Spark + Flink + Beam + ML stack, Docker & Kubernetes, Kafka, MongoDB, Avro, Parquet

    Description: The aim is to create a Batch/Streaming/ML/WebApp stack where you can test your jobs locally or submit them to the YARN resource manager. We are using Docker to build the environment and Docker Compose to provision it with the required components (next step: Kubernetes). Along with the infrastructure, we check that it works with 4 projects that simply probe that everything works as expected. The boilerplate is based on a sample flight-search web application.

Data Analytics Course in Gurgaon, Delhi - Job Oriented Features

FAQs

    We have seen that getting a relevant interview call is not a big challenge in your case. Our placement team consistently works on industry collaborations and associations, which help our students find their dream job right after the completion of training. We help you prepare your CV by adding relevant projects and skills once 80% of the course is completed. Our placement team will update your profile on job portals; this increases relevant interview calls by 5x.

    Interview selection depends on your knowledge and learning. As per the past trend, the initial 5 interviews are a learning experience of

    • What type of technical questions are asked in interviews?
    • What are their expectations?
    • How should you prepare?


    Our faculty team will constantly support you during interviews. Usually, students get a job after appearing in 6-7 interviews.

    We have seen that getting a technical interview call is a challenge at times. Most of the time you receive sales job calls, backend job calls or BPO job calls. No worries! Our placement team will prepare your CV in such a way that you will have a good number of technical interview calls. We will provide you interview preparation sessions and make you job-ready. Our placement team consistently works on industry collaborations and associations, which help our students find their dream job right after the completion of training. Our placement team will update your profile on job portals; this increases relevant interview calls by 3x.

    Interview selection depends on your knowledge and learning. As per the past trend, the initial 8 interviews are a learning experience of

    • What type of technical questions are asked in interviews?
    • What are their expectations?
    • How should you prepare?


    Our faculty team will constantly support you during interviews. Usually, students get a job after appearing in 6-7 interviews.

    We have seen that getting a technical interview call can be very difficult in this case. Gyansetu provides internship opportunities to non-working students so they have some industry exposure before they appear in interviews. Internship experience adds a lot of value to your CV, and our placement team will prepare your CV in such a way that you will have a good number of interview calls. We will provide you interview preparation sessions and make you job-ready. Our placement team consistently works on industry collaborations and associations, which help our students find their dream job right after the completion of training, and we will update your profile on job portals; this increases relevant interview calls by 3x.

    Interview selection depends on your knowledge and learning. As per the past trend, the initial 8 interviews are a learning experience of

    • What type of technical questions are asked in interviews?
    • What are their expectations?
    • How should you prepare?


    Our faculty team will constantly support you during interviews. Usually, students get a job after appearing in 6-7 interviews.

    Yes, a one-to-one faculty discussion and demo session will be provided before admission. We understand the importance of trust between you and the trainer. We will be happy if you clear all your queries before you start classes with us.

    We understand the importance of every session. Session recordings will be shared with you, and in case of any query, faculty will give you extra time to answer your questions.

    Yes, we understand that self-learning is crucial, and for the same reason we provide students with PPTs, PDFs, class recordings, lab sessions, etc., so that a student can get a good handle on these topics.

    We provide an option to retake the course within 3 months from the completion of your course, so that you get more time to learn the concepts and do the best in your interviews.

    We believe that having fewer students is the best way to pay attention to each student individually, and for that reason our batch size varies between 5-10 people.

    Yes, we have batches available on weekends. We understand many students are in jobs and it's difficult to take time for training on weekdays. Batch timings need to be checked with our counsellors.

    Yes, we have batches available on weekdays, but in limited time slots. Since most of our trainers are working professionals, batches are available either in the morning hours or in the evening hours. You need to contact our counsellors to know more about this.

    Total duration of the course is 200 hours (100 Hours of live-instructor-led training and 100 hours of self-paced learning).

    You don’t need to pay anyone for software installation; our faculty will provide you with all the required software and will assist you in the complete installation process.

    Our faculties will help you in resolving your queries during and after the course.
