Online course detail
Curriculum
Content designed by Microsoft Experts - Machine Learning, Big Data, Power BI
- Statistical learning vs. Machine learning
- Major Classes of Learning Algorithms - Supervised vs Unsupervised Learning
- Different Phases of Predictive Modelling (Data Pre-processing, Sampling, Model Building, Validation)
- Concept of Overfitting and Underfitting (Bias-Variance Trade-off) & Performance Metrics
- Types of Cross-Validation (Train & Test, Bootstrapping, K-Fold Validation, etc.)
- Iteration and Model Evaluation
- Python NumPy (Data Manipulation)
- Python Pandas (Data Extraction & Cleansing)
- Python Matplotlib (Data Visualization)
- Python Scikit-Learn (Data Modelling)
- EDA – Quantitative Technique
- Data Exploration Techniques
- Seaborn | Matplotlib
- Correlation Analysis
- Data Wrangling
- Outlier Values in a Dataset
- Data Manipulation
- Missing & Categorical Data
- Splitting the Data into Training Set & Test Set
- Feature Scaling
- Concept of Overfitting and Underfitting (Bias-Variance Trade-off) & Performance Metrics
- Types of Cross-Validation (Train & Test, Bootstrapping, K-Fold Validation, etc.)
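For reference, a minimal scikit-learn sketch (not taken from the course materials) that ties the preprocessing topics above together: handling missing and categorical data, splitting into training and test sets, feature scaling, and k-fold cross-validation. The column names and values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up dataset with missing numeric and categorical values.
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 45, 29, 51, 38, np.nan, 27, 60],
    "income": [30, 52, 47, np.nan, 39, 80, 61, 44, 33, 95],
    "city":   ["Delhi", "Mumbai", "Delhi", "Pune", np.nan,
               "Mumbai", "Pune", "Delhi", "Mumbai", "Pune"],
    "label":  [0, 1, 0, 1, 0, 1, 1, 0, 0, 1],
})
X, y = df.drop(columns="label"), df["label"]

# Impute + scale the numeric columns, impute + one-hot encode the categorical one.
numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                        ("encode", OneHotEncoder(handle_unknown="ignore"))])
prep = ColumnTransformer([("num", numeric, ["age", "income"]),
                          ("cat", categorical, ["city"])])
model = Pipeline([("prep", prep), ("clf", LogisticRegression())])

# Hold-out split plus 3-fold cross-validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
model.fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))
print("3-fold CV accuracy:", cross_val_score(model, X, y, cv=3).mean())
```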
- Basic Data Structure & Data Types in Python language.
- Working with data frames and Data handling packages.
- Importing Data from various file sources like CSV, TXT, Excel, HDFS and other file types.
- Reading and analysing data, and performing various operations for data analysis.
- Exporting files into different formats.
- Data Visualization and concept of tidy data.
- Handling Missing Information.
- Calls Data Capstone Project
- Finance Project: Perform EDA of stock prices. We will focus on bank stocks (JPMorgan, Bank of America, Goldman Sachs, Morgan Stanley, Wells Fargo) and see how they progressed through the financial crisis all the way to early 2016 (see the sketch below).
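A minimal pandas sketch of how this EDA could start, assuming daily closing prices for 2006 to early 2016 have already been downloaded into local CSV files named like JPM.csv with Date and Close columns (the file layout is an assumption, not part of the course materials):

```python
import matplotlib.pyplot as plt
import pandas as pd

tickers = ["JPM", "BAC", "GS", "MS", "WFC"]   # the five banks listed above

# Assumed layout: one CSV per ticker with Date and Close columns.
closes = pd.concat(
    {t: pd.read_csv(f"{t}.csv", index_col="Date", parse_dates=True)["Close"]
     for t in tickers},
    axis=1,
)

returns = closes.pct_change().dropna()           # daily returns per bank
print(returns.describe())                        # volatility overview
print("worst single day:\n", returns.min())      # the 2008-09 crashes stand out
print("return correlations:\n", returns.corr())  # banks move together in a crisis

closes.plot(title="Bank closing prices, 2006 to early 2016")
plt.show()
```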
- Fundamentals of Descriptive Statistics and Hypothesis Testing (t-test, z-test).
- Probability Distribution and analysis of Variance.
- Correlation and Regression.
- Linear Modeling.
- Advanced Analytics.
- Poisson and Logistic Regression
- Feature Selection
- Principal Component Analysis(PCA)
- Linear Discriminant Analysis (LDA)
- Kernel PCA
- Feature Reduction
- Simple Linear Regression
- Multiple Linear Regression
- Perceptron Algorithm
- Regularization
- Recursive Partitioning (Decision Trees)
- Ensemble Models: Random Forest, Bagging & Boosting (AdaBoost, GBM)
- Ensemble Learning Methods
- Working of AdaBoost
- AdaBoost Algorithm & Flowchart
- Gradient Boosting
- XGBoost
- Polynomial Regression
- Support Vector Regression (SVR)
- Decision Tree Regression
- Evaluating Regression Models Performance
- Logistic Regression
- K-Nearest Neighbours (K-NN)
- Support Vector Machine (SVM)
- Kernel SVM
- Naive Bayes
- Decision Tree Classification
- Random Forest Classification
- Evaluating Classification Models Performance
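As a reference for the classification and evaluation topics above, a small scikit-learn sketch (the dataset choice is ours, not the course's) that trains two of the listed classifiers and reports accuracy and a confusion matrix:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "logistic regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))   # rows: actual, columns: predicted
```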
- K-Means Clustering
- Challenges of Unsupervised Learning and beyond K-Means
- Hierarchical Clustering
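A minimal scikit-learn sketch of K-Means and hierarchical (agglomerative) clustering on synthetic blobs, purely for illustration:

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("K-Means inertia:", round(kmeans.inertia_, 2))
print("first 10 K-Means labels:", kmeans.labels_[:10])

hier = AgglomerativeClustering(n_clusters=3).fit(X)   # bottom-up merging
print("first 10 hierarchical labels:", hier.labels_[:10])
```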
- Purpose of Recommender Systems
- Collaborative Filtering
- Market Basket Analysis
- Collaborative Filtering
- Content-Based Recommendation Engine
- Popularity Based Recommendation Engine
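To make the collaborative-filtering idea concrete, here is a small item-item similarity sketch on a made-up user-item ratings matrix (the movies and ratings are invented):

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

ratings = pd.DataFrame(
    {"Movie A": [5, 4, 0, 1], "Movie B": [4, 5, 1, 0],
     "Movie C": [1, 0, 5, 4], "Movie D": [0, 1, 4, 5]},
    index=["user1", "user2", "user3", "user4"],
)

# Item-item cosine similarity on the rating columns.
item_sim = pd.DataFrame(cosine_similarity(ratings.T),
                        index=ratings.columns, columns=ratings.columns)
print(item_sim.round(2))

# Recommend for user1: score unrated items by similarity to items they rated.
user = ratings.loc["user1"]
scores = item_sim.dot(user)[user == 0].sort_values(ascending=False)
print("recommendations for user1:", list(scores.index))
```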
- Anomaly Detection and Time Series Analysis
- Upper Confidence Bound (UCB)
- Thompson Sampling
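A short NumPy sketch of Thompson Sampling for Bernoulli bandits; the three click-through rates are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = [0.05, 0.12, 0.09]           # hypothetical ad click-through rates
successes = np.ones(len(true_rates))      # Beta(1, 1) priors
failures = np.ones(len(true_rates))

for _ in range(10_000):
    samples = rng.beta(successes, failures)     # one posterior draw per arm
    arm = int(np.argmax(samples))               # play the most promising arm
    reward = rng.random() < true_rates[arm]     # simulate a click
    successes[arm] += reward
    failures[arm] += 1 - reward

print("plays per arm:", (successes + failures - 2).astype(int))
print("estimated rates:", np.round(successes / (successes + failures), 3))
```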
- spaCy Basics
- Tokenization
- Stemming
- Lemmatization
- Stop-Words
- Vocabulary and Matching
- NLP Basics Assessment
- TF-IDF
- Understanding Word Vectors
- Training the Word2Vec Model
- Exploring Pre-trained Models
- POS Basics
- Visualizing POS
- Named Entity Recognition (NER)
- Visualizing NER
- Sentence Segmentation
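A minimal spaCy sketch touching tokenization, lemmatization, stop words, POS tags, named entities and sentence segmentation; it assumes the small English model has been installed with `python -m spacy download en_core_web_sm`:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Gyansetu runs a Power BI and Machine Learning course in Gurgaon. "
          "Students build projects on AWS.")

# Token-level attributes: text, lemma, part of speech, stop-word flag.
for token in doc[:8]:
    print(token.text, token.lemma_, token.pos_, token.is_stop)

print("entities:", [(ent.text, ent.label_) for ent in doc.ents])
print("sentences:", [sent.text for sent in doc.sents])
```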
- Power BI Installation
- Power BI Desktop
- Data Imports into Power BI
- Views in Power BI
- Building a Dashboard
- Publishing Reports
- Creating Dashboard in Power BI
- Power Query
- Power Pivot
- Power View
- Power Map
- Power BI Service
- Power BI Q&A
- Data Management Gateway
- Data Catalog
- Connect to DataSources
- Clean and Transform using Query Editor
- Advanced Datasources and transformations
- Cleaning irregularly formatted data.
- Bins
- Change Datatype of column
- Combine Multiple tables
- Clusters
- Format dates
- Groups
- Hierarchies
- Joins
- Pivot table
- Query Groups
- Split Columns
- Unpivot table
- How Data Model Looks Like
- Database Normalization
- Data Tables vs Lookup Tables
- Primary Key vs Foreign Key
- Relationships vs Merged Table
- Creating Table Relationships
- Creating Snowflake Schemas
- Managing & Editing Relationships
- Active vs Inactive Relationships
- Relationship Cardinality
- Many to Many
- One to One
- Connecting Multiple Data Tables
- Filter Flow
- Two Way Filter
- Two Way Filters Conditions
- Hiding Fields from Report View
- Area Chart
- Bar chart
- Card
- Clustered Bar chart
- Clustered Column chart
- Donut chart
- Funnel chart
- Heat Map
- Line Chart
- Clustered Column and Line chart
- Line and Stacked Column chart
- Matrix
- Multi-Row Card
- Pie chart
- Ribbon chart
- Stacked Area Chart
- Scatter Chart
- Stacked Bar chart
- Waterfall chart
- Map
- Filled Map
- Slicer Basic
- Filters
- Advanced Filters
- Top N Filters
- Filters on Measures
- Page-Level Filters
- Report Level Filters
- Drill through Filters
- Visuals in PowerBI
- Create and Customize simple visualizations
- Combination charts
- Slicers
- Map visualizations
- Matrixes and tables
- Scatter charts
- Waterfall and funnel charts
- Gauges and single-number cards
- Modify colors in charts and visuals
- Z-Order
- Duplicate a report page
- PowerBI Service
- Quick Insights in PowerBI
- Create and configure a dashboard
- Share dashboards with organizations
- Install and configure the personal gateway
- Using Excel Data in PowerBI
- Upload Excel Data to PowerBI
- Import Power View and Power Pivot to PowerBI
- Connect OneDrive for Business to PowerBI
- Introduction to content packs, security, and groups
- Publish PowerBI Desktop reports
- Print and export dashboards and reports
- Manually republish and refresh your data
- Introduction PowerBI Mobile
- Creating groups in PowerBI
- Build content packs
- Use content packs
- Update content packs
- Integrate OneDrive for Business with PowerBI
- Publish to web
- Introduction to DAX
- DAX calculation types
- DAX functions:
a) Aggregate Functions
b) Date Functions
c) Logical Functions
d) String Functions
e) Trigonometric Functions
- Using variables in DAX expressions
- Table relationships and DAX
- DAX tables and filtering
- PowerBI SQL Server Integration
- PowerBI Mysql Integration
- PowerBI Excel Integration
- R Integration with PowerBI Desktop
- Objects & Charts
- Formatting Charts
- Report Interactions
- Bookmarks
- Managing Roles
- Custom Visuals
- Desktop vs Phone Layout
- Introduction to Amazon Elastic MapReduce
- AWS EMR Cluster
- AWS EC2 Instance: Multi-Node Cluster Configuration
- AWS EMR Architecture
- Web Interfaces on Amazon EMR
- Amazon S3
- Executing MapReduce Job on EC2 & EMR
- Apache Spark on AWS, EC2 & EMR
- Submitting Spark Job on AWS
- Hive on EMR
- Available Storage types: S3, RDS & DynamoDB
- Apache Pig on AWS EMR
- Processing NY Taxi Data using SPARK on Amazon EMR
This module will help you understand how to configure Hadoop Cluster on AWS Cloud:
- Common Hadoop ecosystem components
- Hadoop Architecture
- HDFS Architecture
- Anatomy of File Write and Read
- How MapReduce Framework works
- Hadoop high-level Architecture
- MR2 Architecture
- Hadoop YARN
- Hadoop 2.x core components
- Hadoop Distributions
- Hadoop Cluster Formation
This module will help you understand Big Data:
- Configuration files in Hadoop Cluster (FSimage & edit log file)
- Setting up of Single & Multi-node Hadoop Cluster
- HDFS File permissions
- HDFS Installation & Shell Commands
- Daemons of HDFS
- Node Manager
- Resource Manager
- NameNode
- DataNode
- Secondary NameNode
- YARN Daemons
- HDFS Read & Write Commands
- NameNode & DataNode Architecture
- HDFS Operations
- Hadoop MapReduce Job
- Executing MapReduce Job
This module will help you to understand Hadoop & HDFS Cluster Architecture:
- How MapReduce works on HDFS data sets
- MapReduce Algorithm
- MapReduce Hadoop Implementation
- Hadoop 2.x MapReduce Architecture
- MapReduce Components
- YARN Workflow
- MapReduce Combiners
- MapReduce Partitioners
- MapReduce Hadoop Administration
- MapReduce APIs
- Input Split & String Tokenizer in MapReduce
- MapReduce Use Cases on Data sets
This module will help you to understand Hadoop MapReduce framework:
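To make the map, shuffle/sort and reduce phases concrete, here is a minimal Hadoop Streaming word-count sketch in Python (the jar path and HDFS directories in the comment are assumptions, not course defaults):

```python
# wordcount_streaming.py - a minimal Hadoop Streaming sketch of the
# MapReduce flow (map -> shuffle/sort -> reduce). Example invocation
# (jar location and HDFS paths are assumptions):
#   hadoop jar hadoop-streaming.jar \
#       -files wordcount_streaming.py \
#       -mapper "python3 wordcount_streaming.py map" \
#       -reducer "python3 wordcount_streaming.py reduce" \
#       -input /data/input -output /data/output
import sys

def mapper():
    # Emit (word, 1) pairs; Hadoop sorts them by key before the reducer runs.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Keys arrive grouped, so we can sum the counts for one word at a time.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```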
- Hive Installation
- Hive Data types
- Hive Architecture & Components
- Hive Meta Store
- Hive Tables(Managed Tables and External Tables)
- Hive Partitioning & Bucketing
- Hive Joins & Sub Query
- Running Hive Scripts
- Hive Indexing & View
- Hive Queries (HQL); Order By, Group By, Distribute By, Cluster By, Examples
- Hive Functions: Built-in & UDF (User Defined Functions)
- Hive ETL: Loading JSON, XML, Text Data Examples
- Hive Querying Data
- Hive Tables (Managed & External Tables)
- Hive Use Cases
- Hive Optimization Techniques
- Partitioning (Static & Dynamic Partitions) & Bucketing
- Hive Joins: Map + BucketMap + SMB (Sorted Bucket Map) + Skew
- Hive File Formats (ORC + SEQUENCE + TEXT + AVRO + PARQUET)
- CBO
- Vectorization
- Indexing (Compact + BitMap)
- Integration with TEZ & Spark
- Hive SerDe (Custom + Built-in)
- Hive integration NoSQL (HBase + MongoDB + Cassandra)
- Thrift API (Thrift Server)
- Hive LATERAL VIEW
- Incremental Updates & Import in Hive
- Hive Functions: 1) LATERAL VIEW EXPLODE 2) LATERAL VIEW JSON_TUPLE, and others
- Hive SCD Strategies: 1) Type 1, 2) Type 2, 3) Type 3
- UDF, UDTF & UDAF
- Hive Multiple Delimiters
- XML & JSON Data Loading HIVE.
- Aggregation & Windowing Functions in Hive
- Hive integration NoSQL(HBase + MongoDB + Cassandra)
- Hive Connect with Tableau
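For reference, a small sketch of partitioning and a windowing function from the Hive topics above, run through PySpark's Hive support rather than the Hive CLI. It assumes Spark is built with Hive support; the table and column names are made up:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-demo")
         .enableHiveSupport()        # requires a Spark build with Hive support
         .getOrCreate())

# A partitioned, ORC-backed managed table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (order_id INT, amount DOUBLE)
    PARTITIONED BY (country STRING)
    STORED AS ORC
""")

# Static-partition insert.
spark.sql("""
    INSERT INTO sales PARTITION (country='IN')
    VALUES (1, 120.0), (2, 340.0), (3, 90.0)
""")

# Windowing/aggregation: rank orders by amount within each country partition.
spark.sql("""
    SELECT country, order_id, amount,
           RANK() OVER (PARTITION BY country ORDER BY amount DESC) AS rnk
    FROM sales
""").show()
```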
- Sqoop Installation
- Loading Data from RDBMS using Sqoop
- Fundamentals & Architecture of Apache Sqoop
- Sqoop Tools
- Sqoop Import & Import-All-Table
- Sqoop Job
- Sqoop Codegen
- Sqoop Incremental Import & Incremental Export
- Sqoop Merge
- Sqoop : Hive Import
- Sqoop Metastore
- Sqoop Export
- Import Data from MySQL to Hive using Sqoop
- Sqoop Use Cases
- Sqoop- HCatalog Integration
- Sqoop Script
- Sqoop Connectors
- Batch Processing in Sqoop
- Sqoop Incremental Import
- Boundary Queries in Sqoop
- Controlling Parallelism in Sqoop
- Import Join Tables from SQL databases to Warehouse using Sqoop
- Sqoop Hive/HBase/HDFS integration
- Pig Architecture
- Pig Installation
- Pig Grunt shell
- Pig Running Modes
- Pig Latin Basics
- Pig LOAD & STORE Operators
- Diagnostic Operators
- DESCRIBE Operator
- EXPLAIN Operator
- ILLUSTRATE Operator
- DUMP Operator
- Grouping & Joining
- GROUP Operator
- COGROUP Operator
- JOIN Operator
- CROSS Operator
- Combining & Splitting
- UNION Operator
- SPLIT Operator
- Filtering
- FILTER Operator
- DISTINCT Operator
- FOREACH Operator
- Sorting
- ORDER BY Operator
- LIMIT Operator
- Built-in Functions
- EVAL Functions
- LOAD & STORE Functions
- Bag & Tuple Functions
- String Functions
- Date-Time Functions
- MATH Functions
- Pig UDFs (User Defined Functions)
- Pig Scripts in Local Mode
- Pig Scripts in MapReduce Mode
- Analysing XML Data using Pig
- Pig Use Cases (Data Analysis on Social Media sites, Banking, Stock Market & Others)
- Analysing JSON data using Pig
- Testing Pig Scripts
- KAFKA Confluent HUB
- KAFKA Confluent Cloud
- KStream APIs
- Difference between Apache Kafka & Confluent Kafka
- KSQL (SQL Engine for Kafka)
- Developing Real-time application using KStream APIs
- Kafka Connectors
- Kafka REST Proxy
- Kafka Offsets
- Oozie Introduction
- Oozie Workflow Specification
- Oozie Coordinator Functional Specification
- Oozie HCatalog Integration
- Oozie Bundle Jobs
- Oozie CLI Extensions
- Automate MapReduce, Pig, Hive, Sqoop Jobs using Oozie
- Packaging & Deploying an Oozie Workflow Application
- Apache Airflow Installation
- Work Flow Design using Airflow
- Airflow DAG
- Module Import in Airflow
- Airflow Applications
- Docker Airflow
- Airflow Pipelines
- Airflow KUBERNETES Integration
- Automating Batch & Real Time Jobs using Airflow
- Data Profiling using Airflow
- Airflow Integration:
- AWS EMR
- AWS S3
- AWS Redshift
- AWS DynamoDB
- AWS Lambda
- AWS Kinesis
- Scheduling of PySpark Jobs using Airflow
- Airflow Orchestration
- Airflow Schedulers & Triggers
- Gantt Chart in Apache Airflow
- Executors in Apache Airflow
- Airflow Metrics
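A minimal Airflow DAG sketch for the workflow-design topics above, assuming Airflow 2.x; the dag_id, schedule and task commands are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_batch_pipeline",        # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract",
                           bash_command="echo extracting data")
    transform = BashOperator(task_id="transform",
                             bash_command="echo running pyspark job")
    load = BashOperator(task_id="load",
                        bash_command="echo loading to warehouse")

    extract >> transform >> load   # run the tasks in sequence
```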
- Spark RDDs Actions & Transformations.
- Spark SQL: Connecting to various relational sources & converting them into DataFrames using Spark SQL.
- Spark Streaming
- Understanding the role of RDDs
- Spark Core Concepts: Creating RDDs - Parallelized RDDs, MappedRDD, HadoopRDD, JdbcRDD.
- Spark Architecture & Components.
- AWS Lambda:
- AWS Lambda Introduction
- Creating Data Pipelines using AWS Lambda & Kinesis
- AWS Lambda Functions
- AWS Lambda Deployment
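A minimal AWS Lambda handler sketch for the Lambda-plus-Kinesis pipeline topic above; it assumes the function is triggered by a Kinesis stream, whose records arrive base64-encoded, and that payloads are JSON:

```python
import base64
import json

def lambda_handler(event, context):
    """Process records from a Kinesis-triggered invocation."""
    processed = 0
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        message = json.loads(payload)          # assumes JSON payloads
        print("received:", message)            # goes to CloudWatch Logs
        processed += 1
    return {"processed": processed}
```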
- AWS GLUE :
- GLUE Context
- AWS Data Catalog
- AWS Athena
- AWS QuickSight
- AWS Kinesis
- AWS S3
- AWS Redshift
- AWS EMR & EC2
- AWS ECR & AWS Kubernetes
- How to manage & Monitor Apache Spark on Kubernetes
- Spark Submit Vs Kubernetes Operator
- How Spark Submit works with Kubernetes
- How Kubernetes Operator for Spark Works.
- Setting up of Hadoop Cluster on Docker
- Deploying MR, Sqoop & Hive Jobs inside a Dockerized Hadoop environment.
1) Docker Installation
2) Docker Hub
3) Docker Images
4) Docker Containers & Shells
5) Working with Docker Containers
6) Docker Architecture
7) Docker Push & Pull containers
8) Docker Container & Hosts
9) Docker Configuration
10) Docker Files (DockerFile)
11) Docker Building Files
12) Docker Public Repositories
13) Docker Private Registries
14) Building WebServer using DockerFile
15) Docker Commands
16) Docker Container Linking
17) Docker Storage
18) Docker Networking
19) Docker Cloud
20) Docker Logging
21) Docker Compose
22) Docker Continuous Integration
23) Docker Kubernetes Integration
24) Docker Working of Kubernetes
25) Docker on AWS
1) Overview
2) Learn Kubernetes Basics
3) Kubernetes Installation
4) Kubernetes Architecture
5) Kubernetes Master Server Components
a) etcd
b) kube-apiserver
c) kube-controller-manager
d) kube-scheduler
e) cloud-controller-manager
6) Kubernetes Node Server Components
a) A container runtime
b) kubelet
c) kube-proxy
7) Kubernetes Objects & Workloads
a) Kubernetes Pods
b) Kubernetes Replication Controller & Replica Sets
8) Kubernetes Images
9) Kubernetes Labels & Selectors
10) Kubernetes Namespace
11) Kubernetes Service
12) Kubernetes Deployments
a) Stateful Sets
b) Daemon Sets
c) Jobs & Cron Jobs
13) Other Kubernetes Components:
a) Services
b) Volume & Persistent Volumes
c) Labels, Selectors & Annotations
d) Kubernetes Secrets
e) Kubernetes Network Policy
14) Kubernetes API
15) Kubernetes Kubectl
16) Kubernetes Kubectl Commands
17) Kubernetes Creating an App
18) Kubernetes App Deployment
19) Kubernetes Autoscaling
20) Kubernetes Dashboard Setup
21) Kubernetes Monitoring
22) Federation using kubefed
- Creating Data Pipelines using Cloud & On-Premise Infrastructures
- Hybrid Infrastructures
- Automation Test Cases (TDD & BDD Test Cases for Spark Applications)
- Data Security
- Data Governance
- Deployment Automation using CI/CD Pipelines with Docker & Kubernetes.
- Designing Batch & Real-Time Applications
- Latency & Optimization in Real-Time Applications
- Resolving Latency & Optimization issues using other Streaming Platforms
- Learn about big data and see examples of how data science can leverage big data
- Performing Data Science and Preparing Data – explore data science definitions and topics, and the process of preparing data
Overview of Hadoop & Apache PySpark
This will cover the introduction to the Hadoop ecosystem with a little insight into Apache Hive.
Introduction to Big Data and Data Science
1) Introduction
2) PySpark Installation
3) PySpark & its Features
> Speed
> Reusability
> Advanced Analytics
> In-Memory Computation
> Real-Time Stream Processing
> Lazy Evaluation
> Dynamic in Nature
> Immutability
> Partitioning
4) PySpark with Hadoop
5) PySpark Components
6) PySpark Architecture
1) RDD Overview
2) RDD Types (Ways to Create RDDs):
- Parallelized RDDs (PairRDDs) - RDDs from collection objects, e.g. applying map() & flatMap() so that we can then use groupByKey() and similar pair operations
- RDDs from External Datasets (CSV, JSON, XML, ...)
- rowRDD
- schemaRDD (adding a schema to an RDD, particularly for DataFrames)
- Basic operations like map() & flatMap() to convert an RDD into a pair RDD
3) RDD Operations:
- Actions
- Transformations:
Narrow Transformations: map(), flatMap(), mapPartitions(), filter(), sample(), union()
Wide Transformations: intersect(), distinct(), reduceByKey(), groupByKey(), join(), cartesian(), repartition(), coalesce()
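A tiny PySpark sketch contrasting a narrow transformation (map, no shuffle) with a wide one (reduceByKey, which shuffles by key), matching the two groups above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["big", "data", "big", "spark", "data", "big"])

pairs = words.map(lambda w: (w, 1))             # narrow: no shuffle needed
counts = pairs.reduceByKey(lambda a, b: a + b)  # wide: shuffles data by key

print(counts.collect())   # e.g. [('big', 3), ('data', 2), ('spark', 1)]
```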
Passing Functions to PySpark
Working with Key-Value Pairs (pairRDDs)
ShuffledRDD: Shuffle Operations Background & Performance Impact
RDD Persistence & Unpersist
PySpark Deployment :
1) Local
2) Cluster Modes:
a) Client mode
b) Cluster mode
Shared Variables:
1) Broadcast Variables
2) Accumulators
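A short sketch of both shared-variable types: a broadcast variable (a read-only lookup shipped to executors) and an accumulator (a counter aggregated back on the driver). The lookup data is made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-vars").getOrCreate()
sc = spark.sparkContext

country_names = sc.broadcast({"IN": "India", "US": "United States"})
unknown_codes = sc.accumulator(0)

def expand(code):
    if code not in country_names.value:
        unknown_codes.add(1)                 # counted on the driver after the action
    return country_names.value.get(code, "Unknown")

result = sc.parallelize(["IN", "US", "XX", "IN"]).map(expand).collect()
print(result)                        # ['India', 'United States', 'Unknown', 'India']
print("unknown codes:", unknown_codes.value)
```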
Launching PySpark Jobs from Java/Scala
Unit Test Cases
PySpark API :
1) PySpark Core Module
2) PySpark SQL Module
3) PySpark Streaming Module
4) PySpark ML Module
Integrating with various Datasources (SQL+ NoSQL + Cloud)
- An introduction to the PySpark framework in cluster computing space.
- Reading data from different PySpark sources.
- PySpark Dataframes Actions & Transformations.
- PySpark SQL & Dataframes
- Basic Hadoop functionality
- Windowing functions in Hive
- PySpark Architecture and Components
- PySpark SQL and Dataframes
- PySpark Data frames
- PySpark SQL
- Running PySpark on cluster
- PySpark Performance Tips
- Inferring Schema using Reflection
- Programmatically Specifying the Schema
- Untyped User-Defined Aggregate Functions
- Type-Safe User-Defined Aggregate Functions
Dataframes:
1) Overview
2) Creating Dataframes using PySparkSession
3) Untyped Transformations
4) Running SQL Queries Programmatically
5) Global Temporary View
6) Interoperating with RDDs:
7) Aggregations:
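A minimal sketch of creating a DataFrame with SparkSession, registering temporary and global views, and running SQL programmatically (the sample rows are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-demo").getOrCreate()

df = spark.createDataFrame(
    [("Asha", "Delhi", 28), ("Ravi", "Mumbai", 34), ("Meena", "Delhi", 41)],
    ["name", "city", "age"],
)

df.groupBy("city").count().show()          # untyped DataFrame transformation

df.createOrReplaceTempView("people")       # session-scoped temp view
spark.sql("SELECT city, AVG(age) AS avg_age FROM people GROUP BY city").show()

df.createGlobalTempView("people_global")   # global temp view, shared across sessions
spark.sql("SELECT COUNT(*) FROM global_temp.people_global").show()
```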
- Manually Specifying Options
- Run SQL on files directly (avro/parquet files)
- Saving to Persistent Tables (persistent tables in the Hive Metastore)
- Bucketing, Sorting & Partitioning (repartition() & coalesce())
- Loading Data Programmatically
- Partition Discovery
- Schema Merging
- Hive Metastore Parquet Table Conversion
- Hive Parquet/Schema Reconciliation
- Metadata Refreshing
- Specifying storage format for Hive Tables
- Interacting with various versions of Hive Metastore.
- Only for DataBricks Cloud (Using WatchDog)
- Caching Data in Memory
- Other Configurations like:
- BHJ (BroadcastHashJoin)
- Serializers
- Running the Thrift JDBC/ODBC server
- Running the PySpark SQL CLI (Accessing PySpark SQL using SQL shell)
- Apache Arrow in PySpark (to convert PySpark DataFrames into pandas DataFrames)
- Conversion to/from Pandas
- Pandas UDFs
1) Generic LOAD/Save Functions
2) Parquet Files (incl. Configuration)
3) ORC Files
4) JSON Datasets
5) Hive Tables:
a) JDBC to other databases (SQL+ NoSQL databases)
b) Troubleshooting
c) PySpark SQL Optimizer
d) Transforming Complex Datatypes
e) Handling BadRecords & Files
f) Task Preemption for High Concurrency (optional, Databricks Cloud only)
g) Handling Large Queries in Interactive Flows
h) Skew Join Optimization
i) PySpark UDF Scala
j) PySpark UDF Python
k) PySpark UDAF Scala
l) Performance Tuning/Optimization of PySpark Jobs:
>maxPartitionBytes
>openCostInBytes
>broadcastTimeout
>autoBroadcastJoinThreshold
>partitions
m) Distributed SQL Engine: (Accessing PySpark SQL using JDBC/ODBC using Thrift API)
n) PySpark for Pandas using Apache Arrow:
o) PySpark SQL compatibility with Apache Hive
p) Working with DataFrames Python & Scala
q) Connectivity with various DataSources
a) Introduction to DataSets
b) DataSet API - DataSets Operators
c) DataSet API - Typed Transformations
d) Datasets API - Basic Actions
e) DataSet API - Structured Query with Encoder
f) Windowing & Aggregation DataSet API
g) DataSet Checking & Persistence
h) DataSet Checkpointing
i) Datasets vs DataFrames vs RDDs
- Linking
- Initialize Streaming Context
- Discretized Streams (DStreams)
- Input DStreams & Receivers
- Transformations on DStreams
- Output Operations on DStreams : print(), saveAsTextFile(), saveAsObjectFiles(), saveAsHadoopFiles(), foreachRDD()
- Dataframe & SQL Operations on DStreams (Converting DStreams into Dataframe)
- Streaming Machine Learning on streaming data
- Caching/Persistence
- Checkpointing
- Accumulators, Broadcast Variables, and Checkpoints
- Deploying PySpark Streaming Applications.
- Monitoring PySpark Streaming Applications
- Reducing the Batch Processing Times
- Setting the Right Batch Interval
- Memory Tuning
- Kafka Integration
- Kinesis Integration
- Flume Integration
a) Overview
b) Example to demonstrate DStreams
c) PySpark Streaming Basic Concepts:
d) Performance Tuning PySpark Streaming Applications:
e) Fault Tolerance Semantics
f) Integration:
g) Custom Receivers: creating a client/server application with PySpark Streaming.
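A minimal sketch of the classic DStream API (word counts over a socket source), assuming text is being sent to localhost:9999, for example with `nc -lk 9999`; the checkpoint path is a placeholder:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-demo")
ssc = StreamingContext(sc, batchDuration=5)      # 5-second micro-batches
ssc.checkpoint("/tmp/streaming-checkpoint")      # enables stateful operations

lines = ssc.socketTextStream("localhost", 9999)  # input DStream + receiver
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                  # output operation per batch

ssc.start()
ssc.awaitTermination()
```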
- Summary List of Operators
- Property Operators
- Structural Operators
- Join Operators
- Neighborhood Aggregation:
- VertexRDDs
- EdgeRDDs
- PageRank
- Connected Components
- Triangle Counting
a) Overview
b) Example to demonstrate GraphX
c) PropertyGraph: Example Property Graph
d) Graph Operators:
>Aggregation Messages (aggregate messages)
>Map Reduce Triplets Transition Guide (Legacy)
>Computing Degree Information
>Collecting Neighbours
>Caching & Uncaching
e) Pregel API
f) Graph Builders
g) Vertex & Edge RDDs:
h) Optimized Representation
i) Graph Algorithms
j) Examples
k) GraphFrames & GraphX
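GraphX itself is Scala/Java only; from Python the usual route is the GraphFrames package mentioned above. A hedged sketch, assuming the graphframes Spark package has been added (for example via `pyspark --packages graphframes:graphframes:0.8.2-spark3.0-s_2.12`); the vertices and edges are made up:

```python
from graphframes import GraphFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("graph-demo").getOrCreate()

vertices = spark.createDataFrame(
    [("a", "Asha"), ("b", "Ravi"), ("c", "Meena")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"])

g = GraphFrame(vertices, edges)
g.pageRank(resetProbability=0.15, maxIter=10).vertices.show()  # PageRank scores
g.triangleCount().show()                                       # triangles per vertex
```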
- How to manage & Monitor Apache Spark on Kubernetes
- Spark Submit Vs Kubernetes Operator
- How Spark Submit works with Kubernetes
- How Kubernetes Operator for Spark Works.
- Setting up of Hadoop Cluster on Docker
- Deploying MR, Sqoop & Hive Jobs inside a Dockerized Hadoop environment.
- Building & running applications using PySpark API.
- PySpark SQL with MySQL (JDBC) source:
- Now that we have PySpark SQL experience with CSV and JSON, connecting and using a MySQL database will be easy. So, let’s cover how to use PySpark SQL with Python and a MySQL database input data source.
- Overview
- We’re going to load some NYC Uber data into a database. Then, we’re going to fire up PySpark with a command-line argument to specify the JDBC driver needed to connect to the JDBC data source. We’ll make sure we can authenticate and then start running some queries.
We will also go through some related concepts like caching and UDFs.
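A hedged sketch of the steps described above: start PySpark with the MySQL JDBC driver on the classpath, read a table into a DataFrame, and query it. The database name, table, credentials and driver version are placeholders, not values from the course:

```python
# Launch with the connector jar available, e.g. (version is an assumption):
#   pyspark --jars mysql-connector-java-8.0.33.jar
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("uber-jdbc").getOrCreate()

trips = (spark.read.format("jdbc")
         .option("url", "jdbc:mysql://localhost:3306/uber")   # placeholder DB
         .option("dbtable", "trips")                          # placeholder table
         .option("user", "analyst")
         .option("password", "secret")
         .option("driver", "com.mysql.cj.jdbc.Driver")
         .load())

trips.createOrReplaceTempView("trips")
spark.sql("SELECT base, COUNT(*) AS n FROM trips "
          "GROUP BY base ORDER BY n DESC").show()
trips.cache()   # caching, one of the concepts mentioned above
```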
A real-world business case built on top of PySpark API.
1) Web Server Log Analysis with Apache PySpark – use PySpark to explore a NASA Apache web server log.
2) Introduction to Machine Learning with Apache PySpark – use PySpark’s MLlib Machine Learning library to perform collaborative filtering on a movie dataset
3) PySpark SQL with New York City Uber Trips CSV Source: PySpark SQL uses a type of Resilient Distributed Dataset called DataFrames which are composed of Row objects accompanied by a schema. The schema describes the data types of each column. A DataFrame may be considered similar to a table in a traditional relational database.
Methodology
We’re going to use the Uber dataset and the PySpark-CSV package available from PySpark Packages to make our lives easier. The PySpark-CSV package is described as a “library for parsing and querying CSV data with Apache PySpark, for PySpark SQL and DataFrames”. This library is compatible with PySpark 1.3 and above.
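For reference, a minimal version of this project with a current PySpark release, where the CSV reader is built in so the separate CSV package is optional; the file name and column names here are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("uber-csv").getOrCreate()

uber = (spark.read
        .option("header", True)        # first line holds column names
        .option("inferSchema", True)   # let Spark infer the column types
        .csv("uber_trips.csv"))        # placeholder file name

uber.printSchema()                     # each Row follows this inferred schema
uber.createOrReplaceTempView("uber")
spark.sql("""
    SELECT dispatching_base_number, SUM(trips) AS total_trips
    FROM uber
    GROUP BY dispatching_base_number
    ORDER BY total_trips DESC
""").show()
```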
PySpark SQL with New York City Uber Trips CSV Source
Software Required :
· VMWare Workstation
· Ubuntu ISO Image setup on Virtual Environment(VMWare)
· Cloudera Quickstart VM (version: 5.3.0)
· Putty Client
· WinSCP
· Hadoop software version 2.6.6
· PySpark 2.x
Course Description
Data Analytics Certification is designed to build your expertise in Data Science concepts like Machine Learning, Python, Big Data Analytics using Spark, and reporting tools like Power BI and Tableau. The course is designed for professionals looking for a 360-degree transition into Data Engineering, Data Analytics and Data Visualization.
- Understand Scala & Apache Spark implementation
- Spark operations on Spark Shell
- Spark Driver & its related Worker Nodes
- Spark + Flume Integration
- Setting up Data Pipeline using Apache Flume, Apache Kafka & Spark Streaming
- Spark RDDs and Spark Streaming
- Spark MLlib: Creating Classifiers & Recommendation Systems using MLlib
- Spark Core Concepts: Creating RDDs - Parallelized RDDs, MappedRDD, HadoopRDD, JdbcRDD.
- Spark Architecture & Components
- Spark SQL experience with CSV , XML & JSON
- Reading data from different Spark sources
- Spark SQL & Dataframes
- Develop and implement various Machine Learning algorithms in daily practice & live environments
- Building Recommendation systems and Classifiers
- Perform various type of Analysis (Prediction & Regression)
- Implement plotting & graphs using various Machine Learning Libraries
- Import data from HDFS & Implement various Machine Learning Models
- Building different Neural networks using NumPy and TensorFlow
- Power BI Visualization
- Power BI Components
- Power BI Transformations
- DAX functions
- Data Exploration and Mapping
- Designing Dashboards
- Time Series, Aggregation & Filters
The skills listed above are what you will be able to apply after completing the Data Analytics Course.
We at Gyansetu understand that teaching any course is not difficult, but making someone job-ready is the essential task. That's why we have prepared capstone projects which will drive your learning through real-time industry scenarios and help you clear interviews.
All the advanced level topics will be covered at Gyansetu in a classroom/online Instructor-led mode with recordings.
No prerequisites. This course is for beginners.
- Our placement team will add Data Analytics skills & projects to your CV and update your profile on Job search engines like Naukri, Indeed, Monster, etc. This will increase your profile visibility in top recruiter search and ultimately increase interview calls by 5x.
- Our faculty offers extended support to students by clearing doubts faced during the interview and preparing them for the upcoming interviews.
- Gyansetu’s Students are currently working in Companies like Sapient, Capgemini, TCS, Sopra, HCL, Birlasoft, Wipro, Accenture, Zomato, Ola Cabs, Oyo Rooms, etc.
Gyansetu is providing complimentary placement service to all students. Gyansetu Placement Team consistently works on industry collaboration and associations which help our students to find their dream job right after the completion of training.
- Gyansetu trainers are well known in Industry; who are highly qualified and currently working in top MNCs.
- We provide interaction with faculty before the course starts.
- Our experts help students in learning Technology from basics, even if you are not good at basic programming skills, don’t worry! We will help you.
- Faculties will help you in preparing project reports & presentations.
- Students will be provided Mentoring sessions by Experts.
Certification
Data Analytics Certification
Reviews
Placement
Preeti
Placed In:
Accenture
Placed On – July 15, 2019
Review:
I loved the 1-1 doubt clearing session from Gyansetu and the best part is you can have it scheduled as many times as you like for every challenge you have.
Pratik
Placed In:
Aon
Placed On – June 19, 2017
Review:
The courses are value for money, with 24*7 support. The instructors were also helpful enough to solve all the queries during sessions.
Khalid
Placed In:
Airtel
Placed On – December 25, 2018
Review:
They have entire road map for a learning professional/individual that will help achieve their career goal.
Enroll Now
Structure your learning and get a certificate to prove it.
Projects
- Pre-processing includes standard scaling, i.e. normalizing the data, followed by cross-validation techniques to check the compatibility of the data.
- In data modelling, using Decision Tree with Random Forest and Gradient Boost hyperparameter-tuning techniques to tune our model.
- In the end, evaluating the model by measuring the confusion matrix, with an accuracy of 98%, and a trained model which will show all the fraud transactions on PLCC & DC cards on the Tableau dashboard.
Environment: Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Scala, Kafka, Hive, Amazon AWS, Elastic Search, Zookeeper
Tools & Techniques used: PySpark MLlib, Spark Streaming, Python (Jupyter Notebook, Anaconda), Machine Learning packages: NumPy, Pandas, Matplotlib, Seaborn, Sklearn, Random Forest and Gradient Boost, Confusion Matrix, Tableau
Description: Build a predictive model which will predict fraud transactions on PLCC & DC cards on a daily basis. This includes data extraction, then data cleaning, followed by data pre-processing.
Environment: Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Scala, Kafka, Hive
Tools & Techniques used : Hadoop+HBase+Spark+Flink+Beam+ML stack, Docker & KUBERNETES, Kafka, MongoDB, AVRO, Parquet
Description: The aim is to create a Batch/Streaming/ML/WebApp stack where you can test your jobs locally or submit them to the YARN resource manager. We are using Docker to build the environment and Docker Compose to provision it with the required components (next step: Kubernetes). Along with the infrastructure, we check that it works with 4 projects that probe that everything is working as expected. The boilerplate is based on a sample flight-search web application.
Data Analytics Course in Gurgaon, Delhi - Job Oriented Features
FAQs
- What type of technical questions are asked in interviews?
- What are their expectations?
- How should you prepare?
We have seen getting a relevant interview call is not a big challenge in your case. Our placement team consistently works on industry collaboration and associations which help our students to find their dream job right after the completion of training. We help you prepare your CV by adding relevant projects and skills once 80% of the course is completed. Our placement team will update your profile on Job Portals, this increases relevant interview calls by 5x.
Interview selection depends on your knowledge and learning. As per the past trend, the initial 5 interviews are a learning experience.
Our faculty team will constantly support you during interviews. Usually, students get a job after appearing in 6-7 interviews.
- What type of technical questions are asked in interviews?
- What are their expectations?
- How should you prepare?
We have seen that getting a technical interview call is a challenge at times. Most of the time you receive sales job calls, backend job calls or BPO job calls. No worries! Our placement team will prepare your CV in such a way that you will have a good number of technical interview calls. We will provide you interview preparation sessions and make you job-ready. Our placement team consistently works on industry collaborations and associations which help our students find their dream job right after the completion of training. Our placement team will update your profile on job portals; this increases relevant interview calls by 3x.
Interview selection depends on your knowledge and learning. As per the past trend, the initial 8 interviews are a learning experience.
Our faculty team will constantly support you during interviews. Usually, students get a job after appearing in 6-7 interviews.
- What type of technical questions are asked in interviews?
- What are their expectations?
- How should you prepare?
We have seen that getting a technical interview call is hardly possible at this stage. Gyansetu provides internship opportunities to non-working students so they have some industry exposure before they appear in interviews. Internship experience adds a lot of value to your CV, and our placement team will prepare your CV in such a way that you will have a good number of interview calls. We will provide you interview preparation sessions and make you job-ready. Our placement team consistently works on industry collaborations and associations which help our students find their dream job right after the completion of training, and we will update your profile on job portals; this increases relevant interview calls by 3x.
Interview selection depends on your knowledge and learning. As per the past trend, the initial 8 interviews are a learning experience.
Our faculty team will constantly support you during interviews. Usually, students get a job after appearing in 6-7 interviews.
Yes, a one-to-one faculty discussion and demo session will be provided before admission. We understand the importance of trust between you and the trainer. We will be happy if you clear all your queries before you start classes with us.
We understand the importance of every session. Sessions recording will be shared with you and in case of any query, faculty will give you extra time to answer your queries.
Yes, we understand that self-learning is most crucial and for the same we provide students with PPTs, PDFs, class recordings, lab sessions, etc, so that a student can get a good handle of these topics.
We provide an option to retake the course within 3 months from the completion of your course, so that you get more time to learn the concepts and do the best in your interviews.
We believe in the concept that having less students is the best way to pay attention to each student individually and for the same our batch size varies between 5-10 people.
Yes, we have batches available on weekends. We understand many students are in jobs and it's difficult to take time for training on weekdays. Batch timings need to be checked with our counsellors.
Yes, we have batches available on weekdays but in limited time slots. Since most of our trainers are working, so either the batches are available in morning hours or in the evening hours. You need to contact our counsellors to know more on this.
Total duration of the course is 200 hours (100 Hours of live-instructor-led training and 100 hours of self-paced learning).
You don’t need to pay anyone for software installation; our faculty will provide you with all the required software and will assist you in the complete installation process.
Our faculties will help you in resolving your queries during and after the course.