Gyansetu's certified course on Big Data Hadoop starts from the basics and moves gradually toward advanced topics, so that you eventually gain a working command of Big Data analytics. We understand that Big Data can be a daunting subject, so we at Gyansetu have divided it into an easily digestible format that covers all the key aspects of Big Data.
Why Gyansetu is Best?
- Gyansetu trainers are well known in the industry: highly qualified professionals working in MNCs, with wide experience in the training industry.
- We provide interaction with faculty before the course starts.
- Our Train the Trainer approach ensures you learn proactively and come out as an expert.
- We are open seven days a week and provide 24×7 Lab Support Services.
Hadoop Training Course Details
Gyansetu Big Data Training in Gurgaon will help you understand the core principles of Big Data analytics and gain expertise in analyzing large datasets from various sources:
1. Concepts of MapReduce framework & HDFS filesystem.
2. Setting up of Single & Multi-Node Hadoop cluster.
3. Understanding HDFS architecture.
4. Writing MapReduce programs & logic
5. Loading data from structured sources using Sqoop.
6. Understanding Flume Configuration used for data loading.
7. Data Analytics using Pig.
8. Understanding Hive for data analytics.
9. Scheduling MapReduce, Pig, Hive & Sqoop jobs using Oozie.
10. Understanding Kafka messaging system.
11. MapReduce and HBase Integration.
12. Spark Introduction.
13. Understanding Spark Ecosystem.
14. RDD Actions & Transformations.
15. Understanding Spark Architecture.
16. Spark SQL & Streaming Modules in Spark Ecosystem.
17. Live Projects on Big Data Analytics.
18. AWS Cloud
19. Deploying Big Data applications in a production environment using Docker & Kubernetes
Who should go for the Hadoop Course?
The Big Data market is growing rapidly, data volumes are increasing day by day, and IT will need expert Big Data professionals in the coming years. The course will be helpful for IT professionals working as:
1. Testing professionals
2. Senior IT Professionals
3. BI /ETL/DW professionals
4. Developers and Architects
5. Mainframe professionals
Pre-requisites for the Big Data Hadoop Training Course?
There are no pre-requisites. Knowledge of Java/Python, SQL & Linux will be beneficial, but is not mandatory. Gyansetu provides a crash course covering the pre-requisites needed to start the Big Data training.
Apache Hadoop on AWS Cloud
This module will help you understand how to configure a Hadoop cluster on the AWS Cloud:
- Introduction to Amazon Elastic MapReduce
- AWS EMR Cluster
- AWS EC2 Instance: Multi Node Cluster Configuration
- AWS EMR Architecture
- Web Interfaces on Amazon EMR
- Amazon S3
- Executing MapReduce Job on EC2 & EMR
- Apache Spark on AWS, EC2 & EMR
- Submitting Spark Job on AWS
- Hive on EMR
- Available Storage types: S3, RDS & DynamoDB
- Apache Pig on AWS EMR
- Processing NY Taxi Data using Spark on Amazon EMR
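To make this concrete, here is a minimal sketch (the cluster ID, bucket & script paths are placeholders) of submitting a Spark step to a running EMR cluster with boto3; the same pattern works for Hive or Pig steps by changing the command-runner arguments:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",          # hypothetical cluster ID
    Steps=[{
        "Name": "spark-wordcount",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",   # EMR's generic step runner
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://my-bucket/jobs/wordcount.py",   # hypothetical script
                "s3://my-bucket/input/",
                "s3://my-bucket/output/",
            ],
        },
    }],
)
print(response["StepIds"])
```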
1. Learning Big Data and Hadoop
This module will help you understand Big Data:
- Common Hadoop ecosystem components
- Hadoop Architecture
- HDFS Architecture
- Anatomy of File Write and Read
- How MapReduce Framework works
- Hadoop high level Architecture
- MR2 Architecture
- Hadoop YARN
- Hadoop 2.x core components
- Hadoop Distributions
- Hadoop Cluster Formation
2. Hadoop Architecture and HDFS
This module will help you to understand Hadoop & HDFS Cluster Architecture:
- Configuration & metadata files in a Hadoop Cluster (fsimage & edit log)
- Setting up of Single & Multi node Hadoop Cluster
- HDFS File permissions
- HDFS Installation & Shell Commands
- Daemons of HDFS
- Node Manager
- Resource Manager
- Secondary NameNode
- YARN Daemons
- HDFS Read & Write Commands
- NameNode & DataNode Architecture
- HDFS Operations
- Hadoop MapReduce Job
- Executing MapReduce Job
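As a quick taste of the HDFS shell commands above, a minimal sketch (paths are illustrative; assumes the `hdfs` CLI is on PATH and the cluster is reachable) that drives `hdfs dfs` from Python:

```python
import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` subcommand and return its output."""
    result = subprocess.run(
        ["hdfs", "dfs", *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

hdfs("-mkdir", "-p", "/user/demo/input")                    # create a directory
hdfs("-put", "-f", "local_data.txt", "/user/demo/input/")   # upload a file
print(hdfs("-ls", "/user/demo/input"))                      # list the directory
print(hdfs("-cat", "/user/demo/input/local_data.txt"))      # read it back
```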
3. Hadoop MapReduce Framework
This module will help you to understand Hadoop MapReduce framework:
- How MapReduce works on HDFS data sets
- MapReduce Algorithm
- MapReduce Hadoop Implementation
- Hadoop 2.x MapReduce Architecture
- MapReduce Components
- YARN Workflow
- MapReduce Combiners
- MapReduce Partitioners
- MapReduce Hadoop Administration
- MapReduce APIs
- Input Split & String Tokenizer in MapReduce
- MapReduce Use Cases on Data sets
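To ground the framework, here is the classic word-count pair written for Hadoop Streaming (file names are illustrative). The mapper emits a count of 1 per word:

```python
#!/usr/bin/env python3
# wordcount_mapper.py: reads lines from stdin, emits "word<TAB>1" pairs.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

The reducer sums the counts; Hadoop Streaming delivers the mapper output sorted by key:

```python
#!/usr/bin/env python3
# wordcount_reducer.py: sums the counts for each word.
import sys

current_word, current_count = None, 0

for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    word, count = line.rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

These would typically be submitted with something like `hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -mapper wordcount_mapper.py -reducer wordcount_reducer.py -input /data/in -output /data/out` (the jar path varies by distribution).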
4. Advanced MapReduce Concepts
This module will help you to learn:
- Job Submission & Monitoring
- Distributed Cache
- Map & Reduce Join
- Data Compressors
- Job Configuration
- Record Reader
Batch Processing Tools in Hadoop (Batch ETL)
- Sqoop (Data Ingestion tool)
- Map Reduce
This module will build your concepts in Apache Hive:
- Hive Installation
- Hive Data types
- Hive Architecture & Components
- Hive Meta Store
- Hive Tables(Managed Tables and External Tables)
- Hive Partitioning & Bucketing
- Hive Joins & Sub Query
- Running Hive Scripts
- Hive Indexing & View
- Hive Queries (HQL): ORDER BY, GROUP BY, DISTRIBUTE BY, CLUSTER BY, with examples
- Hive Functions: Built-in & UDF (User Defined Functions)
- Hive ETL: Loading JSON, XML, Text Data Examples
- Hive Querying Data
- Hive Use Cases
- Hive Optimization Techniques
- Partitioning (Static & Dynamic) & Bucketing
- Hive Joins: Map, Bucket-Map, SMB (Sorted Bucket-Map) & Skew Joins
- Hive File Formats (ORC, Sequence, Text, Avro, Parquet)
- Indexing (Compact + BitMap)
- Integration with TEZ & Spark
- Hive SerDe (Custom & Built-in)
- Hive Integration with NoSQL (HBase, MongoDB & Cassandra)
- Thrift API (Thrift Server)
- Hive LATERAL VIEW
- Incremental Updates & Import in Hive
- Hive Functions: LATERAL VIEW EXPLODE, LATERAL VIEW JSON_TUPLE & others
- Hive SCD Strategies: Type 1, Type 2 & Type 3
- UDF, UDTF & UDAF
- Hive Multiple Delimiters
- Loading XML & JSON Data into Hive
- Aggregation & Windowing Functions in Hive
- Hive Connect with Tableau
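A minimal sketch (table & column names are illustrative) of the partitioning and HQL ideas above, using a Hive-enabled SparkSession as one common way to run HQL:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-demo")
         .enableHiveSupport()
         .getOrCreate())

# A managed, partitioned table (static partitioning by country).
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE)
    PARTITIONED BY (country STRING)
    STORED AS ORC
""")

spark.sql("""
    INSERT INTO sales PARTITION (country = 'IN')
    VALUES (1, 250.0), (2, 99.5)
""")

# HQL-style analytics: aggregation with GROUP BY.
spark.sql("""
    SELECT country, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM sales
    GROUP BY country
""").show()
```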
This module will help you to learn Sqoop concepts:
- Sqoop Installation
- Loading Data from RDBMS using Sqoop
- Fundamentals & Architecture of Apache Sqoop
- Sqoop Tools
- Sqoop Import & Import-All-Tables
- Sqoop Job
- Sqoop Codegen
- Sqoop Incremental Import & Incremental Export
- Sqoop Merge
- Sqoop: Hive Import
- Sqoop Metastore
- Sqoop Export
- Import Data from MySQL to Hive using Sqoop
- Sqoop Use Cases
- Sqoop-HCatalog Integration
- Sqoop Script
- Sqoop Connectors
- Batch Processing in Sqoop
- Boundary Queries in Sqoop
- Controlling Parallelism in Sqoop
- Import Join Tables from SQL databases to Warehouse using Sqoop
- Sqoop Hive/HBase/HDFS integration
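A minimal sketch of an incremental Sqoop import driven from Python (connection details, table & paths are placeholders; assumes the `sqoop` CLI and a MySQL JDBC driver are installed):

```python
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/shop",   # hypothetical source DB
    "--username", "etl_user",
    "--password-file", "/user/etl/.mysql.pwd",      # keep secrets out of argv
    "--table", "orders",
    "--target-dir", "/data/raw/orders",
    "--incremental", "append",       # only rows newer than --last-value
    "--check-column", "order_id",
    "--last-value", "0",
    "--num-mappers", "4",            # controls import parallelism
], check=True)
```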
This module will help you to understand Pig Concepts:
- Pig Architecture
- Pig Installation
- Pig Grunt shell
- Pig Running Modes
- Pig Latin Basics
- Pig LOAD & STORE Operators
- Diagnostic Operators
- DESCRIBE Operator
- EXPLAIN Operator
- ILLUSTRATE Operator
- DUMP Operator
- Grouping & Joining
- GROUP Operator
- COGROUP Operator
- JOIN Operator
- CROSS Operator
- Combining & Splitting
- UNION Operator
- SPLIT Operator
- FILTER Operator
- DISTINCT Operator
- FOREACH Operator
- LIMIT Operator
- Built-in Functions
- EVAL Functions
- LOAD & STORE Functions
- Bag & Tuple Functions
- String Functions
- Date-Time Functions
- MATH Functions
- Pig UDFs (User Defined Functions)
- Pig Scripts in Local Mode
- Pig Scripts in MapReduce Mode
- Analysing XML Data using Pig
- Pig Use Cases (Data Analysis on Social Media sites, Banking, Stock Market & Others)
- Analysing JSON data using Pig
- Testing Pig Scripts
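To tie the operators together, a minimal sketch that writes a small Pig Latin script and runs it in local mode (file names & the input schema are illustrative):

```python
import subprocess

# LOAD + GROUP + FOREACH/GENERATE + ORDER + STORE in one short script.
script = """
logs   = LOAD 'access_log.csv' USING PigStorage(',')
         AS (user:chararray, url:chararray, bytes:int);
by_usr = GROUP logs BY user;
stats  = FOREACH by_usr GENERATE group AS user,
         COUNT(logs) AS hits, SUM(logs.bytes) AS total_bytes;
top    = ORDER stats BY hits DESC;
STORE top INTO 'user_stats';
"""

with open("user_stats.pig", "w") as f:
    f.write(script)

# -x local runs against the local filesystem instead of HDFS.
subprocess.run(["pig", "-x", "local", "user_stats.pig"], check=True)
```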
Real Time Processing Tools & Real Time Data Storage inside Data Lake (Real Time ETL):
Apache Flume (Real Time ETL)
This module will help you to learn Flume Concepts:
- Flume Introduction
- Flume Architecture
- Flume Data Flow
- Flume Configuration
- Flume Agent Component Types
- Flume Setup
- Flume Interceptors
- Multiplexing (Fan-Out), Fan-In-Flow
- Flume Channel Selectors
- Flume Sink Processors
- Fetching of Streaming Data using Flume (Social Media Sites: YouTube, LinkedIn, Twitter)
- Flume + Kafka Integration
- Flume Use Cases
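A minimal sketch of a Flume agent configuration (netcat source, memory channel, logger sink) generated and launched from Python; the agent name, port & paths are illustrative:

```python
import subprocess

conf = """
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# netcat source: turns lines arriving on a TCP port into events
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# memory channel: buffers events between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# logger sink: prints events (swap for an hdfs sink in practice)
a1.sinks.k1.type = logger

# wire the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""

with open("netcat-agent.conf", "w") as f:
    f.write(conf)

subprocess.run([
    "flume-ng", "agent",
    "--name", "a1",
    "--conf-file", "netcat-agent.conf",
    "-Dflume.root.logger=INFO,console",
], check=True)
```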
Apache KAFKA (Real Time Data Storage)
This module will help you to learn Kafka concepts:
- Kafka Fundamentals
- Kafka Cluster Architecture
- Kafka Workflow
- Kafka Producer, Consumer Architecture
- Kafka as PUB/SUB model
- Kafka Terminologies / Core APIs:
- Producer / Publishers
- Consumer / Subscribers
- Input Offsets
- Topic Log
- Consumer Groups
- Mirror Maker
- Topic Partition
- Kafka Retention Policy
- Integration with SPARK
- Kafka Topic Architecture
- Zookeeper & Kafka
- Kafka Partitions
- Kafka Consumer Groups
- Kafka External APIs / Confluent Kafka
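To ground the pub/sub model, a minimal sketch with the kafka-python client (broker address, topic & group names are placeholders):

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer / publisher: send a few messages to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("orders", key=str(i).encode(), value=f"order-{i}".encode())
producer.flush()

# Consumer / subscriber: read from the beginning as part of a group.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="billing",               # consumer group for offset tracking
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,         # stop iterating when idle
)
for record in consumer:
    print(record.partition, record.offset, record.key, record.value)
```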
Confluent Kafka External APIs & Related Services
- KAFKA Connect
- KAFKA Rest Proxy
- KAFKA AVRO
- KAFKA Schema Registry
- KAFKA Confluent HUB
- KAFKA Confluent Cloud
- KStream APIs
- Difference between Apache Kafka & Confluent Kafka
- KSQL (SQL Engine for Kafka)
- Developing Real-time application using KStream APIs
- Kafka Connectors
- Kafka Offsets
Orchestration Engine for Scheduling of ETL Jobs (Batch & Real-Time) in Big Data Applications:
This module will help you to understand Oozie concepts:
- Oozie Introduction
- Oozie Workflow Specification
- Oozie Coordinator Functional Specification
- Oozie HCatalog Integration
- Oozie Bundle Jobs
- Oozie CLI Extensions
- Automate MapReduce, Pig, Hive, Sqoop Jobs using Oozie
- Packaging & Deploying an Oozie Workflow Application
This module will help you to understand Apache Airflow concepts:
- Apache Airflow Installation
- Work Flow Design using Airflow
- Airflow DAG
- Module Import in Airflow
- Airflow Applications
- Docker Airflow
- Airflow Pipelines
- Airflow KUBERNETES Integration
- Automating Batch & Real Time Jobs using Airflow
- Data Profiling using Airflow
- Airflow Integration:
- AWS EMR
- AWS S3
- AWS Redshift
- AWS DynamoDB
- AWS Lambda
- AWS Kinesis
- Scheduling of PySpark Jobs using Airflow
- Airflow Orchestration
- Airflow Schedulers & Triggers
- Gantt Chart in Apache Airflow
- Executors in Apache Airflow
- Airflow Metrics
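To make the DAG ideas concrete, a minimal Airflow 2-style sketch (dag_id, schedule & task commands are illustrative placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_sales_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # cron preset or cron expression
    catchup=False,                # don't backfill past runs
) as dag:
    extract = BashOperator(
        task_id="sqoop_import",
        bash_command="echo 'sqoop import ...'",     # placeholder command
    )
    transform = BashOperator(
        task_id="spark_transform",
        bash_command="echo 'spark-submit etl.py'",  # placeholder command
    )
    extract >> transform   # run extract before transform
```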
Commonly used NOSQL Databases in Big Data Pipelines:
This module will help you to learn HBase Architecture:
- HBase Architecture, Data Flow & Use Cases
- Apache HBase Configuration
- HBase Shell & general commands
- HBase Schema Design
- HBase Data Model
- HBase Region & Master Server
- HBase & MapReduce
- Bulk Loading in HBase
- Create, Insert, Read Tables in HBase
- HBase Admin APIs
- HBase Security
- HBase vs Hive
- Backup & Restore in HBase
- Apache HBase External APIs (REST, Thrift, Scala)
- HBase & SPARK
- Apache HBase Coprocessors
- HBase Case Studies
- HBase Troubleshooting
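A minimal sketch of the HBase data model using the happybase client, which talks to HBase through the Thrift server (host, table & column-family names are illustrative):

```python
import happybase

connection = happybase.Connection("localhost")  # Thrift server host

# Create a table with one column family if it doesn't exist yet.
if b"users" not in connection.tables():
    connection.create_table("users", {"info": dict()})

table = connection.table("users")

# Put: row key + column-family:qualifier -> value (all bytes).
table.put(b"user#1001", {b"info:name": b"Asha", b"info:city": b"Gurgaon"})

# Get a single row, then scan a key range.
print(table.row(b"user#1001"))
for key, data in table.scan(row_start=b"user#", row_stop=b"user$"):
    print(key, data)
```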
This module will help you to learn Cassandra concepts:
- Cassandra Installation
- Cassandra Architecture Layers & Related Components
- Cassandra Configuration
- Operating Cassandra
- Cassandra Tools
- Cassandra Stress
- Partitioners in Cassandra & Bloom Filters
- Tuning Cassandra Performance
- Read/Write Cassandra
- Cassandra Queries (CQL)
- Cassandra Compaction Strategies
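A minimal sketch with the DataStax cassandra-driver (contact point, keyspace & table names are illustrative):

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS shop
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("shop")

session.execute("""
    CREATE TABLE IF NOT EXISTS orders (
        user_id int, order_id int, amount double,
        PRIMARY KEY (user_id, order_id)  -- partition key + clustering key
    )
""")

# Writes and reads with a prepared statement (CQL).
insert = session.prepare(
    "INSERT INTO orders (user_id, order_id, amount) VALUES (?, ?, ?)")
session.execute(insert, (1, 100, 42.5))

for row in session.execute("SELECT * FROM orders WHERE user_id = 1"):
    print(row.user_id, row.order_id, row.amount)
```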
Data Processing with Apache Spark
Spark performs in-memory data processing, which is why a Spark job runs much faster than the equivalent Hadoop MapReduce job. The course will also help you understand the Spark ecosystem & its related APIs: Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX & the Spark Core concepts. This course will help you understand data analytics & machine learning algorithms, applying them to various datasets to process & analyze large amounts of data.
- Spark RDDs.
- Spark RDD Actions & Transformations.
- Spark SQL: connectivity with various relational sources & converting them into DataFrames using Spark SQL.
- Spark Streaming
- Understanding role of RDD
- Spark Core concepts: creating RDDs (parallelized RDDs, MappedRDD, HadoopRDD, JdbcRDD).
- Spark Architecture & Components.
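A minimal sketch of the RDD and Spark SQL concepts listed above (the data is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-df").getOrCreate()
sc = spark.sparkContext

# RDD: transformations (map, filter) are lazy; actions (collect) execute.
rdd = sc.parallelize([1, 2, 3, 4, 5])          # a parallelized collection
squares = rdd.map(lambda x: x * x).filter(lambda x: x > 4)
print(squares.collect())                        # [9, 16, 25]

# The same idea through a DataFrame + Spark SQL.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "grp"])
df.createOrReplaceTempView("t")
spark.sql("SELECT grp, COUNT(*) AS n FROM t GROUP BY grp").show()
```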
Big Data Related Services on AWS Cloud:
Brief Introduction of some AWS services:
- AWS Lambda:
- AWS Lambda Introduction
- Creating Data Pipelines using AWS Lambda & Kinesis
- AWS Lambda Functions
- AWS Lambda Deployment
- AWS Glue:
- Glue Context
- AWS Data Catalog
- AWS Athena
- AWS QuickSight
- AWS Kinesis
- AWS S3
- AWS Redshift
- AWS EMR & EC2
- AWS ECR & AWS Kubernetes
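As a brief illustration of the Lambda + Kinesis pipeline idea, a minimal handler sketch (the event shape is the standard Kinesis trigger format; the processing itself is a stub):

```python
import base64
import json

def handler(event, context):
    """Triggered by a Kinesis event source mapping."""
    for record in event["Records"]:
        # Kinesis payloads arrive base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        message = json.loads(payload)
        # ... transform / enrich, then write to S3, Redshift, etc.
        print(record["kinesis"]["partitionKey"], message)
    return {"processed": len(event["Records"])}
```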
Deploying Big Data Applications in a Production Environment using Docker & Kubernetes:
- How to manage & Monitor Apache Spark on Kubernetes
- Spark Submit vs Kubernetes Operator
- How Spark Submit works with Kubernetes
- How the Kubernetes Operator for Spark works
- Setting up of Hadoop Cluster on Docker
- Deploying MR, Sqoop & Hive jobs inside a Dockerized Hadoop environment.
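A minimal sketch of how spark-submit targets a Kubernetes master (the API server URL, image & application path are placeholders):

```python
import subprocess

subprocess.run([
    "spark-submit",
    "--master", "k8s://https://k8s-apiserver:6443",   # hypothetical API server
    "--deploy-mode", "cluster",
    "--name", "etl-on-k8s",
    "--conf", "spark.executor.instances=2",
    "--conf", "spark.kubernetes.container.image=myrepo/spark-app:latest",
    "--conf", "spark.kubernetes.namespace=spark-jobs",
    # local:// means the file already lives inside the container image.
    "local:///opt/spark/work-dir/etl.py",
], check=True)
```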
Docker Overview for Deploying Big Data Applications:
1) Docker Installation
2) Docker Hub
3) Docker Images
4) Docker Containers & Shells
5) Working with Docker Containers
6) Docker Architecture
7) Docker Push & Pull containers
8) Docker Container & Hosts
9) Docker Configuration
10) Docker Files (DockerFile)
11) Docker Building Files
12) Docker Public Repositories
13) Docker Private Registries
14) Building WebServer using DockerFile
15) Docker Commands
16) Docker Container Linking
17) Docker Storage
18) Docker Networking
19) Docker Cloud
20) Docker Logging
21) Docker Compose
22) Docker Continuous Integration
23) Docker Kubernetes Integration
24) Docker Working of Kubernetes
25) Docker on AWS
Kubernetes Overview for Deploying Big Data Applications:
1) Learn Kubernetes Basics
2) Kubernetes Installation
3) Kubernetes Architecture
4) Kubernetes Master Server Components
5) Kubernetes Node Server Components
a) A container runtime
6) Kubernetes Objects & Workloads
a) Kubernetes Pods
b) Kubernetes Replication Controller & Replica Sets
7) Kubernetes Images
8) Kubernetes Labels & Selectors
9) Kubernetes Namespace
10) Kubernetes Service
11) Kubernetes Deployments
a) Stateful Sets
b) Daemon Sets
c) Jobs & Cron Jobs
12) Other Kubernetes Components:
a) Volume & Persistent Volumes
b) Labels, Selectors & Annotations
c) Kubernetes Secrets
d) Kubernetes Network Policy
13) Kubernetes API
14) Kubernetes Kubectl
15) Kubernetes Kubectl Commands
16) Kubernetes Creating an App
17) Kubernetes App Deployment
18) Kubernetes Autoscaling
19) Kubernetes Dashboard Setup
20) Kubernetes Monitoring
21) Federation using kubefed
IMPORTANT POINTS THAT WILL BE DISCUSSED DURING LAST PHASE OF TRAINING:
- CREATING DATA PIPELINES USING CLOUD & ON-PREMISE INFRASTRUCTURES
- HYBRID INFRASTRUCTURES
- Automation Test Cases (TDD & BDD Test Cases for Spark Applications)
- Data Security
- Data Governance
- Deployment Automation using CI/CD Pipelines with Docker & KUBERNETES.
- DESIGNING BATCH & REAL-TIME APPLICATIONS
- LATENCY & OPTIMIZATION in REAL-TIME APPLICATIONS
- Resolving Latency & Optimization issues using other Streaming Platforms
BIG DATA PROJECTS
- BIG DATA Playground (Big Data based on Docker & KUBERNETES)
Environment: Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Scala, Kafka, Hive
Tools & Techniques used: Hadoop + HBase + Spark + Flink + Beam + ML stack, Docker & Kubernetes, Kafka, MongoDB, Avro, Parquet
Description: The aim is to create a Batch/Streaming/ML/WebApp stack where you can test your jobs locally or submit them to the YARN resource manager. We use Docker to build the environment and Docker Compose to provision it with the required components (the next step uses Kubernetes). Along with the infrastructure, we check that it works with 4 projects that simply probe that everything is working as expected. The boilerplate is based on a sample flight-search web application.
- Chain Based Credit Card Fraud Detection / Customer Insights 360
Environment: Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Scala, Kafka, Hive, Amazon AWS, Elastic Search, Zookeeper
Tools & Techniques used: PySpark MLlib, Spark Streaming, Python (Jupyter Notebook, Anaconda), Machine Learning packages (NumPy, Pandas, Matplotlib, Seaborn, scikit-learn), Random Forest and Gradient Boost, Confusion Matrix, Tableau
Description: Build a predictive model which will predict fraud transactions on PLCC & DC cards on a daily basis. This includes data extraction, then data cleaning, followed by data pre-processing. Pre-processing includes standard scaling, meaning normalizing the data, followed by cross-validation techniques to check the compatibility of the data. In data modeling, a Decision Tree with Random Forest and Gradient Boost hyper-parameter tuning techniques is used to tune the model. In the end, the model is evaluated by measuring the confusion matrix, with an accuracy of 98%, and the trained model shows all the fraud transactions on PLCC & DC cards on a Tableau dashboard.
After completion of the course, you will be able to analyze large datasets & will work on a live project using Pig, HBase, Hive & MapReduce to perform the analysis.
We will work on case studies related to domains like Finance, Media, Stocks & more.
Case #1: Working with MapReduce, Pig, Hive & Flume
Problem Statement: Fetch structured & unstructured datasets from various sources (social media sites, web servers, and structured sources such as MySQL & Oracle), dump them into HDFS, and then analyze the same datasets using Pig, HQL queries & MapReduce to gain proficiency in the Hadoop stack & its ecosystem tools.
Data Analysis Steps:
1. Dump XML & JSON datasets into HDFS.
2. Convert semi-structured data formats (JSON & XML) into structured format using Pig, Hive & MapReduce.
3. Push the datasets into the Pig & Hive environments for further analysis.
4. Write Hive queries to push the output into a relational database (RDBMS) using Sqoop.
5. Render the results as box plots, bar graphs & others using R & Python integration with Hadoop.
Case #2: Analyze Stock Market Data
Data: The dataset contains stock information from the New York Stock Exchange, such as daily quotes, each stock's highest price & its opening price.
Problem Statement: Calculate the covariance for the stock data to solve the storage & processing problems that come with huge volumes of data.
a) Positive covariance: if investment instruments or stocks tend to be up or down during the same time periods, they have positive covariance.
b) Negative covariance: if returns move inversely, i.e. one investment tends to be up while the other is down, they show negative covariance.
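A worked micro-example of both cases with NumPy (the daily returns are made-up numbers):

```python
import numpy as np

stock_a = np.array([0.01, 0.02, -0.01, 0.03])    # daily returns of stock A
stock_b = np.array([0.02, 0.03, -0.02, 0.04])    # moves with A
stock_c = np.array([-0.01, -0.02, 0.01, -0.03])  # moves against A

print(np.cov(stock_a, stock_b)[0, 1])   # > 0: positive covariance
print(np.cov(stock_a, stock_c)[0, 1])   # < 0: negative covariance
```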
Case #3: Hive, Pig & MapReduce with New York City Uber Trips
Problem Statement: What was the busiest dispatch base by trips for a particular day in the entire month? Which day had the most active vehicles? Which day had the most trips, sorted from most to fewest? Dispatching_Base_Number is the NYC Taxi & Limousine company code of the base that dispatched the Uber; active_vehicles shows the number of active Uber vehicles for a particular date & company (base); Trips is the number of trips for a particular base & date.
Case #4: Analyze Tourism Data
1. Top 20 destinations tourists frequently travel to: based on the given data, we can find the most popular destinations where people travel frequently, using the number of trips booked for each destination.
2. Top 20 high air-revenue destinations, i.e. the 20 cities that generate the highest airline revenues, so that discount offers can be given to attract more bookings for these destinations.
3. Top 20 locations from where most of the trips start based on booked trip count.
Case #5: Airport Flight Data Analysis
1. List of delayed flights.
2. Find flights with zero stops.
3. List of active airlines across all countries.
4. Source & destination details of flights.
5. Reasons why flights get delayed.
6. Time in different formats.
Case #6: Analyze Movie Ratings
Problem Statement: Analyze the movie ratings by different users to:
1. Get the user who has rated the most number of movies
2. Get the user who has rated the least number of movies
3. Get the count of total number of movies rated by user belonging to a specific occupation
4. Get the number of underage users
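As an illustration for the first task, a minimal PySpark sketch (assumes a MovieLens-style ratings file with user_id & movie_id columns):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ratings").getOrCreate()

ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)

# User who has rated the most movies (use ascending order for the least).
(ratings.groupBy("user_id")
        .agg(F.count("movie_id").alias("n_rated"))
        .orderBy(F.desc("n_rated"))
        .show(1))
```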
Case #7: Analyze Social Media Channels :
Data: Dataset columns: VideoId, Uploader, Interval between the day YouTube was established & the date the video was uploaded, Category, Length, Rating, Number of Comments.
Problem Statement: Identify the top 5 categories in which the most number of videos are uploaded, the top 10 rated videos, and the top 10 most viewed videos.