Transform into a Big Data expert with eVani's comprehensive program. Master core concepts, tools, and technologies such as Hadoop, Spark, Scala, and PySpark. Gain hands-on experience through real-world projects. Build scalable data pipelines and perform complex data analysis. Become proficient in data warehousing, data mining, and machine learning. Acquire in-demand skills for high-paying Big Data roles. Join us to unlock your potential in the data-driven world.
eVani's Big Data Champion program is designed to equip professionals with the skills and knowledge required to excel in the field of Big Data engineering. This comprehensive program covers the core competencies of a Big Data engineer, delving deep into Apache Spark, Scala, and PySpark. Through a blend of theoretical understanding and hands-on practical experience, participants will gain expertise in handling massive datasets, performing complex data processing, and building scalable data pipelines.
Learning Objectives
By the end of this course, participants will be able to:
· Understand the fundamentals of Big Data and its applications.
· Learn about various Big Data technologies and architectures.
· Develop skills in data warehousing and data mining.
· Gain proficiency in Big Data analytics tools and techniques.
Prerequisites
· Basic knowledge of programming (Python or Java) and SQL.
· Familiarity with general IT concepts and practices.
Target Audience
· IT professionals seeking to expand their Big Data and cloud skills.
· Anyone looking to enhance their career prospects with Big Data expertise.
· Aspiring data analysts, data engineers, software developers, and professionals looking to enter the Big Data domain.
Curriculum

Introduction to Big Data
• What is Big Data?
• Characteristics of Big Data (Volume, Velocity, Variety, Veracity).
• Categories of Big Data.
• Big Data technologies.
• Challenges of traditional data processing systems.
• Big Data applications in various industries.
Hadoop and HDFS
• What is Hadoop? Why Hadoop?
• The Hadoop ecosystem.
• Hadoop distributions (Cloudera, Hortonworks, MapR).
• Hadoop architecture: NameNode, DataNode, Secondary NameNode, and related components.
• Data replication and fault tolerance.
• RDBMS vs. Hadoop.
• Introduction to the Hadoop Distributed File System (HDFS).
• HDFS commands.
MapReduce
• What is MapReduce?
• Traditional data processing vs. the MapReduce model.
• MapReduce components.
• How MapReduce works (see the sketch below).
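To make the MapReduce model concrete, here is a minimal word-count job written for Hadoop Streaming, which lets mappers and reducers be ordinary Python scripts that read stdin and write stdout. This is an illustrative sketch rather than course material; Hadoop Streaming sorts the mapper output by key before it reaches the reducer, which is what the reducer relies on.

```python
#!/usr/bin/env python3
# mapper.py -- emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- input arrives sorted by key, so all counts for a word are adjacent.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

The pair can be tested locally without a cluster: `cat input.txt | python3 mapper.py | sort | python3 reducer.py`.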
Apache Pig
• The Pig Latin scripting language.
• Loading, transforming, and storing data.
• Using UDFs and custom functions.
Apache Hive
• Introduction to HiveQL.
• Creating and managing tables.
• Data definition language (DDL) and data manipulation language (DML).
• Hive optimization techniques.
Introduction to NoSQL and HBase
• NoSQL databases vs. relational databases.
• The HBase data model (row key, column family, column qualifier, value).
• HBase architecture.

HBase Operations
• Creating tables and regions.
• Reading and writing data.
• HBase shell commands.
• HBase integration with Hive.
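As a taste of programmatic access, the sketch below uses happybase, a third-party Python client that talks to HBase through its Thrift server. The host, table name, and column family are assumptions made purely for illustration.

```python
import happybase

# Connect to an HBase Thrift server (host and port are placeholders).
connection = happybase.Connection("hbase-host", port=9090)

# Create a table with a single column family named "profile".
connection.create_table("users", {"profile": dict()})
table = connection.table("users")

# Write one row: the key is "row-1", cells are family:qualifier pairs.
table.put(b"row-1", {b"profile:name": b"Asha", b"profile:city": b"Pune"})

# Read it back by row key.
row = table.row(b"row-1")
print(row[b"profile:name"])  # b'Asha'
```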
Data Ingestion with Sqoop and Flume
Sqoop:
• Importing data from relational databases into HDFS.
• Exporting data from HDFS to relational databases.
• Incremental loads and full loads.
Flume:
• Designing data collection pipelines.
• Handling various data sources (logs, files, etc.).
• Flume agents and channels.
Workflow Orchestration: Oozie and Airflow
• Oozie architecture and components.
• Creating and executing workflows.
• Coordinating MapReduce, Hive, Pig, and Sqoop jobs.
• Error handling and retries.
• Apache Airflow architecture.
• Executing workflows with Apache Airflow.
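Because Airflow workflows are defined in Python, a minimal DAG makes a natural illustration. The sketch below chains a Sqoop import and a Hive script; the DAG id, commands, and schedule are placeholders, and the `schedule` argument assumes Airflow 2.4 or newer (older releases use `schedule_interval`).

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Retry failed tasks twice, five minutes apart.
default_args = {"retries": 2, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="daily_ingest",            # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="sqoop_import",
        bash_command="sqoop import --connect jdbc:mysql://db/sales --table orders",
    )
    transform = BashOperator(
        task_id="hive_transform",
        bash_command="hive -f transform_orders.hql",
    )
    ingest >> transform  # run the Hive step only after the import succeeds
```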
Scala Fundamentals
• Introduction to Scala.
• Core language features: variables, data types, operators.
• Control flow statements (if-else, loops).
• Functions and higher-order functions.
• Object-oriented programming concepts (classes, objects, inheritance, polymorphism).
• Functional programming concepts (immutability, pattern matching, closures).
• Collections (Lists, Maps, Sets, Tuples).
Apache Spark Core
• Introduction to Apache Spark.
• Spark architecture and components.
• Resilient Distributed Datasets (RDDs).
• Transformations and actions.
• SparkContext and SparkSession.
• Data loading and saving.
• Caching and persistence.
• Shared variables (broadcast variables, accumulators).
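Shared variables are the least self-explanatory item in this module, so here is a minimal sketch. It uses the PySpark API for consistency with the later modules (the Scala API is equivalent); the lookup table and inputs are invented for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-vars").getOrCreate()
sc = spark.sparkContext

# Broadcast: ship a read-only lookup table to every executor once.
country_codes = sc.broadcast({"IN": "India", "US": "United States"})

# Accumulator: a write-only counter the driver can read after an action.
unknown = sc.accumulator(0)

def resolve(code):
    name = country_codes.value.get(code)
    if name is None:
        unknown.add(1)
    return name or "unknown"

rdd = sc.parallelize(["IN", "US", "XX", "IN"])
print(rdd.map(resolve).collect())  # ['India', 'United States', 'unknown', 'India']
print(unknown.value)               # 1
spark.stop()
```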
Spark SQL
• Introduction to Spark SQL.
• DataFrames and Datasets.
• SQL-like operations on DataFrames.
• Creating DataFrames from various sources.
• Schema manipulation.
• Advanced SQL queries and optimizations.
• Integration with Hive.
Spark Streaming
• Introduction to Spark Streaming.
• Discretized Streams (DStreams).
• Input and output sources.
• Transformations and output operations.
• State management.
• Checkpointing and recovery.
• Integration with Kafka.
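A minimal DStream example, sketched in PySpark for consistency with the later modules: it counts words arriving on a TCP socket in five-second micro-batches. The host and port are placeholders (feed it with `nc -lk 9999` while testing); DStreams are the classic API, with Structured Streaming covered under Advanced Topics.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-wordcount")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's counts to the console

ssc.start()
ssc.awaitTermination()
```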
Spark MLlib
• Introduction to Spark MLlib.
• MLlib pipelines.
• Classification and regression algorithms.
• Clustering algorithms.
• Collaborative filtering.
• Feature extraction and transformation.
• Model evaluation and tuning.
Spark GraphX
• Introduction to GraphX.
• Graph representation and operations.
• Graph algorithms.
• PageRank.
• Connected components.
• Triangle counting.
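GraphX itself exposes only a Scala/Java API; from Python, the usual route is the separate GraphFrames package. The sketch below, assuming GraphFrames is installed (for example via `--packages`), builds a tiny three-node graph and runs PageRank on it; the data is invented.

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graph-demo").getOrCreate()

# Vertices need an `id` column; edges need `src` and `dst` columns.
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Cara")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"])
g = GraphFrame(vertices, edges)

# PageRank: rank vertices by the link structure of the graph.
ranks = g.pageRank(resetProbability=0.15, maxIter=10)
ranks.vertices.select("id", "pagerank").show()
```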
Spark Performance Tuning
• Understanding Spark performance metrics.
• Identifying performance bottlenecks.
• Data partitioning and shuffling.
• Caching and persistence optimization.
• Resource allocation and configuration.
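A few of these levers in one place, as a PySpark sketch; the data path, column names, and partition count are placeholders chosen for illustration.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

df = spark.read.parquet("/data/events")  # placeholder path

# Repartition by the join key so later joins and aggregations shuffle less.
df = df.repartition(200, "user_id")

# Persist a DataFrame that several downstream queries will reuse.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  # an action, run here to materialize the cache

# Inspect the physical plan to spot bottlenecks (full scans, wide shuffles).
df.groupBy("user_id").count().explain()
```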
PySpark Fundamentals
• Python programming basics: data types, control flow, functions, object-oriented programming.
• Introduction to Apache Spark and its architecture.
• PySpark environment setup and configuration.
• Understanding RDDs (Resilient Distributed Datasets).
• Core PySpark operations: transformations and actions.
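A minimal sketch of the transformation/action distinction at the heart of this module: transformations such as `map` and `filter` only build a lazy lineage, and nothing executes until an action such as `collect` or `reduce` runs.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 11))

# Transformations are lazy: these lines run no cluster work yet.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions trigger execution and return results to the driver.
print(evens.collect())                     # [4, 16, 36, 64, 100]
print(squares.reduce(lambda a, b: a + b))  # 385
spark.stop()
```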
Data Ingestion and Manipulation
• Reading data from various sources: CSV, JSON, Parquet, text files, databases.
• Writing data to different formats.
• Data cleaning and preprocessing: handling missing values, outliers, and inconsistencies.
• Data exploration and analysis using the pandas API on PySpark DataFrames.
• Creating custom data types and user-defined functions (UDFs).
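A typical read-clean-write round trip might look like the sketch below; the file paths, column names, and cleaning rules are assumptions made for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ingest-demo").getOrCreate()

# Read a CSV with a header row, letting Spark infer column types.
df = spark.read.csv("/data/sales.csv", header=True, inferSchema=True)

# Basic cleaning: drop rows missing the key, fill numeric gaps,
# and null-out implausible outliers.
df = (df.dropna(subset=["order_id"])
        .fillna({"quantity": 0})
        .withColumn("amount",
                    F.when(F.col("amount") > 1_000_000, None)
                     .otherwise(F.col("amount"))))

# Write the cleaned data as Parquet for efficient downstream reads.
df.write.mode("overwrite").parquet("/data/sales_clean")
```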
PySpark SQL
• Introduction to Spark SQL.
• Creating DataFrames and Datasets.
• SQL-like operations on DataFrames.
• Advanced SQL queries and optimizations.
• Working with complex data structures.
• Integrating PySpark SQL with Hive.
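Registering a DataFrame as a temporary view lets you mix the DataFrame and SQL APIs freely; a minimal sketch with made-up data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

df = spark.createDataFrame(
    [("Asha", "IN", 120), ("Bob", "US", 80), ("Chen", "IN", 200)],
    ["name", "country", "spend"])

# Register the DataFrame as a temporary view and query it with plain SQL.
df.createOrReplaceTempView("customers")
top = spark.sql("""
    SELECT country, SUM(spend) AS total_spend
    FROM customers
    GROUP BY country
    ORDER BY total_spend DESC
""")
top.show()
```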
Machine Learning with PySpark
• Introduction to machine learning with PySpark.
• Data preparation for machine learning.
• Feature engineering and selection.
• Classification algorithms (Logistic Regression, Decision Trees, Random Forest).
• Regression algorithms (Linear Regression, Decision Trees, Random Forest).
• Clustering algorithms (K-Means, Gaussian Mixture Models).
• Model evaluation and tuning.
• Pipeline creation and deployment.
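An end-to-end pipeline in miniature: assemble features, fit a logistic regression, and score the result. The toy dataset and column names are invented, and it scores on its own training data purely for brevity; a real exercise would train and evaluate on separate splits.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("ml-pipeline-demo").getOrCreate()

# Toy dataset: two numeric features and a binary label.
df = spark.createDataFrame(
    [(1.0, 0.5, 1.0), (0.2, 0.1, 0.0), (0.9, 0.8, 1.0), (0.1, 0.4, 0.0)],
    ["f1", "f2", "label"])

# Stage 1: combine raw columns into a single feature vector.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
# Stage 2: fit a logistic regression on that vector.
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

model = pipeline.fit(df)
predictions = model.transform(df)  # scored on training data, for brevity only

evaluator = BinaryClassificationEvaluator(labelCol="label")
print("AUC:", evaluator.evaluate(predictions))
spark.stop()
```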
PySpark Streaming
• Introduction to Spark Streaming.
• Creating DStreams (Discretized Streams).
• Input and output operations.
• State management and updates.
• Windowing and aggregation (see the sketch below).
• Real-time data processing pipelines.
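Windowing is the item here that most benefits from code. The sketch below counts words over a sliding 60-second window that advances every 20 seconds; the inverse-reduce function and the checkpoint directory are what make the incremental window computation possible. The source host/port and paths are placeholders.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="windowed-counts")
ssc = StreamingContext(sc, batchDuration=10)
ssc.checkpoint("/tmp/stream-ckpt")  # required for windowed/stateful operations

lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
pairs = lines.flatMap(lambda l: l.split()).map(lambda w: (w, 1))

# Sliding window: 60 seconds wide, recomputed every 20 seconds.
windowed = pairs.reduceByKeyAndWindow(
    lambda a, b: a + b,   # add counts entering the window
    lambda a, b: a - b,   # subtract counts leaving the window
    windowDuration=60,
    slideDuration=20)
windowed.pprint()

ssc.start()
ssc.awaitTermination()
```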
Building Data Pipelines
• Building end-to-end data pipelines.
• Orchestration tools (Airflow, Luigi).
• Data quality and validation.
• Performance optimization techniques.
• Debugging and troubleshooting.
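One way to express data quality and validation is as a small gate that a pipeline stage must pass before publishing its output; the rules, path, and column names below are illustrative only.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-check").getOrCreate()
df = spark.read.parquet("/data/sales_clean")  # placeholder path

# Validation rules: no null keys, no negative amounts, at least one row.
checks = {
    "null order_id": df.filter(F.col("order_id").isNull()).count(),
    "negative amount": df.filter(F.col("amount") < 0).count(),
}
assert df.count() > 0, "dataset is empty"
for rule, violations in checks.items():
    assert violations == 0, f"data-quality rule failed: {rule} ({violations} rows)"
print("all data-quality checks passed")
```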
Advanced Topics
• Graph processing from PySpark (GraphFrames).
• Spark Structured Streaming.
• Distributed deep learning with PySpark.
• Cloud integration (AWS, Azure, GCP).
• Big Data project case studies.
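For contrast with the DStream examples above, here is the same word count written against Structured Streaming, where the stream is treated as an unbounded DataFrame; the socket source is a demo-only input and the host/port are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("structured-streaming").getOrCreate()

# Read a stream of text lines from a socket.
lines = (spark.readStream.format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

# Split lines into words and keep a running count per word.
words = lines.select(F.explode(F.split(F.col("value"), " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the running counts to the console.
query = (counts.writeStream
               .outputMode("complete")
               .format("console")
               .start())
query.awaitTermination()
```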
Capstone Projects
• Project 1
• Project 2