Training Course

Designing Big Data solutions using Apache Hadoop

Course intended for:

The training is aimed at developers, architects and application administrators who want to build or maintain systems based on scalable Big Data architectures, in particular specialists for whom performance and the volume of processed data are the highest priorities. It is also aimed at people currently working with relational databases who want to learn about the alternative technologies that are gradually displacing relational databases in various application areas, as well as anyone who wants to deepen their knowledge of Big Data, MapReduce and NoSQL concepts and their implementation in Apache Hadoop & Family software.

Course objective:

Course participants will gain knowledge of cross-cutting concepts such as the MapReduce algorithm, Big Data, BigTable, distributed file systems (DFS) and NoSQL databases. After the training, participants will be able to choose the right techniques for their projects. In addition to a general introduction to Big Data, the training focuses on the entire Apache Hadoop stack.

Course strengths:

The training program includes a general introduction to Big Data as well as an overall presentation of the Apache Hadoop stack. The training is unique because its subject matter is not fully covered in the literature, and knowledge of Big Data and NoSQL remains highly fragmented. The program is updated continuously to keep pace with the rapid development of Big Data solutions.


Participants are required to have basic knowledge of databases and basic programming skills in Java.

Course parameters:

5 * 7 hours of lectures and workshops at a ratio of 1:3. During the workshops, in addition to simple exercises, participants will solve problems by implementing their own data processing algorithms using the MapReduce paradigm, modeling NoSQL data structures and performing basic administrative tasks in Apache Hadoop. Group size: max. 8-10 people.
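The MapReduce paradigm used in the workshop exercises can be sketched in plain Java, without any Hadoop dependency: map emits (key, value) pairs from each input record, the framework groups pairs by key (shuffle), and reduce aggregates each group. The class and method names below are illustrative only, a minimal word-count sketch rather than actual course material:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountSketch {
    // Counts word occurrences across input lines using the three
    // MapReduce phases: map, shuffle (group by key), reduce (sum).
    public static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                // map phase: split each input record into (word, 1) pairs
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(w -> !w.isEmpty())
                // shuffle + reduce: group identical keys, aggregate the values
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = wordCount(Arrays.asList("big data", "big table"));
        System.out.println(counts.get("big"));  // 2
        System.out.println(counts.get("data")); // 1
    }
}
```

In Hadoop proper, the map and reduce steps run as separate distributed tasks over HDFS blocks, and the shuffle moves data between cluster nodes; the single-JVM stream pipeline above only models the data flow.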

Course Agenda
  1. Introduction

    1. Big Data, BigTable, BigQuery, MapReduce

    2. MapReduce in detail

    3. MapReduce compared to other distributed processing techniques such as MPI, PVM etc.

    4. Apache Hadoop & Family

  2. Apache Hadoop

    1. Architecture

    2. Hadoop 1.0 vs 2.0

    3. Hadoop Shell Commands

    4. Apache Hadoop Distributed File System (HDFS)

      1. Architecture, NameNodes, DataNodes

      2. Federation and clustering

      3. File attributes

      4. Snapshots

      5. WebHDFS, HttpFS, FUSE

      6. Comparison to other distributed file systems

    5. Apache Hadoop NextGen MapReduce (YARN, MRv2)

      1. Architecture

        1. ResourceManager

        2. Scheduler

        3. ApplicationsManager

        4. JobTracker and TaskTracker

      2. YARN shell

      3. Hadoop/YARN API

      4. YARN REST API

      5. MapReduce 1.0 vs MapReduce 2.0, API compatibility

      6. Examples

    6. Apache Hadoop administration

      1. Installation and configuration

      2. Daemons, configuration files, log files

      3. Hadoop On Demand, Hadoop Cluster Setup

      4. HDFS administration

        1. File attributes

        2. Quotas

      5. MapReduce administration

        1. Jobs management

        2. Scheduling

      6. Cluster rebalancing

      7. Monitoring

      8. Administration tools

  3. Apache PIG

    1. Introduction

      1. Architecture

      2. Work modes

      3. PigLatin

      4. Hadoop/YARN API and PigLatin

    2. PigLatin in detail

      1. Syntax

      2. Datatypes

      3. Operators

      4. Built-in and user defined functions

    3. Built-in functions

      1. Simple (eval functions)

      2. Data management

      3. Mathematical

      4. Strings

      5. Date time

      6. Other

    4. User defined functions (UDF)

      1. UDF in Java

      2. UDF in JavaScript

      3. UDF in Python/Jython/Groovy

      4. Piggybank

    5. Efficiency

      1. Combiner

      2. Multi-Query Execution

      3. Optimization rules

      4. Good practices

    6. Testing and troubleshooting

      1. Diagnostics

      2. Statistics

      3. PigUnit

  4. Apache HBase

    1. Introduction

      1. Introduction to NoSQL databases

      2. Reasons for the increasing popularity of cloud databases

      3. Consistency, availability, resistance to partitioning

      4. CAP theorem

      5. What distinguishes NoSQL database from relational databases

      6. Basic parameters of NoSQL databases

      7. Classification and overview of NoSQL databases (Cassandra, HBase, MongoDB, Riak, CouchDB, Tokyo Cabinet, Voldemort, etc.)

      8. Transactions and replication in NoSQL databases, including MongoDB

      9. Unique features of HBase

    2. HBase architecture

      1. Catalogs

      2. Master Servers

      3. Regions and Region Servers

    3. Data model

      1. Conceptual and physical

      2. Namespaces

      3. Table

      4. Row

      5. Column

      6. Version

      7. Cell

    4. HBase

      1. HBase API

      2. HBase in Apache Hadoop and MapReduce jobs

      3. REST API, Apache Thrift etc.

    5. Performance

      1. Read optimization

      2. Write optimization

      3. JVM/OS/DFS tuning

      4. Good practices

    6. Troubleshooting

      1. Log files

      2. Tools

    7. Security

      1. Authentication and authorization

      2. Data security

    8. HBase administration

      1. Installation and configuration

      2. Frequent administration tasks (operations manual)

      3. Upgrade, migration, backup, data snapshots

      4. Adding/removing nodes to/from a replica/cluster, node resynchronization

      5. Administration panels and tools

    9. Apache HBase versus other NoSQL databases

      1. Apache Accumulo

      2. Apache Cassandra

  5. Apache Hive

    1. Architecture

    2. Hive features

    3. HiveCLI

    4. HiveQL

    5. PigLatin vs HiveQL

    6. Tables in Hive

    7. Hive administration

      1. Installation and configuration

        1. Hive Metastore

        2. HCatalog

        3. WebHCat

      2. Frequent administration tasks

      3. Upgrade

      4. Administration panels and tools

  6. Apache Avro

    1. Apache Avro IDL

    2. Datatypes

    3. Serialization

    4. Avro RPC

  7. Apache Mahout

    1. Machine learning, data mining

    2. Mahout

      1. Classification algorithms

      2. Clustering algorithms

      3. Evolutionary and genetic algorithms

      4. Dimensionality reduction

      5. Others

    3. Installation and configuration

    4. Apache Mahout and Apache Hadoop

    5. Examples

  8. Data oriented applications

    1. Apache Oozie

      1. MapReduce

      2. Pig

      3. Hive

      4. Subworkflow

    2. Cascading

  9. Management of Apache Hadoop & Family

    1. Apache ZooKeeper

    2. Apache Flume

    3. Apache Ambari

  10. Others

    1. Apache Storm

    2. Apache Spark

    3. Cascalog

Course Length

5 days

Order Course

2995 EUR per participant

