Training Course

Big Data processing with Apache Spark

Intended audience

The course is aimed mainly at software engineers and data analysts who want to learn the basics of Big Data processing, that is, processing that exceeds the capabilities of traditional tools, using the Apache Spark family of tools. It suits both people who are starting to work with Big Data and people with experience in other Big Data systems, such as Apache Hadoop, who want to learn a new technology.

Course objective

Attendees will learn about the problems that arise when analysing Big Data from various sources and how to approach them with the Apache Spark family of tools. The course presents a set of typical Big Data problems and their solutions in Apache Spark, and gives a general overview of the pros and cons of using Apache Spark to solve business problems. It also lets attendees familiarize themselves with the fast-moving field of Big Data processing and with the approach to problem solving that Apache Spark offers.

Course strengths

The course is run by engineers who work with Big Data problems in their everyday practice. As a result, the material often goes beyond common textbook information, which is often fragmented. The content of the training is also continuously updated to follow advances in the field. After the course, graduates will have a broad view of how to solve Big Data problems with the Apache Spark family of tools in their specific business cases.

Requirements

The course requires programming experience in Scala or Python; the preferred training language is Python. Useful additional skills: experience in data processing, functional programming, distributed processing, and *nix systems.

Course parameters

2 × 8 hours of lectures and workshops (including 1 hour of breaks each day).

Course Agenda
  1. Introduction to Big Data
    1. What is Big Data?
    2. History of Big Data
    3. Stakeholders in a Big Data project
    4. Big Data problems
    5. Big Data processing types
      • Batch
      • Stream
  2. Apache Spark
    1. Introduction
    2. History
    3. Spark vs Hadoop
    4. MapReduce paradigm
    5. Resilient Distributed Datasets (RDDs)
    6. Processing in memory vs from disk
    7. Architecture
    8. Operation variants
    9. Administration
  3. Spark Core
    1. Introduction
    2. Java vs Scala vs Python
    3. Connecting to cluster
    4. Dataset distribution
    5. RDD operations
      • Transformations
      • Actions
    6. Shared variables
    7. Execution and testing
    8. Job tuning
  4. Spark SQL
    1. Introduction
    2. Basic operation
    3. Data and schema
    4. Queries
    5. Hive integration
    6. Execution and testing
  5. Spark Streaming and Structured Streaming
    1. Introduction
    2. Basic operation
    3. Streams
      • Input
      • Transformation
      • Output
    4. Execution and monitoring
  6. Spark ML
    1. Introduction
    2. Transformers and estimators
    3. Pipelines
    4. Training models
Course Length

2 days

Order Course

650 EUR (online) per participant

Our clients

Societe Generale
ING
CapGemini
BLStream
Lufthansa Systems

Call Us
(+48) 22 203 56 00
Nowogrodzka 62C, Warsaw, Poland
office@sigdelta.com
UTC / GMT +1