Course audience
The course is aimed primarily at software engineers and data analysts who want to learn the basics of Big Data processing, that is, processing at a scale beyond the capabilities of traditional tools, using the Apache Spark family of tools. It is suitable both for people who want to start working with Big Data and for people with prior experience in other Big Data systems, such as Apache Hadoop, who want to learn a new technology.
Course objective
Attendees will learn about the problems that arise when analyzing Big Data from various sources and how to solve them with the Apache Spark family of tools. The course presents a set of typical Big Data problems together with their solutions in Apache Spark. Attendees will also gain an overview of the pros and cons of applying Apache Spark to their own business problems, and will become familiar with the fast-moving field of Big Data processing and the novel approach to problem solving that Apache Spark represents.
Course strengths
The course is taught by engineers who deal with Big Data problems in their everyday practice, so the material goes beyond the often fragmented information found in textbooks. The content of the training is continuously updated to follow advances in the field. After the course, graduates will have a broad view of how to solve Big Data problems with the Apache Spark family of tools for their specific business cases.
Requirements
The course requires programming experience in Scala or Python; the preferred training language is Python. Useful skills include experience with data processing, functional programming, distributed processing, and *nix systems.
Course parameters
2 × 8 hours of lectures and hands-on workshops (including 1 hour of breaks each day).
Course agenda
- Introduction to Big Data
  - What is Big Data?
  - History of Big Data
  - Stakeholders in a Big Data project
  - Big Data problems
  - Big Data processing types
    - Batch
    - Stream
- Apache Spark
  - Introduction
  - History
  - Spark vs Hadoop
  - MapReduce paradigm
  - Resilient Distributed Datasets (RDDs)
  - In-memory vs on-disk processing
  - Architecture
  - Deployment modes
  - Administration
- Spark Core (see sketch 1 after the agenda)
  - Introduction
  - Java vs Scala vs Python
  - Connecting to a cluster
  - Dataset distribution
  - RDD operations
    - Transformations
    - Actions
  - Shared variables
  - Execution and testing
  - Job tuning
- Spark SQL (see sketch 2 after the agenda)
  - Introduction
  - Basic operation
  - Data and schema
  - Queries
  - Hive integration
  - Execution and testing
- Spark Streaming and Structured Streaming (see sketch 3 after the agenda)
  - Introduction
  - Basic operation
  - Streams
    - Input
    - Transformation
    - Output
  - Execution and monitoring
- Spark ML (see sketch 4 after the agenda)
  - Introduction
  - Transformers and estimators
  - Pipelines
  - Training models
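Example code sketches

The sketches below give a flavor of the code written during the hands-on modules. They are minimal illustrations in Python, the preferred training language, and assume only a local PySpark installation; the data, application names, and column names are made up for the examples and are not course material.

Sketch 1 (Spark Core): a word count built from RDD transformations, which are lazy, and a final action, which triggers the actual computation.

```python
from pyspark import SparkConf, SparkContext

# Run locally on all cores; the application name is arbitrary.
conf = SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
sc = SparkContext(conf=conf)

lines = sc.parallelize([
    "big data with spark",
    "spark processes big data in memory",
])

counts = (
    lines.flatMap(lambda line: line.split())  # transformation: lines -> words
         .map(lambda word: (word, 1))         # transformation: word -> (word, 1)
         .reduceByKey(lambda a, b: a + b)     # transformation: sum counts per key
)

# Nothing has executed yet; collect() is an action and triggers the job.
print(counts.collect())

sc.stop()
```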
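Sketch 2 (Spark SQL): the same query expressed twice, once through the DataFrame API and once as SQL over a temporary view; the schema is declared explicitly as a DDL string.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").master("local[*]").getOrCreate()

# Made-up sample data with an explicit schema.
people = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    schema="name STRING, age INT",
)
people.printSchema()

# The DataFrame API...
people.where(people.age > 30).select("name").show()

# ...and the equivalent SQL over a temporary view.
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```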
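Sketch 3 (Structured Streaming): a windowed count over Spark's built-in rate source, which generates rows locally, so the example needs no external input; the result of each micro-batch goes to the console sink.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("streaming-sketch").master("local[*]").getOrCreate()

# Input: the "rate" source emits (timestamp, value) rows at a fixed pace.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Transformation: count rows per 10-second event-time window.
counts = stream.groupBy(window(stream["timestamp"], "10 seconds")).count()

# Output: print the full updated result table after every micro-batch.
query = counts.writeStream.outputMode("complete").format("console").start()

query.awaitTermination(30)  # let the stream run for about 30 seconds
query.stop()
spark.stop()
```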
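Sketch 4 (Spark ML): a Pipeline chaining a transformer (VectorAssembler) with an estimator (LogisticRegression); fitting the pipeline trains the model, and the fitted model is itself a transformer.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-sketch").master("local[*]").getOrCreate()

# Made-up training data: two numeric features and a binary label.
train = spark.createDataFrame(
    [(0.0, 1.0, 0.0), (1.0, 0.0, 1.0), (0.9, 0.2, 1.0), (0.1, 0.8, 0.0)],
    schema="f1 DOUBLE, f2 DOUBLE, label DOUBLE",
)

# Transformer: packs the feature columns into the single vector column Spark ML expects.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
# Estimator: calling fit() produces a fitted LogisticRegressionModel.
lr = LogisticRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=[assembler, lr]).fit(train)  # trains the whole pipeline
model.transform(train).select("label", "prediction").show()

spark.stop()
```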