Advanced Apache Spark

Duration: 4 Days

Method: Instructor led, Hands-on workshops

Price: $2250.00

Course Code: AP3001


Description

The Advanced Spark training course takes a deeper dive into Spark, with a central focus on Spark internals and on debugging and troubleshooting Spark applications. It also covers integration with storage systems such as Cassandra, HBase, and other NoSQL implementations. The course begins with a review of core Apache Spark concepts, followed by lessons on understanding Spark internals for performance. Next, it covers the new features of Spark 2 and how to use them, then turns to clustering, integration, and machine learning with Spark. The course concludes with lessons on advanced Spark SQL and streaming, high-performance Spark applications, and best practices.

Objectives

Upon successful completion of this course, the student will be able to:

  • Apply Spark fundamentals to gain a deeper understanding of Spark internals
  • Identify the operational tweaks that yield maximum performance from Spark
  • Describe how to use GraphX and MLlib for machine learning

Prerequisites

This course is intended for developers who have taken an Introduction to Spark course or who have equivalent experience.

Topics

  • I. Review of core Apache Spark concepts
    • How Spark Works
    • RDD Fundamentals
    • Spark SQL and DataFrames
    • Spark Streaming Concepts
    • Machine Learning Basics
  • II. Understanding Spark Internals for Performance
    • Scheduling, Jobs, and Tasks
    • Data Structures, Data Sets, and Data Lakes
    • Shuffle and Performance
    • Understanding Data Sources and Partitions
    • Reads, Writes, and Performance
  • III. New Features of Spark 2
    • API Stability
    • Core and Spark SQL Changes
    • Changes to Packaging and Operations
  • IV. Working with Spark
    • Debugging/Troubleshooting Spark Applications
    • Developing Data Workflows
    • Automated Spark Builds using Maven
  • V. Clustering with Spark
    • Running a Spark Cluster
    • Cluster Resource Requirements
    • Managing Memory on Executors/Workers
    • Managing Memory/Cores Across a Spark Cluster
    • Performance Tuning
    • Best Practices
  • VI. Spark Integration
    • Implementing Spark on DataStax, Hortonworks, etc.
    • Integration with Cassandra
    • Integration with Kafka
    • Integration with Elasticsearch
    • Integration with other Compatible NoSQL implementations (as desired)
  • VII. Machine Learning with Spark
    • Common Algorithms
    • Commonly Used Algorithms with Scala
    • Machine Learning Libraries: MLlib, H2O
    • Writing Custom Algorithms
  • VIII. Advanced Spark SQL and Spark Streaming
    • Leveraging the Spark 2 API (SparkSession, etc.)
    • Developing with Spark DataFrames
    • Writing Solid Spark Jobs
    • When to Use Spark and When Not to Use Spark
  • IX. High-Performance Spark Applications
    • Performance Tuning Process
    • Performance Tuning Metrics
    • SQL Performance Tuning
    • High-Performance Caching Strategies
    • Cluster Resource Requirements
    • Creating Fault-Tolerance
  • X. Best Practices and Q/A