INTRODUCTION TO DATA ANALYSIS WITH SPARK:
1.What is Apache Spark?
2.A unified stack
3.Spark Core
4.Spark SQL
5.Spark Streaming
6.MLlib
7.GraphX
8.Cluster managers
9.Who uses Spark, and for what?
10.Data science tasks
11.Data processing applications
12.A brief history of Spark
13.Spark versions and releases
14.Storage layers for Spark
DOWNLOADING SPARK AND GETTING STARTED:
1.Downloading Spark
2.Introduction to Spark’s Python and Scala shells
3.Introduction to core Spark concepts
4.Standalone applications
5.Initializing a SparkContext
6.Building standalone applications
7.Conclusion
PROGRAMMING WITH RDDs:
1.RDD basics
2.Creating RDDs
3.RDD operations
4.Transformations
5.Actions
6.Lazy evaluation
7.Passing functions to Spark
8.Python
9.Scala
10.Java
11.Common transformations and actions
12.Basic RDDs
13.Converting between RDD types
14.Persistence (caching)
15.Conclusion
WORKING WITH KEY/VALUE PAIRS:
1.Motivations
2.Creating pair RDDs
3.Transformations on pair RDDs
4.Aggregations
5.Grouping data
6.Joins
7.Sorting data
8.Actions available on pair RDDs
9.Data partitioning (advanced)
10.Determining an RDD's partitioner
11.Operations that benefit from partitioning
12.Operations that affect partitioning
13.Example: PageRank
14.Custom partitioner
15.Conclusion
LOADING AND SAVING YOUR DATA:
1.Motivation
2.File formats
3.Text files
4.JSON
5.Comma-separated values and tab-separated values
6.Sequence files
7.Object files
8.Hadoop input and output formats
9.File compression
10.Filesystems
11.Local/“regular” FS
12.Amazon S3
13.HDFS
14.Structured data with Spark SQL
15.Apache Hive
16.JSON
17.Databases
18.Java database connectivity
19.Cassandra
20.HBase
21.Elasticsearch
22.Conclusion
ADVANCED SPARK PROGRAMMING:
1.Introduction
2.Accumulators
3.Accumulators and fault tolerance
4.Custom accumulators
5.Broadcast variables
6.Optimizing broadcasts
7.Working on a per-partition basis
8.Piping to external programs
9.Numeric RDD operations
10.Conclusion
RUNNING ON A CLUSTER:
1.Introduction
2.Spark runtime architecture
3.The driver
4.Executors
5.Cluster manager
6.Launching a program
7.Summary
8.Deploying applications with spark-submit
9.Packaging your code and dependencies
10.A Java Spark application built with Maven
11.A Scala Spark application built with sbt
12.Dependency conflicts
13.Scheduling within and between Spark applications
14.Cluster managers
15.Standalone cluster manager
16.Apache Mesos
17.Amazon EC2
18.Which cluster manager to use?
19.Conclusion
TUNING AND DEBUGGING SPARK:
1.Configuring Spark with SparkConf
2.Components of execution: jobs, tasks, and stages
3.Finding information
4.Spark web UI
5.Driver and executor logs
6.Key performance considerations
7.Level of parallelism
8.Serialization format
9.Memory management
10.Hardware provisioning
11.Conclusion
SPARK SQL:
1.Linking with Spark SQL
2.Using Spark SQL in applications
3.Initializing Spark SQL
4.Basic query example
5.SchemaRDDs
6.Caching
7.Loading and saving data
8.Apache Hive
9.Parquet
10.JSON
11.From RDDs
12.JDBC/ODBC server
13.Working with Beeline
14.Long-lived tables and queries
15.User-defined functions
16.Spark SQL UDFs
17.Hive UDFs
18.Spark SQL performance
19.Conclusion
SPARK STREAMING:
1.A simple example
2.Architecture and abstraction
3.Transformations
4.Stateless transformations
5.Output operations
6.Input sources
7.Core sources
8.Additional sources
9.Multiple sources and cluster sizing
10.24/7 operation
11.Checkpointing
12.Driver fault tolerance
13.Worker fault tolerance
14.Receiver fault tolerance
15.Processing guarantees
16.Streaming UI
17.Performance considerations
18.Batch and window sizes
19.Level of parallelism
20.Garbage collection and memory usage
21.Conclusion
MACHINE LEARNING WITH MLlib:
1.Overview
2.System requirements
3.Machine learning basics
4.Example: spam classification
5.Data types
6.Working with vectors
7.Algorithms
8.Feature extraction
9.Statistics
10.Classification and regression
11.Clustering
12.Collaborative filtering and recommendation
13.Dimensionality reduction
14.Model evaluation
15.Tips and performance considerations
16.Preparing features
17.Configuring algorithms
18.Caching RDDs to reuse
19.Recognizing sparsity
20.Level of parallelism
21.Pipeline API
22.Conclusion