Apache Spark Key Concepts

Sanjay Parajuli
3 min read · Dec 28, 2022


This blog briefly explains Apache Spark and its key concepts, which are worth knowing.

Apache Spark is a big data processing framework for cluster/distributed computing. As the successor of Hadoop MapReduce, it is capable of data engineering, data science, and machine learning at scale.

Architecture

Spark at its core is a processing engine that can run on a standalone PC and scale up to large clusters using cluster resource managers such as YARN and Mesos.

Spark can read from and write to many different storage services, such as S3, HDFS, MongoDB, relational databases, and more. Through its various libraries, it is used for stream/batch processing, data science, and SQL analytics at scale.

(Image source: www.altexsoft.com)
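As a concrete starting point, here is a minimal PySpark sketch (the file paths are placeholders) that reads data from one storage format and writes it to another:

```python
from pyspark.sql import SparkSession

# Start a local SparkSession; "local[*]" uses every core on this machine.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("spark-concepts-demo")
    .getOrCreate()
)

# Read a CSV file; the path is a placeholder and could equally be an
# s3a:// or hdfs:// URI if the matching connector is on the classpath.
df = spark.read.csv("data/events.csv", header=True, inferSchema=True)

# Write the same data back out in Parquet format.
df.write.mode("overwrite").parquet("output/events.parquet")
```

The later snippets in this post reuse the `spark` session created here.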

Main Operations in Spark

While running Spark tasks, there are mainly two kinds of operations involved: transformations and actions.

1. Transformations

Transformations are functions that produce a new RDD by taking an existing RDD as input, e.g. map, filter, etc. Transformations are lazy in nature, i.e. they are not executed until some action is performed.
Transformations can be further divided into two types: narrow and wide. Transformations that involve movement of data between workers (i.e. a shuffle/exchange) are known as wide transformations, while operations that can be done entirely within a worker are categorized as narrow transformations.
Narrow transformations can be pipelined and executed within each partition independently, whereas wide transformations require a shuffle across the network and introduce a stage boundary (see the sketch below).
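A minimal sketch of both kinds, reusing the `spark` session from the earlier example:

```python
rdd = spark.sparkContext.parallelize(range(10))

# Narrow transformations: each output partition depends on a single
# input partition, so they run within a worker with no data movement.
doubled = rdd.map(lambda x: x * 2)
evens = doubled.filter(lambda x: x % 4 == 0)

# Wide transformation: reduceByKey must shuffle data between workers
# so that equal keys end up in the same partition.
sums = evens.map(lambda x: (x % 3, x)).reduceByKey(lambda a, b: a + b)

# Nothing has executed yet -- transformations are lazy until an action runs.
```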

2. Actions

Actions are functions that produce an actual result by triggering the pending transformations to run, e.g. show(), count(), etc. The value returned by an action is either returned to the driver or written to an external file system; actions are the way data is collected from the executors back to the driver.
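Continuing the sketch above, a couple of actions that force the lazy pipeline to execute:

```python
# count() and collect() trigger the whole lazy pipeline and bring
# results back to the driver.
print(sums.count())
print(sums.collect())

# Alternatively, write the result to an external file system instead of
# collecting it on the driver (the output path is a placeholder).
sums.saveAsTextFile("output/sums")
```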

Spark Concepts

There are a few major concepts in Spark that are essential to know before getting started. Let's look at these key concepts and terminologies.

Cluster

Cluster is a general term in distributed computing. It is simply a collection of machines in which one machine (node) works as the driver (master) and the other machines work as workers. One worker node can host one or more executors.

RDD

Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. Data is held in RDDs during processing. RDDs are immutable (read-only) and are kept in memory where possible, spilling to disk when memory runs short.
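A small sketch, again reusing the `spark` session from earlier:

```python
# Build an RDD from a local collection, split across 2 partitions.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5], numSlices=2)

# RDDs are immutable: map() returns a new RDD rather than modifying this one.
squares = rdd.map(lambda x: x * x)

# cache() asks Spark to keep the RDD in memory across actions (it can
# still be recomputed from its lineage, or spilled, if memory runs short).
squares.cache()
print(squares.collect())  # [1, 4, 9, 16, 25]
```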

Dataset and DataFrame

DataFrame is a distributed collection of data organized into named columns, similar to a table in a database or a dataframe in Python. Oftentimes the terms `DataFrame` and `Dataset` are used interchangeably: a DataFrame is represented by a Dataset of rows (Dataset[Row] in Scala), while a Dataset is strongly typed and structured.
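Note that the typed Dataset API exists only in Scala and Java; in PySpark you work with DataFrames, as in this sketch:

```python
# A small DataFrame with named columns; types are inferred from the data.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 29)],
    schema=["name", "age"],
)

df.printSchema()
df.filter(df.age > 30).show()
```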

Partitioning

Partition is a logical division of data within an RDD, DataFrame, or Dataset. Partitions allow Spark to parallelize the processing of data by dividing it into smaller chunks that can be processed independently.

By default, Spark tries to create as many partitions as there are cores in the cluster, but you can also specify the number of partitions when creating an RDD, DataFrame, or Dataset.
Existing data can be repartitioned using the repartition and coalesce transformations (see the sketch below). The repartition function shuffles the data and creates a new set of partitions, while coalesce tries to combine existing partitions into fewer partitions, avoiding a full shuffle.
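A quick sketch of inspecting and changing partition counts:

```python
# Explicitly request 8 partitions when creating an RDD.
rdd = spark.sparkContext.parallelize(range(100), numSlices=8)
print(rdd.getNumPartitions())  # 8

df = spark.range(1000)

# repartition(n) shuffles data into n new partitions (increase or decrease).
wider = df.repartition(16)

# coalesce(n) merges existing partitions without a full shuffle (decrease only).
narrower = df.coalesce(2)

print(wider.rdd.getNumPartitions(), narrower.rdd.getNumPartitions())  # 16 2
```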

Spark Core

When a Spark program (job) executes, the driver program (mainly the SparkContext) first analyzes it and creates a physical plan, determining the tasks to be performed. It then uses the cluster resource manager (e.g. YARN) to schedule and distribute those tasks. When the executors are done processing, they return the results to the driver.
Inside an executor there are multiple cores (vCPUs), each of which can run one task at a time.

DAG

The DAG (directed acyclic graph) is a representation of the tasks performed in a job; you can inspect it visually in the Spark UI. Every shuffle/exchange operation creates a new stage in the DAG.
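One way to see a stage boundary, reusing the `spark` session from earlier: the physical plan printed by explain() contains an Exchange node wherever a shuffle splits the job into a new stage.

```python
from pyspark.sql import functions as F

df = spark.range(100).withColumn("key", F.col("id") % 10)

# groupBy requires a shuffle, so the physical plan contains an Exchange,
# which corresponds to a stage boundary in the DAG.
df.groupBy("key").count().explain()
```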

Client or Cluster Mode

In client mode, the driver program runs on the node from which spark-submit is run, whereas in cluster mode the driver program runs on one of the worker nodes, so the client that submitted the job can disconnect.
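A sketch of the two modes with spark-submit (app.py and the YARN master are placeholders for your own application and cluster):

```bash
# Client mode: the driver runs on the machine invoking spark-submit.
spark-submit --master yarn --deploy-mode client app.py

# Cluster mode: the driver runs inside the cluster, so the submitting
# machine can disconnect once the application is accepted.
spark-submit --master yarn --deploy-mode cluster app.py
```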

Thanks! Hope you found this blog helpful.
