Optimizations in a Spark pipeline

We can optimize a Spark pipeline by addressing the following areas:
  • File formats
  • Proper cache handling
  • Persist
  • Using DataFrame
  • Using coalesce and repartition
  • Broadcast variables or tables
  • Choose right partition column
  • Memory allocation

File formats:
  • Make the best use of the available file formats.
  • Avro, a row-based format, suits write-heavy workloads and performs quick writes.
  • Parquet suits read-heavy workloads: thanks to columnar storage, Spark can read only the required columns instead of the whole file.
  • Use Snappy compression, which is Spark's default codec for Parquet; see the sketch below.
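
As an illustration, here is a minimal PySpark sketch (the path /tmp/events_parquet and the sample columns are hypothetical) that writes Parquet with Snappy compression and reads back only the needed columns:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("file-formats").getOrCreate()

    df = spark.createDataFrame(
        [(1, "click", "US"), (2, "purchase", "EU")],
        ["user_id", "event_type", "region"],
    )

    # Write Parquet with Snappy compression (Snappy is Spark's default Parquet codec).
    df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/events_parquet")

    # Columnar storage: Spark scans only the selected columns, not the whole file.
    needed = spark.read.parquet("/tmp/events_parquet").select("user_id", "event_type")
    needed.show()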

Proper cache handling:
  • Cache a DataFrame that is used multiple times in the same Spark execution flow, because caching prevents Spark from recomputing it every time it is accessed.
  • Cache stores the DataFrame in memory and serves it whenever required without recreating it, as in the sketch below.
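
A minimal sketch of caching a DataFrame that feeds more than one action (the toy data is made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-demo").getOrCreate()

    df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "grp"])

    # Both actions below reuse df; without cache() Spark would recompute
    # its lineage once per action.
    df.cache()

    print(df.count())                 # first action materializes the cache
    df.groupBy("grp").count().show()  # second action is served from memory

    df.unpersist()                    # free the cached blocks when done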

Persist:
  • Depending on the cluster configuration, we can choose the best storage level to persist a DataFrame in memory, on disk, or both; see the example below.
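
For example, a sketch using MEMORY_AND_DISK, which keeps what fits in memory and spills the rest to disk, a safe choice when the data may not fit entirely in executor memory:

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName("persist-demo").getOrCreate()

    df = spark.range(0, 1_000_000)

    # Partitions that fit stay in memory; the remainder spills to local disk.
    df.persist(StorageLevel.MEMORY_AND_DISK)

    df.count()      # materialize the persisted data
    df.unpersist()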

Using DataFrame:
  • DataFrames have better optimizations than RDDs. DataFrames benefit from Tungsten (off-heap memory management that avoids garbage-collection overhead) and the Catalyst optimizer (which builds the best execution plan and performs cost-based optimization); see the sketch below.
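
One quick way to see Catalyst at work is explain(), which prints the plans it produces; in this sketch the two chained filters are collapsed into a single predicate in the optimized plan:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "grp"])

    # Catalyst rewrites the logical plan, e.g. merging adjacent filters.
    query = df.filter(F.col("id") > 0).filter(F.col("grp") == "a")
    query.explain(True)  # parsed, analyzed, optimized, and physical plans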

Using coalesce and repartition:
  • Use coalesce when you want to decrease the number of partitions; it merges data down to the required number of partitions without a full shuffle.
  • Use repartition to pull data from multiple nodes and shuffle it across the cluster so it is distributed uniformly; it is the best choice when you want to spread data across all nodes or increase the number of partitions. Both are contrasted in the sketch below.
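
A small sketch contrasting the two (the partition counts are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitions-demo").getOrCreate()

    df = spark.range(0, 1_000_000).repartition(200)  # e.g. after a wide shuffle

    # coalesce: narrow operation that merges partitions without a full shuffle;
    # good for reducing the partition count before writing output files.
    fewer = df.coalesce(10)

    # repartition: full shuffle that redistributes rows evenly; use it to
    # increase parallelism or rebalance skewed partitions.
    more = df.repartition(400)

    print(fewer.rdd.getNumPartitions())  # 10
    print(more.rdd.getNumPartitions())   # 400
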
Broadcast variables or tables:
  • Broadcasting reduces data shuffling and decreases execution time.
  • Small tables should be broadcast so that a copy resides on each node during joins, preventing a shuffle of the large table.
  • We can also distribute a read-only value, list, or map to all nodes as a broadcast variable when it is used in multiple transformations; both techniques appear in the sketch below.
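
A sketch of both techniques, using tiny made-up tables:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()

    facts = spark.createDataFrame([(1, 100), (2, 200)], ["dim_id", "amount"])
    dims = spark.createDataFrame([(1, "US"), (2, "EU")], ["dim_id", "region"])

    # Broadcast join: the small dimension table is copied to every executor,
    # so the fact table is joined locally and never shuffled.
    joined = facts.join(broadcast(dims), "dim_id")
    joined.show()

    # Broadcast variable: ship a read-only map to all nodes once, instead
    # of serializing it with every task.
    lookup = spark.sparkContext.broadcast({1: "US", 2: "EU"})
    mapped = facts.rdd.map(lambda r: (r.dim_id, lookup.value.get(r.dim_id)))
    print(mapped.collect())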

Choose right partition column:
  • A good partition column helps distribute the data evenly across all partitions.
  • If we don't use a proper partition key, it can lead to data skew; see the example below.
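
For example, partitioning output by a date column whose values are bounded and evenly spread (the path and columns here are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-col-demo").getOrCreate()

    df = spark.createDataFrame(
        [("2024-01-01", "US", 10), ("2024-01-01", "EU", 20), ("2024-01-02", "US", 5)],
        ["event_date", "region", "amount"],
    )

    # event_date keeps partitions balanced; a skewed key (say, one dominant
    # customer id) would pile most of the data into a single partition.
    df.write.mode("overwrite").partitionBy("event_date").parquet("/tmp/events_by_date")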

Memory allocation:
  • Based on the cluster configuration and the data size, we have to choose the right number of executors, memory, and CPU cores.
  • If you hit an OOM (out-of-memory) issue, first look at where you can optimize the code; don't blindly increase memory and executors. A configuration sketch follows.
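
As a sketch, resources can be set when building the session (how spark.executor.instances takes effect depends on the cluster manager); the numbers below are illustrative only and must be tuned to the cluster:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("sizing-demo")
        # Illustrative sizing; the right values depend on node size and data volume.
        .config("spark.executor.instances", "10")
        .config("spark.executor.memory", "8g")
        .config("spark.executor.cores", "4")
        .getOrCreate()
    )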
