Broadcast in Spark

What is Broadcast join and how to use in Spark.

Broadcast Join will distribute the broadcasted table to all the executers when job starts so that all the executers have the joining table in their local executers so shuffles reduce and decreases the time of execution.

We have to only broadcast small table always.

from pyspark.sql.functions import broadcast

joined_df  =  marks.join(broadcast(location), location["id"] == marks["id"],"inner")

We can broadcast a variable which will be passed to all executers. example we can broadcast Array of values. Broadcast variable are read only variables we cannot modify them.

broadcastVar = sc.broadcast(Array(1, 2, 3))
if we want to access it we have to call it as broadcastVar.value
We can increase the broadcast timeout value if we are facing timeout issues after broadcasting
example : spark.sql.broadcastTimeout=1800


Comments