Imagine that you built a beautiful Spark application. It looks great on paper, but when you run it on a huge dataset it just crawls. The promising job turns into a time-consuming slog with high resource utilization, and you are left wondering where it all went wrong. Sounds familiar, right? In the Big Data world, performance is not a “nice to have”; it is absolutely critical for an application to be successful. If you are an aspiring or working data engineer, delivering optimized Spark applications is key to making a name for yourself. In this post, let’s explore the various optimization techniques in Spark.
As a data engineer, I constantly deal with the challenges of Big Data processing, and Apache Spark is my go-to tool for handling them. Simply using Spark is not enough; we need to master its optimization techniques to truly unlock its power. Below, I am sharing some of the tested techniques that helped me get better performance out of my Spark jobs.
The Importance of Spark Optimization
Before we jump into the optimization techniques, let’s understand why we need to care about Spark optimization in the first place.
Cost Savings: Optimized Spark jobs consume fewer resources, which translates to lower infrastructure costs, especially when they run in cloud environments.
Faster Processing: In today’s world, time is money; the faster your job runs, the quicker you get insights out of it.
User Experience: Performance matters to end users, and an optimized job goes a long way toward improving their experience.
Scalability: Optimized applications can handle growing data volumes efficiently.
Resource Utilization: Optimizing Spark jobs ensures that the resources you allocate are actually being used well.
Key Optimization Techniques
Here are some of the most effective optimization techniques I have used.
Data Partitioning: Partitioning is a technique that divides your data into smaller, manageable chunks distributed across the cluster. It is like sorting a massive library into sections by book category. It enables parallel processing, since each partition can be processed independently by a different executor. You can create data partitions in Spark using the functions below:
repartition(): Used to create a specific number of partitions, irrespective of the initial number of partitions.
coalesce(): Used mostly to reduce the number of partitions; it is less efficient for increasing them.
bucketBy(): Bucketing is a data organization technique that can help optimize jobs that perform joins. Bucketing groups similar data together into buckets based on hash values.
Imagine you are analyzing website traffic logs. If you partition the data by date, you can process only one day’s data instead of the entire log history when you are analyzing traffic for a specific day (see the sketch after the code below).
from pyspark.sql.functions import col

# repartition: create 50 partitions based on the dt column
df_partitioned = df.repartition(50, col("dt"))

# coalesce: reduce to 25 partitions without a full shuffle
df_coalesced = df_partitioned.coalesce(25)

# bucketBy: bucket by user_id while writing out a table (saveAsTable returns nothing)
df.write.bucketBy(25, "user_id").saveAsTable("customer_table")
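To tie this back to the traffic-log example, here is a minimal sketch, assuming a hypothetical dt date column, an example date value, and an example output path, of writing logs partitioned by date and then reading a single day so Spark can prune the other partitions:

# Write the logs partitioned by date: each day lands in its own directory
df.write.partitionBy("dt").mode("overwrite").parquet("/path/to/traffic_logs")

# Read back only one day's data; Spark skips every other partition
df_day = spark.read.parquet("/path/to/traffic_logs").filter(col("dt") == "2024-01-15")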
Cache and Persist: Caching stores data in memory for faster access, while persist lets you store data in memory, on disk, or both. This technique avoids recomputing the same data multiple times, which is a huge time saver.
cache() - Stores data in memory
persist() - Allows you to specify different storage levels (MEMORY_AND_DISK, DISK_ONLY, etc.)
unpersist() - Removes cached data
Consider an iterative machine learning algorithm where each iteration needs access to the same dataset. Caching the data after the initial load prevents you from reading it from disk every time.
from pyspark import StorageLevel

# caching
df_cached = df.cache()

# persist with an explicit storage level
df_persisted = df.persist(StorageLevel.MEMORY_AND_DISK)

# unpersist: release the cached data when it is no longer needed
df_persisted.unpersist()
Broadcast Variables - This involves sending a read-only variable, typically a small lookup table, to all executors once instead of shipping it with every task. This reduces the amount of data being distributed and can speed up operations such as joins.
Consider you have a product catalog and you want to join your sales transactions with this catalog to get product names. Instead of sending the catalog with each task, you can broadcast it once.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Broadcasting a dictionary
product_catalog = {"1": "Laptop", "2": "Mouse", "3": "Keyboard"}
broadcast_catalog = spark.sparkContext.broadcast(product_catalog)

# Accessing the broadcast variable inside a UDF
@udf(returnType=StringType())
def lookup_product_name(product_id):
    return broadcast_catalog.value.get(product_id, "Unknown")

# Apply the lookup function
df_with_names = df_sales.withColumn("product_name", lookup_product_name(col("product_id")))
Choosing the Right Join Strategy - Spark provides different join strategies: shuffle hash join, broadcast hash join, and sort-merge join. The right choice depends on the size of the datasets and the join conditions.
When one of the two tables is much smaller than the other, a broadcast join can be significantly faster. Otherwise, a sort-merge join is usually the better choice.
# Set the threshold below which Spark automatically broadcasts a table (10 MB here)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")

# Broadcast hint
from pyspark.sql.functions import broadcast
df_join = df_big.join(broadcast(df_small), "id")
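To verify which strategy Spark actually picked for the join above, you can print the physical plan; the exact plan text varies by Spark version, but this is a quick sanity check:

# Look for BroadcastHashJoin or SortMergeJoin in the printed physical plan
df_join.explain()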
Optimizing Data Formats - Columnar formats like Parquet and ORC support compression and are far more efficient for analytical queries, so using them matters. Compressed data also reduces storage costs and I/O.
# Writing data in Parquet format
df.write.format("parquet").mode("overwrite").save("/path/to/parquet/file")

# Reading Parquet format
df_parquet = spark.read.format("parquet").load("/path/to/parquet/file")
Skew Handling - Skew occurs when data is unevenly distributed across partitions. For instance, when most of the data for one key lands in a single partition, that partition becomes a bottleneck and slows down the whole job. Skew has a direct impact on the performance of a Spark application. Use the salted-key technique and increase parallelism to avoid skew; a sketch follows below.
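Here is a minimal salting sketch. The DataFrames df_skewed and df_dim, the join key id, and the number of salt buckets are all hypothetical; the idea is to spread a hot key across several partitions on the large side and replicate the small side to match:

from pyspark.sql.functions import col, concat, lit, rand, explode, array

SALT_BUCKETS = 10  # hypothetical number of salt values

# Add a random salt suffix to the skewed (large) side so one hot key spreads across partitions
df_skewed_salted = df_skewed.withColumn(
    "salted_id",
    concat(col("id").cast("string"), lit("_"), (rand() * SALT_BUCKETS).cast("int").cast("string"))
)

# Replicate each row of the smaller side once per salt value so every salted key finds a match
df_dim_salted = (
    df_dim
    .withColumn("salt", explode(array(*[lit(i) for i in range(SALT_BUCKETS)])))
    .withColumn("salted_id", concat(col("id").cast("string"), lit("_"), col("salt").cast("string")))
)

# Join on the salted key instead of the original key
df_joined = df_skewed_salted.join(df_dim_salted, "salted_id")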
Spark UI: The Spark UI allows you to monitor and debug Spark jobs. It is a great tool for performance tuning and shows various metrics such as task status, resource usage, and the event timeline.
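The live UI runs on the driver (port 4040 by default) while a job is active. If you also want to review finished jobs, a minimal sketch, assuming a log directory of your choice, is to enable event logging so the Spark History Server can replay them:

from pyspark.sql import SparkSession

# Enable event logging so completed jobs can be inspected in the Spark History Server
spark = (
    SparkSession.builder
    .appName("optimized-job")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "/tmp/spark-events")  # example log directory
    .getOrCreate()
)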
Tips for Continuous Optimization
Profiling: Use the Spark UI to visualize your job and find bottlenecks.
Experiment: Use different approaches and measure the differences.
Stay Updated: Spark is an ever-evolving framework. So continue learning about new features and best practices.
Spark UI: Use the Spark UI to see how well the Spark job is performing.
Monitor the logs: Checking the logs helps you identify errors so you can take the right corrective action.
Conclusion
Spark is a powerful tool, but it is only as good as the code you write. By mastering these optimization techniques, you can transform your Spark applications from sluggish to lightning-fast. These skills are highly sought after, and they will give you a real competitive edge. Start practicing these techniques and watch your Spark jobs take off.
What are your favorite techniques for optimizing Spark applications? Have you faced any particularly challenging optimization problems? Let’s share our experiences in the comments.