The Proper Way to Use Spark Checkpoint

Posted on 03 Nov 2015, tagged spark

These days I’m using Spark Streaming to process real-time data. I’m using updateStateByKey, so I need to enable checkpointing, which is Spark Streaming’s fault tolerance mechanism. A checkpoint saves the DAG and the RDDs, so when you restart the Spark application after a failure, it continues computing from where it left off.

But there is a problem with checkpointing: you cannot load the checkpointed data once you change the class structure of your code, so the state in updateStateByKey is lost. This is a pretty big limitation. Another option is to save and load the data ourselves, but that makes checkpointing totally useless and also breaks the fault tolerance. What about using both? Then the data may be loaded twice when the Spark cluster automatically restarts the application after a failure. So I asked this question on the Spark user list, and somebody kindly gave me a solution: use updateStateByKey with the initialRDD parameter.

The answer was a little brief, so I will explain it here. The idea is to use both checkpointing and our own data storage mechanism, but to load our own data as the initialRDD of updateStateByKey. Then in both situations the data is neither lost nor duplicated:

  1. When we change the code and redeploy the Spark application, we shut down the old application gracefully and clean up the checkpoint data, so the only data loaded is the data we saved ourselves.

  2. When the Spark application fails and is restarted, it loads the data from the checkpoint. Because the checkpoint also saves the DAG and how far the computation has progressed, it will not load our own data as the initialRDD again, so the only data loaded is the checkpointed data.
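
Here is a minimal sketch of how the two cases above could be wired together, written with PySpark's DStream API. It is not the code from my application: the paths, the socket source, the "word,count" format of the saved state, and all function names are assumptions made up for illustration. The key point is that the saved state is only read inside the setup function, which StreamingContext.getOrCreate calls only when it finds no checkpoint to recover from.

```python
CHECKPOINT_DIR = "/tmp/spark-checkpoint"  # hypothetical checkpoint location
SAVED_STATE_PATH = "/tmp/saved-state"     # hypothetical; text lines like "spark,3"


def update_count(new_values, last_count):
    """State update: add this batch's values to the running count."""
    return sum(new_values) + (last_count or 0)


def parse_state_line(line):
    """Parse one saved line such as 'spark,3' into ('spark', 3)."""
    key, count = line.rsplit(",", 1)
    return key, int(count)


def create_context():
    """Build a fresh streaming context (used only when no checkpoint exists)."""
    # Imported here so the pure helpers above work without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="StatefulCount")
    ssc = StreamingContext(sc, 10)  # 10-second batches
    ssc.checkpoint(CHECKPOINT_DIR)

    # Case 1: our own saved data becomes the starting state.
    # Assumes the saved state already exists at SAVED_STATE_PATH.
    initial_state = sc.textFile(SAVED_STATE_PATH).map(parse_state_line)

    words = ssc.socketTextStream("localhost", 9999).flatMap(lambda l: l.split())
    counts = words.map(lambda w: (w, 1)).updateStateByKey(
        update_count, initialRDD=initial_state)
    counts.pprint()
    return ssc


def main():
    from pyspark.streaming import StreamingContext

    # Case 2: if a checkpoint exists, the context is rebuilt from it and
    # create_context() is not called, so the saved state is not loaded again.
    ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
    ssc.start()
    ssc.awaitTermination()
```

Calling main() starts the job. On a redeploy with changed code, the checkpoint directory has to be removed first (after a graceful shutdown), otherwise getOrCreate will try to recover from the old checkpoint instead of calling create_context().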