Use Docker to Submit Spark Jobs
Posted on 29 Apr 2015, tagged Docker, Spark
These days I’m working on analyzing data with Spark. Since our Spark cluster is offline in the office for now, it needs to download data from the log server every hour, analyze it with Spark, and then upload the results to the server. The workflow is a little complex, so I wrote some scripts to do it. In addition, I also wrote a whole automated end-to-end test for it.
The whole thing is a mess of cron job configurations, shell scripts, MongoDB scripts, and Spark jobs. Then I realized I could pack them all into a container, which does all the dirty work without messing up the outside world. Even better, since I am using a CDH cluster, I can download the YARN configuration while building the container.
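For example, Cloudera Manager can serve the client configuration (the `*-site.xml` files) over its REST API, so fetching it can be part of the image build. Below is a minimal sketch in Python; the host, credentials, API version, cluster and service names are all placeholders, not my real setup:

```python
# Sketch: fetch client configs from Cloudera Manager at image build time.
# Host, credentials, cluster/service names and API version are placeholders.
import io
import zipfile

import requests

CM = "http://cm.example.com:7180"
CLUSTER = "cluster"
AUTH = ("admin", "admin")

# service name -> where its client config should live inside the image
SERVICES = {
    "yarn": "/etc/hadoop/conf",
    "hive": "/etc/hive/conf",
}


def fetch_client_config(service, dest):
    """Download the client config zip for one service and unpack it."""
    url = "{0}/api/v10/clusters/{1}/services/{2}/clientConfig".format(
        CM, CLUSTER, service)
    resp = requests.get(url, auth=AUTH)
    resp.raise_for_status()
    zipfile.ZipFile(io.BytesIO(resp.content)).extractall(dest)


if __name__ == "__main__":
    for name, conf_dir in SERVICES.items():
        fetch_client_config(name, conf_dir)
```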
Here is how the container is built up:
- Compile the Spark jobs and run unit tests.
- Install the Hadoop, Spark, and Hive clients, and download configuration files from the CDH cluster.
- Install the cron jobs into the system.
- Generate test data and check that the whole workflow works (a sketch of such a test follows this list).
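The end-to-end test in the last step can be as simple as: generate a fake hour of logs, run the same script the cron job runs, and check that results come out. A rough sketch (the log format and the `hourly_job.py` script are placeholders; the script itself is sketched further down):

```python
# Sketch of the build-time end-to-end check: write a fake hour of logs,
# run the same script the cron job runs, and assert that output appears.
import os
import subprocess
import sys
import tempfile


def generate_fake_logs(path):
    """Write a tiny log file shaped like one hour of real input."""
    with open(path, "w") as f:
        for i in range(100):
            f.write("2015-04-29T10:{0:02d}:00\tuser{1}\tclick\n".format(i % 60, i))


def test_workflow():
    workdir = tempfile.mkdtemp()
    log_file = os.path.join(workdir, "input.log")
    result_dir = os.path.join(workdir, "result")
    generate_fake_logs(log_file)

    # Run the hourly job against the fake input instead of the log server.
    subprocess.check_call([sys.executable, "hourly_job.py", log_file, result_dir])

    assert os.path.isdir(result_dir) and os.listdir(result_dir), "no output produced"


if __name__ == "__main__":
    test_workflow()
    print("workflow OK")
```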
The container will do these things every hour through a cron job (the script is sketched after the list):
- Download data from the log server.
- Send the data to the CDH cluster and submit the Spark job.
- Fetch the result data from the CDH cluster and upload it to the online server.
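Concretely, the script behind the cron job could look something like the sketch below. The host names, HDFS paths, jar, and main class are placeholders; the real work is just shelling out to rsync, hdfs, and spark-submit:

```python
# hourly_job.py -- rough sketch of the hourly workflow. All host names,
# paths, the jar and the main class below are placeholders.
import subprocess
import sys


def run(cmd):
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)


def main(local_log=None, result_dir="/tmp/result"):
    hdfs_in = "/data/logs/current"
    hdfs_out = "/data/result/current"

    # 1. Download the latest hour of logs from the log server
    #    (the end-to-end test passes a local file instead).
    if local_log is None:
        local_log = "/tmp/current.log"
        run(["rsync", "logs@logserver.example.com:/var/log/app/current.log",
             local_log])

    # 2. Push the data to the CDH cluster and submit the Spark job to YARN.
    run(["hdfs", "dfs", "-rm", "-r", "-f", hdfs_in, hdfs_out])
    run(["hdfs", "dfs", "-mkdir", "-p", hdfs_in])
    run(["hdfs", "dfs", "-put", local_log, hdfs_in])
    run(["spark-submit",
         "--master", "yarn-cluster",
         "--class", "com.example.LogAnalysis",
         "/opt/jobs/log-analysis.jar",
         hdfs_in, hdfs_out])

    # 3. Fetch the result and upload it to the online server.
    run(["hdfs", "dfs", "-get", hdfs_out, result_dir])
    run(["rsync", "-r", result_dir,
         "upload@online.example.com:/srv/results/"])


if __name__ == "__main__":
    main(*sys.argv[1:])
```

The crontab entry then just runs it once an hour, e.g. `0 * * * * python /opt/hourly_job.py`.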
You may wonder why I don’t just use a workflow tool from the Hadoop ecosystem, such as Oozie. First of all, Oozie uses XML for its configuration, which I find very complex, and it is also less flexible. With a Docker container, I can build and test everything without touching the outside world. The container is just a flexible component that can be attached to the Spark cluster with a single command. Or think of the Spark cluster as a low-level service, which just provides computing resources and should not care about anything else.