Using the flock command on Linux to control asynchronous program execution

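Lately I often need to ssh into several computers at once to do jobs that involve a lot of waiting and can run in parallel. For example:

Having the remote machines update their packages at the same time
Sending small files to the remote machines at the same time (most of the time goes to ssh authentication)

However, the steps that follow can only continue once I have confirmed that the jobs above are finished.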

This is how I used to do it:

# earlier work
update_pkg_on_machine_1
update_pkg_on_machine_2
update_pkg_on_machine_3
# ... later work

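This guarantees that every job has finished before the later work starts, but because the jobs run one after another, it is simply slow.

Another possible approach is:

# earlier work
update_pkg_on_machine_1 &
update_pkg_on_machine_2 &
update_pkg_on_machine_3 &
sleep 10
# ... later work

This does run the jobs at the same time, but if they have not finished within 10 seconds, the work that follows may go wrong.

And how many seconds the jobs will actually take is very hard to predict.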

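Using `flock` to manage job state

When I was studying operating systems on my own, I learned about the mutex, and flock is in effect a mutex that can be used from the shell.

The official `flock` documentation

Let's first look at the flock man page on Ubuntu Lucid: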

NAME
       flock - Manage locks from shell scripts

SYNOPSIS
       flock [-sxon] [-w timeout] lockfile [-c] command...

       flock [-sxon] [-w timeout] lockdir [-c] command...

       flock [-sxun] [-w timeout] fd
DESCRIPTION
       This  utility  manages  flock(2) locks from within shell scripts or the
       command line.

       The first and second forms  wraps  the  lock  around  the  executing  a
       command,  in  a  manner  similar  to  su(1)  or  newgrp(1).  It locks a
       specified file or directory, which  is  created  (assuming  appropriate
       permissions), if it does not already exist.

       The  third form is convenient inside shell scripts, and is usually used
       the following manner:

       (
         flock -s 200
         # ... commands executed under lock ...
       ) 200>/var/lock/mylockfile

       The mode used to open the file doesn’t matter to flock; using >  or  >>
       allows  the  lockfile  to  be  created  if  it  does not already exist,
       however, write permission is required; using < requires that  the  file
       already exists but only read permission is required.

       By  default,  if  the  lock cannot be immediately acquired, flock waits
       until the lock is available.

OPTIONS
       -s, --shared
              Obtain a shared lock, sometimes called a read lock.

       -x, -e, --exclusive
              Obtain an exclusive lock, sometimes called a write  lock.   This
              is the default.

       -u, --unlock
              Drop  a  lock.   This  is  usually not required, since a lock is
              automatically dropped when the file is closed.  However, it  may
              be  required  in  special  cases,  for  example  if the enclosed
              command group may have forked a background process which  should
              not be holding the lock.

       -n, --nb, --nonblock
              Fail  (with  an  exit  code  of  1) rather than wait if the lock
              cannot be immediately acquired.

       -w, --wait, --timeout seconds
              Fail (with an exit code of 1) if the  lock  cannot  be  acquired
              within  seconds seconds.  Decimal fractional values are allowed.

       -o, --close
              Close the file descriptor on  which  the  lock  is  held  before
              executing  command.   This  is  useful if command spawns a child
              process which should not be holding the lock.

       -c, --command command
              Pass a single command to the shell with -c.

       -h, --help
              Print a help message.

AUTHOR
       Written by H. Peter Anvin <hpa@zytor.com>.

COPYRIGHT
       Copyright © 2003-2006 H. Peter Anvin.
       This is free software; see the source for copying conditions.  There is
       NO  warranty;  not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
       PURPOSE.

SEE ALSO
       flock(2)

AVAILABILITY
       The flock command is part of the util-linux-ng package and is available
       from ftp://ftp.kernel.org/pub/linux/utils/util-linux-ng/.

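Key points

With flock, a program first tries to take ownership of a lock (usually represented by a file) before it runs, holds that lock while it runs, and releases it only after it finishes.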


For example, suppose we write a shell script and put it in $HOME:

#! /bin/bash
sleep 10
date

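Save it as test.sh and make it executable (chmod 700 test.sh).

Now open two shells and run the following in both at about the same time:

flock /tmp/demo.lock ~/test.sh

What happens?

You should see both shells block; one prints the time after 10 seconds, and the other prints the time 10 seconds after that: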

wush@router:~$ flock /tmp/demo.lock ./test.sh
Sat Jan 4 00:55:24 CST 2014

wush@router:~$ flock /tmp/demo.lock ./test.sh
Sat Jan 4 00:55:34 CST 2014

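Here process A grabs ownership of /tmp/demo.lock first and then runs test.sh. Process B only gets /tmp/demo.lock after A has finished (that is, after A gives the lock back), so B naturally finishes 10 seconds later than A.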
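The same experiment can be reproduced in a single script. This is only a sketch: serialized_demo is a hypothetical helper name, the lock file comes from mktemp rather than /tmp/demo.lock, and a 1-second sleep stands in for test.sh. It uses the shell builtin wait purely to measure when both jobs are done.

```shell
#!/bin/bash
# Hypothetical helper: start two jobs under the same exclusive lock
# and print how many whole seconds they took in total.
serialized_demo() {
    local lock start end
    lock=$(mktemp)                 # throwaway lock file (assumption)
    start=$(date +%s)
    flock "$lock" sleep 1 &        # job A
    flock "$lock" sleep 1 &        # job B, must wait for A (or the reverse)
    wait                           # block until both background jobs exit
    end=$(date +%s)
    rm -f "$lock"
    echo $(( end - start ))
}

serialized_demo
```

Because the two jobs cannot hold the exclusive lock together, the printed value should be at least 2 rather than about 1.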

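`flock` options

Beyond the default behavior, options let us adjust what flock does. The main differences from the default lie in the lock mode and in what happens when the lock on lock_path cannot be obtained.

flock -n lock_path xxx: if the lock cannot be obtained, give up immediately (exit code 1) without running xxx.
flock -s lock_path xxx: take lock_path as a shared lock, which several programs can hold at once. As long as nobody holds it exclusively, everyone can run right away and hold lock_path together.
flock -x lock_path xxx: take lock_path as an exclusive lock, which only one program can hold at a time. This is the default.

Note: a lock_path cannot be held as shared and as exclusive at the same time!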
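The interaction between these options can be checked with a small sketch. lock_modes_demo is a hypothetical helper name and the lock file comes from mktemp; the subshell uses the file-descriptor form from the man page (here with fd 9) to hold a shared lock while other flock calls probe the same file with -n.

```shell
#!/bin/bash
# Hypothetical helper: hold a shared lock on fd 9, then probe the same
# file with non-blocking shared and exclusive requests.
lock_modes_demo() {
    local lock
    lock=$(mktemp)                 # throwaway lock file (assumption)
    (
        flock -s 9                 # this subshell now holds a shared lock
        # A second shared lock can coexist with the first one.
        if flock -n -s "$lock" true; then
            echo "shared: acquired"
        else
            echo "shared: busy"
        fi
        # An exclusive lock conflicts with the shared one, so -n fails.
        if flock -n -x "$lock" true; then
            echo "exclusive: acquired"
        else
            echo "exclusive: busy"
        fi
    ) 9>"$lock"
    rm -f "$lock"
}

lock_modes_demo
```

With the shared lock held, the shared probe should succeed and the exclusive probe should fail immediately instead of waiting.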

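Solving the problem from the introduction

So by combining flock calls, I can run several jobs at once and wait for all of them to finish before continuing with the work that follows:

# earlier work
flock -s lock_path update_pkg_on_machine_1 &
flock -s lock_path update_pkg_on_machine_2 &
flock -s lock_path update_pkg_on_machine_3 &
flock -x lock_path echo "all done!"
# ... later work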

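The key is that flock -x lock_path xxx cannot coexist with the shared holders, because shared and exclusive locks are mutually exclusive. It therefore runs only after the jobs above have all finished (and given back lock_path). Note that this relies on the background jobs having acquired their shared locks before flock -x asks for the exclusive one; if flock -x happens to win the lock first, it runs immediately and the background jobs wait for it instead.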
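The snippet above starts the flock -s commands in the background, so nothing forces them to take their locks before flock -x asks for its own. One way to remove that dependence on timing, sketched here under my own assumptions, is the file-descriptor form from the man page: the parent takes the shared lock on fd 9 before starting any job, the background jobs inherit that descriptor, and the parent closes its own copy before requesting the exclusive lock. fake_update and run_all_then_continue are hypothetical names, fake_update only stands in for update_pkg_on_machine_N, and the lock file comes from mktemp.

```shell
#!/bin/bash
# Hypothetical stand-in for update_pkg_on_machine_N: sleeps, then reports.
fake_update() {
    sleep "$1"
    echo "machine $1 updated"
}

run_all_then_continue() {
    local lock
    lock=$(mktemp)                 # throwaway lock file (assumption)
    exec 9>"$lock"                 # open the lock file on fd 9
    flock -s 9                     # shared lock is held BEFORE any job starts
    fake_update 1 &                # each job inherits fd 9 and keeps the lock alive
    fake_update 2 &
    exec 9>&-                      # parent drops its copy; the jobs still hold the lock
    # A fresh open of the same file conflicts with the shared lock,
    # so this blocks until every job has exited and closed fd 9.
    flock -x "$lock" echo "all done!"
    rm -f "$lock"
}

run_all_then_continue
```

The lock is released only when the last process holding fd 9 exits, so "all done!" should be printed after both "machine ... updated" lines regardless of how quickly the jobs start.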