04 — The lock daemon: the hardest problem

See: story index

Why a child process?

WebDAV LOCK/UNLOCK requires live libcurl calls. libcurl is not async-signal-safe. write(2) is.

pwsafe's UnlockFile is called from normal code paths, atexit handlers and C++ destructors (curl may already be torn down), and signal handlers. Any abnormal exit with the naïve approach could leave the server lock held indefinitely — the next open of the same database by any client returns HTTP 423 Locked.

The fork + socketpair design

On the first pws_lockd_acquire() call, the transport layer forks a child process ('lock daemon') connected to the parent via socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, ...).

The child runs a simple command loop:

for ever:
    read opcode (1 byte)
    if EOF:  unlock all held locks, _exit(0)   ← parent crashed
    if QUIT: unlock all, send OK, _exit(0)     ← clean shutdown
    read url (length-prefixed)
    dispatch to t->lock / t->unlock / t->store
    send response (status byte + errno uint32)

The parent's pws_lockd_acquire/release/store functions only do write(2) and read(2) — both async-signal-safe per POSIX.

The binary IPC protocol

The first implementation used a newline-delimited text protocol. The o3 security audit immediately flagged this as a critical injection vulnerability: a URL containing \n could inject additional commands.

The fix was length-prefixed binary frames:

Parent → Child:
  [uint8_t opcode]
  [uint32_t url_len, little-endian]
  [url_len bytes]
  For STORE only: [uint32_t path_len LE] [path_len bytes]
  QUIT: opcode byte only, no payload

Child → Parent:
  [uint8_t status: 0=OK 1=ERR]
  [uint32_t errno, little-endian]

A URL can now contain any byte value — including \n, space, NUL — without affecting framing. recv_string rejects fields longer than MAX_IPC_URL (8 192 bytes) or MAX_IPC_PATH (4 096 bytes) before allocating.

The state-divergence problem

s_lock_tokens (in the WebDAV plugin) is populated when the child calls t->lock(). The parent's copy of this map remains permanently empty. If the parent called t->store() directly, it checked its own empty map, omitted the If: (<token>) header, and got HTTP 423 back. The in-memory changes were silently lost.

Fix: FClose tests pws_has_lock(url). If true, it calls pws_lockd_store(cache, url) — which sends a CMD_STORE frame to the child. The child's t->store() sees the correct token and the PUT succeeds.

Shutdown paths

• Normal UnlockFile — parent sends CMD_UNLOCK; child calls t->unlock() immediately

• Clean exit — LockdGuard static destructor → pws_lockd_shutdown() → CMD_QUIT → child unlocks all and exits → parent waitpid reaps child

• SIGKILL / crash — parent fd closed by kernel; child reads EOF → child_unlock_all()_exit(0)

• Child killed — parent gets EIO on next read(); no recovery; user must reopen the app

The SOCK_CLOEXEC lesson

The initial socketpair call used SOCK_STREAM without SOCK_CLOEXEC. If pwsafe's GUI spawned any child process — a browser for help, a file manager — that child inherited the parent's socket end. The lock daemon then saw the socket stay open even after pwsafe exited, so it never got EOF and never released the server lock.

Fix: SOCK_STREAM | SOCK_CLOEXEC. Found in the self-review after the external audits.

fd cleanup in the child

On fork, the child inherits all of the parent's open file descriptors — X11 sockets, wxWidgets pipes, GTK internals. The child enumerates /proc/self/fd and closes everything >= 3 except the socketpair fd. Signal disposition is also reset (SIGTERM, SIGINT, SIGHUP → SIG_DFL; SIGPIPE → SIG_IGN).

version1
created2026-02-27