Lessons Learned

Operational lessons from working sessions. Read this when starting work on infrastructure, playbooks, or tooling — these are things that went wrong and cost time.

Ansible: use ignore_errors on non-critical tasks

Non-critical side-effect tasks in playbooks (note registry updates, logging, awsctl reloads) must have ignore_errors: yes. A failure in one of these must not abort the play and block essential follow-on tasks.

What went wrong: In launch_instance.yml, the "Update hosts note" task failed (HTTP 404 — the hosts note did not exist yet). This aborted the play before the known_hosts tasks ran. The newly provisioned host could not be reached by subsequent playbooks because its SSH key was never added to ~/.ssh/known_hosts.

Rule: If a task's failure does not make the host unusable, add ignore_errors: yes. Essential tasks (disk format, package install, SSH config, key registration) must still fail loudly.
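
A minimal sketch of how the rule might look in a playbook such as launch_instance.yml. The variable names (notes_api_url, new_host) and the request shape are hypothetical placeholders, not the actual task definitions; only the placement of ignore_errors is the point.

```yaml
# Sketch only: notes_api_url, new_host and the request body are assumptions.

- name: Update hosts note
  ansible.builtin.uri:
    url: "{{ notes_api_url }}/hosts"
    method: PUT
    body_format: json
    body:
      host: "{{ new_host }}"
  ignore_errors: yes   # non-critical: a 404 here must not abort the play

- name: Scan the new host's SSH key
  ansible.builtin.command: "ssh-keyscan -t ed25519 {{ new_host }}"
  register: keyscan
  changed_when: false
  # essential: no ignore_errors, so a failure stops the play loudly

- name: Add the key to known_hosts
  ansible.builtin.known_hosts:
    name: "{{ new_host }}"
    key: "{{ keyscan.stdout }}"
    state: present
```

With this layout a failed note update is reported in the play output but the known_hosts tasks still run.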

simplejson vs built-in json: NaN handling

If simplejson is installed, requests uses it instead of Python's built-in json. With its default setting (allow_nan=True), the built-in silently serialises float('nan') as the bare token NaN (invalid JSON, but no exception). simplejson is strict and raises ValueError: Out of range float values are not JSON compliant.
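
The lenient and strict behaviours can both be reproduced with the standard library alone, which is handy for catching the problem on a machine where simplejson is not installed. This is an illustrative sketch; the payload is made up, and passing allow_nan=False to the built-in json is used here only as a stand-in for the strict behaviour described above.

```python
import json

# Hypothetical payload with a NaN value, as produced by a missing field.
payload = {"coach": float("nan")}

# Default (allow_nan=True): emits the bare token NaN, which is not valid JSON.
lenient = json.dumps(payload)
print(lenient)

# Strict mode: raises ValueError instead of emitting NaN.
try:
    json.dumps(payload, allow_nan=False)
    strict_error = None
except ValueError as exc:
    strict_error = exc
print(type(strict_error).__name__)
```

Serialising with allow_nan=False in local tests makes the failure show up regardless of which JSON library ends up being used in production.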

Root cause pattern: a pandas DataFrame built from records where some rows are missing a field (e.g. a court-booking email with no coach field) has NaN in that column for those rows. df.to_dict('records') produces dicts where the key exists with value NaN. Calling .get('coach', '') returns NaN, not the default, because the key is present. Those NaNs then fail JSON serialisation.
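
The .get pitfall can be shown without pandas. In the sketch below, the dict is a hand-written stand-in for one record from df.to_dict('records'); the field names and values are hypothetical.

```python
import math

# Stand-in for a record where the 'coach' field was missing in the source row:
# the key is present, but its value is NaN.
row = {"court": "3", "coach": float("nan")}

# The default is only used when the key is absent, so this returns NaN.
value = row.get("coach", "")
print(math.isnan(value))

# For comparison: when the key really is absent, the default is returned.
absent = {"court": "3"}.get("coach", "")
print(repr(absent))
```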

Fix: sanitise before serialising. In the row-building step use: def _safe(v): return '' if isinstance(v, float) and v != v else v. No import needed — NaN is the only float not equal to itself (IEEE 754).

version 1  ·  created 2026-06-05  ·  updated 2026-06-05