Operational lessons from working sessions. Read this when starting work on infrastructure, playbooks, or tooling — these are things that went wrong and cost time.
Non-critical side-effect tasks in playbooks (note registry updates, logging, awsctl reloads) must have ignore_errors: yes. A failure in one of these must not abort the play and block essential follow-on tasks.
What went wrong: In launch_instance.yml, the "Update hosts note" task failed (HTTP 404 — the hosts note did not exist yet). This aborted the play before the known_hosts tasks ran. The newly provisioned host could not be reached by subsequent playbooks because its SSH key was never added to ~/.ssh/known_hosts.
Rule: If a task's failure does not make the host unusable, add ignore_errors: yes. Essential tasks (disk format, package install, SSH config, key registration) must still fail loudly.
If simplejson is installed, requests uses it instead of Python's built-in json. With its default settings (allow_nan=True), the built-in silently serialises float('nan') as the bare token NaN (invalid JSON, but no exception). simplejson is strict and raises ValueError: Out of range float values are not JSON compliant.
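The difference between lenient and strict encoding can be reproduced with the standard library alone. The sketch below is illustrative: it uses json.dumps with allow_nan=False as a stand-in for a strict encoder, and the payload dict is a made-up example, not data from the incident.

```python
import json

# Hypothetical payload containing a NaN value.
payload = {"coach": float("nan")}

# Default settings (allow_nan=True): no exception, but the output
# contains the bare token NaN, which is not valid JSON.
lenient = json.dumps(payload)

# allow_nan=False makes the built-in encoder strict: it raises
# ValueError instead of emitting NaN.
try:
    json.dumps(payload, allow_nan=False)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)
```

Running a payload through json.dumps(..., allow_nan=False) before sending it is one way to surface the problem locally, whichever JSON library the HTTP client ends up using.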
Root cause pattern: a pandas DataFrame built from records where some rows are missing a field (e.g. a court-booking email with no coach field) holds NaN in that column for those rows. df.to_dict('records') produces dicts where the key exists with value NaN. Calling .get('coach', '') therefore returns NaN, not the default, because the key is present. Those NaNs then fail JSON serialisation.
Fix: sanitise before serialising. In the row-building step use: def _safe(v): return '' if isinstance(v, float) and v != v else v. No import needed — NaN is the only float not equal to itself (IEEE 754).