Date: 2026-02-28. Task: take editorial feedback on a pwsafe LinkedIn article (loaded from a PDF), send it to Envoy, and have Envoy revise the draft stored in notes and save an updated version. The session uncovered 7 distinct issues across code correctness, notes maintenance, model behaviour, infrastructure, and design.
1. Message-ID angle bracket mismatch — code bug, fixed (commit 7f18c41)
2. Notes linkage gap — notes maintenance failure, fixed manually
3. Gathering phase failed to fall back to email search — model quality / state note weakness
4. sendmail IPv6 DNS failure — infrastructure issue, worked around with IMAP injection
5. LLM hallucinated WebDAV upload — fundamental hallucination, no fix yet
6. Note write failure does not prevent success reply — design gap, partially improved
7. JSON parse error on first attempt, silent on second — flakiness, improved logging
Status: Fixed. Commit 7f18c41.
When the orchestrator fetches emails by Message-ID to check they were gathered successfully, the comparison always failed silently. The LLM stores Message-IDs without angle brackets (e.g. abc@host) while the email parser includes them (e.g. <abc@host>). The set membership test ef.message_id in retrieved_ids was always False.
RFC 2822 requires Message-IDs to be enclosed in angle brackets. The email parser (imaplib) faithfully preserves them. LLMs, however, routinely strip the brackets when generating or referring to Message-IDs. Neither side normalised before comparison.
Every email fetch was counted as NOT FOUND. This caused: (a) duplicate fetch attempts, (b) premature marking of fetches as permanently failed after the retry limit, (c) legitimate gathered emails being ignored. The task appeared to proceed normally but on false data.
Added a normalisation helper applied to both sides of the comparison:
def _mid(s): return s.strip().strip('<>') — applied to both ef.message_id and each element of retrieved_ids.
Any identifier passed through both an email parser and an LLM must be normalised before comparison. LLMs routinely alter punctuation that email standards treat as significant. Apply strip('<>') defensively whenever comparing Message-IDs. Extend this principle to any field where standards and LLM output conventions diverge.
Status: Fixed manually (notes updated).
The pwsafe article draft was in notes at pwsafe/writeup/draft-v1 but was not discoverable by Envoy's gathering phase. The parent note pwsafe had no reference to the writeup subtree, and CONTENTS did not link to pwsafe/writeup/index.
The writeup notes were created in a previous session as a standalone bundle. The creating agent did not update the parent pwsafe index note to link to the new subtree, and did not add an entry in CONTENTS. The notes existed in isolation.
Gathering correctly followed CONTENTS → pwsafe but could not navigate to the article draft. The LLM guessed at wrong key paths (projects/pwsafe/*) instead of finding the correct ones, and the task proceeded without the article loaded.
Updated pwsafe note (v7) to add an Articles/Writeup section linking to pwsafe/writeup/index. Updated pwsafe/writeup/index (v2) to reference draft-v1 and include editorial feedback notes. Added pwsafe/writeup/index to the children list of the pwsafe note.
Creating a new note subtree is only complete when the parent index note links to it and CONTENTS is updated. This is the notes equivalent of updating a directory listing. The rule: never leave a note orphaned — always link up. This should be a checklist item in any notes-creating workflow.
Status: No code fix. State note instructions may need strengthening.
During the first revision run, the gathering phase (using gpt-4.1-mini) followed the notes-first rule and attempted to load the article from notes. After notes lookup failed twice and was marked permanently failed, the model transitioned directly to working rather than trying search_emails in the Sent folder — which would have found relevant emails from the original article-writing task.
Two compounding factors: (a) the notes linkage gap (Issue 2) meant the notes lookup genuinely failed; (b) the mini model did not complete the full notes-first protocol specified in envoy/states/gathering, which says 'Only if not found in notes: search emails'. The model skipped the email fallback step.
Gathering completed without the article loaded. Working phase proceeded on partial context, then hallucinated completion (see Issue 5).
None yet. The primary fix was Issue 2 (notes linkage). With correct linkage, gathering can now follow CONTENTS → pwsafe → pwsafe/writeup/index → draft-v1 in one traversal.
The mini model (gpt-4.1-mini) used in gathering is less reliable at following multi-step conditional instructions than the full model. State note instructions for gathering should be more imperative and less conditional: instead of 'only if not found in notes', specify the full fallback sequence explicitly as numbered steps. Consider adding a gathering checklist: 'If creative/writing task AND no draft found in notes: ALWAYS search emails in Sent before proceeding.'
Status: Worked around with IMAP injection. Root cause unresolved.
Test emails sent via sendmail failed to deliver. Error: mx-caprica.zoneedit.com type=AAAA: Host not found. sendmail attempted an IPv6 MX lookup before IPv4, and the IPv6 lookup failed, aborting delivery.
sendmail's DNS resolver attempted AAAA (IPv6) records first. The DNS server (or the MX host's DNS) returned NXDOMAIN for the AAAA query, causing sendmail to report failure rather than falling back to A (IPv4) records.
Test emails sent via send_test_email.py appeared to succeed (exit code 0) but were silently queued rather than delivered. The orchestrator found 0 UNSEEN emails on its next run.
Injected emails directly into IMAP INBOX using IMAPClient.append() via a purpose-built script inject_email.py. This bypasses sendmail entirely and delivers directly into the test mailbox.
For testing Envoy, direct IMAP injection is more reliable than sendmail. inject_email.py should be a first-class test tool, not an emergency workaround. The sendmail IPv6 issue should be investigated separately (likely a sendmail.cf or resolv.conf configuration issue).
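A sketch of the injection approach, assuming the third-party imapclient package (pip install imapclient). The message-building helper and the host/credential parameters are illustrative; the key call is IMAPClient.append(), which writes the raw message straight into the mailbox with no SMTP involvement:

```python
from email.message import EmailMessage
from email.utils import formatdate, make_msgid

def build_test_email(sender: str, recipient: str, subject: str, body: str) -> bytes:
    """Construct an RFC 2822 message suitable for IMAP APPEND."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg["Date"] = formatdate(localtime=True)
    msg["Message-ID"] = make_msgid()
    msg.set_content(body)
    return msg.as_bytes()

def inject(host: str, user: str, password: str, raw_message: bytes,
           folder: str = "INBOX") -> None:
    """Append the message directly into the mailbox, bypassing sendmail/SMTP."""
    from imapclient import IMAPClient  # third-party: pip install imapclient
    with IMAPClient(host) as client:
        client.login(user, password)
        client.append(folder, raw_message)
```

Because APPEND delivers the message as already-stored mail, the next orchestrator run sees it as a normal UNSEEN message regardless of DNS or MTA configuration.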
Status: Design limitation. No code fix. Requires architectural change to fully address.
In the first revision run, the working phase model (gpt-5.2-chat-latest) returned status=complete and claimed to have 'published the revised article to WebDAV at /AI/pwsafe-article.html'. No WebDAV action exists in Envoy's schema. No write_notes action was taken either. The model fabricated a plausible-sounding completion event.
The model had been given a rich description of the task (including a mention of the AI WebDAV folder from context) and produced a completion narrative that included actions it cannot actually perform. This is standard LLM hallucination: the model generates a plausible completion rather than acknowledging it cannot perform the requested action.
The reply email claimed success. The user received false confirmation. No note was written, no WebDAV file was created. The task was effectively lost — the orchestrator moved the email to Done and stopped.
None yet. Partial mitigation: the orchestrator logs all actions actually dispatched. A human reviewing the log can identify the discrepancy between the model's claim and what was actually executed.
The orchestrator should validate that status=complete is only permitted when at least one meaningful action was dispatched in the working phase (write_notes, send_email with substantive content, or similar). If status=complete but no actions were executed, force another iteration or escalate. This would detect the hallucination pattern automatically.
LLMs will claim to have performed actions that are outside their schema when the task narrative implies those actions are expected. The schema guards input (what actions are structurally possible) but does not prevent the model from hallucinating execution of actions not in the schema. Orchestrator-level verification — 'was something actually done?' — is necessary as a separate check.
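One way to sketch the proposed guard. The action names and the dispatched_actions record are assumptions about the orchestrator's internals; the point is that the check runs against what was actually executed, not against the model's narrative:

```python
# Assumption: these action types count as "real work" for a completion claim.
MEANINGFUL_ACTIONS = {"write_notes", "send_email"}

def allow_terminal_status(status: str, dispatched_actions: list[str]) -> bool:
    """Reject status=complete when no meaningful action was actually dispatched.

    `dispatched_actions` is the orchestrator's own execution record; a model
    that hallucinates (e.g. a nonexistent WebDAV upload) contributes nothing
    to it, so the hallucination is caught mechanically.
    """
    if status != "complete":
        return True  # non-terminal statuses are not gated
    return any(a in MEANINGFUL_ACTIONS for a in dispatched_actions)
```

On a False result the orchestrator would force another iteration (with the rejection in context) or escalate, rather than moving the email to Done.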
Status: Partially improved. Design gap remains.
When write_notes fails (e.g. invalid JSON from the model), dispatch_immediate_actions() logs the failure to gather_results but continues executing the remaining actions in the same call. The model's send_email action then proceeds, sending a reply that claims the note was written successfully.
Actions are dispatched sequentially in a single pass. Failures are recorded as strings in gather_results for the model's next iteration, but in the same call the send_email action fires before any feedback loop can occur. When status=complete, there is no next iteration.
User receives a success reply email when the primary deliverable (the revised note) was not actually stored. The task appears complete but the output is lost.
Improved the error message in _parse_note_value to show the actual JSON parse error position and surrounding context, rather than just the first 200 characters. This makes failures much easier to diagnose.
When status=complete and a write_notes action fails, the orchestrator should suppress the success email and instead either: (a) force another iteration with the failure in context (so the model can retry), or (b) send a failure notification rather than a false success. This requires distinguishing 'terminal with all actions succeeded' from 'terminal with some actions failed'.
Action execution and reply composition happen in the same LLM response and cannot be separated by the current architecture. The orchestrator needs a post-execution check: 'did the critical actions actually succeed?' before allowing a terminal status email to be sent.
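The post-execution check could look roughly like this. The (action_name, succeeded) result record is an assumption about how the orchestrator would track outcomes; the set of critical actions is illustrative:

```python
CRITICAL_ACTIONS = {"write_notes"}  # assumption: actions whose failure must block a success reply

def terminal_email_allowed(status: str, action_results: list[tuple[str, bool]]) -> bool:
    """Distinguish 'terminal with all critical actions succeeded' from
    'terminal with some critical action failed'.

    `action_results` is the post-execution record of (action_name, succeeded).
    Returns False when a success reply must be suppressed in favour of a
    retry iteration or a failure notification.
    """
    if status != "complete":
        return True
    # all() over an empty sequence is True: no critical action attempted
    # means nothing here blocks the reply (the no-actions case is a separate guard).
    return all(ok for name, ok in action_results if name in CRITICAL_ACTIONS)
```

This check is deliberately separate from the complete-with-no-actions guard: one asks "was anything done?", the other asks "did what was done actually succeed?".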
Status: Improved logging. Root cause unresolved.
The first attempt to write pwsafe/writeup/draft-v2 failed with a JSON parse error despite the first 200 characters being syntactically valid. An identical re-run 90 seconds later succeeded. The failure was non-deterministic.
Unknown. Likely candidates: (a) model non-determinism producing different JSON on each call (one valid, one not); (b) an unescaped character (quote, newline) somewhere deep in the long article content; (c) token truncation producing incomplete JSON. The old error message showed only the first 200 chars, which appeared fine, making diagnosis impossible.
Improved _parse_note_value to report the actual json.JSONDecodeError position, message, and context window (80 chars either side of the error position). Future failures will show exactly where the JSON breaks.
Error messages must include enough context to diagnose the problem without re-running. 'Not valid JSON' with 200 chars is insufficient for a document that may be 8,000+ chars. Always report the parse error position and surrounding context. For long model-generated JSON, consider a schema validation step post-parse.
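A sketch of the improved error reporting, assuming _parse_note_value wraps json.loads. json.JSONDecodeError exposes pos, lineno, colno, and msg, which is everything needed for a one-pass diagnosis:

```python
import json

def parse_with_context(text: str, window: int = 80):
    """Parse JSON; on failure, raise an error that pinpoints the break.

    Reports the decoder's position, line/column, message, and a context
    window around the failure point, so an 8,000+ char document does not
    need to be re-run to find the bad character.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        start = max(0, e.pos - window)
        snippet = text[start:e.pos + window]
        raise ValueError(
            f"JSON parse error at pos {e.pos} "
            f"(line {e.lineno}, col {e.colno}): {e.msg}\n"
            f"...{snippet}..."
        ) from e
```

For long model-generated documents, a schema validation pass after a successful parse would additionally catch structurally valid but semantically wrong output.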
The session highlighted the lack of dedicated debugging tools for Envoy. What was needed and what was available:
Needed: IMAP folder inspector. Available via diag_list_emails.py — adequate but project-specific, not a general tool. Should be elevated to a first-class utility.
Needed: IMAP email injector. Not available. Created inject_email.py as an emergency workaround. Should be a permanent, documented test tool.
Needed: LLM response viewer. No tool exists. When the model produces invalid output (bad JSON, hallucinated actions), there is no way to see what it actually returned without adding debug logging to the orchestrator.
Needed: Note tree explorer. The only way to discover note linkage gaps is to manually traverse note by note. A tool that shows a note's subtree and flags orphaned notes would have caught Issue 2 immediately.
Needed: Action execution log viewer. The orchestrator logs actions but the log can be hard to scan. A tool that summarises 'what actions were dispatched in run X' would speed up analysis.
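The note tree explorer reduces to a reachability check. This hypothetical sketch assumes the note store can be read as a mapping from each note key to the keys it links to (the real storage format may differ); anything unreachable from the root indices is orphaned:

```python
def find_orphans(notes: dict[str, list[str]], roots: tuple[str, ...] = ("CONTENTS",)) -> set[str]:
    """Return note keys unreachable from the root indices.

    `notes` maps each note key to the keys it links to. An orphaned note
    is one the gathering phase can never discover by traversal.
    """
    seen: set[str] = set()
    stack = list(roots)
    while stack:
        key = stack.pop()
        if key in seen or key not in notes:
            continue
        seen.add(key)
        stack.extend(notes[key])  # depth-first walk of outgoing links
    return set(notes) - seen
```

Run against the state in Issue 2 (pwsafe not linking to its writeup subtree), this flags the whole pwsafe/writeup/* bundle immediately.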
1. Action success gate before terminal status — if any write_notes fails and status=complete, block the reply and force retry or send failure notification
2. Complete-with-no-actions guard — if status=complete but no write_notes/meaningful send_email was dispatched in working phase, reject and force another iteration
3. Gathering fallback sequence as explicit steps — state note for gathering should list numbered fallback steps, not conditional prose
4. Notes-first protocol enforcement — gathering state note should require that email search is always attempted for tasks involving creative output before transitioning to working
5. Parent note update checklist — state note for working should remind the model: 'if you create a new note subtree, update the parent index and CONTENTS'
6. Message-ID normalisation in all comparisons — document this as a convention in the IMAP client module
What went well:
— Systematic root cause analysis: traced each failure through logs, note contents, and code
— Fixed the code bug immediately and committed it
— Updated notes linkage before re-running, so the second run had correct data
— Used diag_list_emails.py to inspect IMAP state directly
What could be improved:
— Took too long to identify the sendmail failure: always check IMAP delivery before blaming orchestrator logic
— The find / search was too broad and slow; should have used targeted paths
— Error logging should have been improved before re-running, not after
— Should have run diag_list_emails.py first after any 'no emails found' result, before resorting to manual Python scripts
Workflow suggestion for future debugging:
1. Run fails or produces wrong output
2. Run diag_list_emails.py to confirm IMAP state
3. Check orchestrator log for action dispatch entries
4. Read the reply email that was sent (check Sent folder)
5. Only then look at code if a code bug is suspected
Concrete improvements from this session — code and notes changes