Ask - Architecture and Data Flow

High-Level Flow

1. User invokes ask_with_files_structured.py with files, question, and optional system prompt/schema

2. Script extracts text from all input files using format-specific parsers

3. Prepares formatted message with file contents, system prompt, and question

4. Sends to OpenAI API (via openai Python library)

5. If schema provided, OpenAI enforces it server-side via response_format (not validated locally)

6. Outputs response to stdout or file

Component: File Processing

Each supported file type has a dedicated extraction function:

• extract_text_from_txt() — Plain text, code, log, CSV, JSON, XML

• extract_text_from_docx() — Microsoft Word documents (via python-docx)

• extract_text_from_pdf() — PDF files (tries pdfplumber, pypdf, PyPDF2 in order)

• extract_text_from_odt() — OpenDocument text (via odfpy)

• extract_text_from_html() — HTML parsing with tag/script/style removal (via BeautifulSoup)

• extract_text_from_markdown() — Markdown → HTML → text conversion (via markdown + BeautifulSoup)

Non-text files (images, audio) return None; upstream provides metadata placeholder to LLM. Duplicate filenames are rejected (ValueError) since LLM references by filename.

Component: OpenAI Integration

API Key retrieval (in priority order):

1. OPENAI_API_KEY environment variable

2. ~/.env_data GDBM file via gdata library (keys: 'api_key' or 'OPENAI_API_KEY')

Note: gdata is a custom GDBM wrapper library (from the gdata-server project, not Google gdata).

Library: openai Python package

Models: Configurable via -m flag, default gpt-4o-mini

Request format: Chat completion with system + user messages

Structured output: Optional response_format using JSON schema (enforced by OpenAI server-side)

Component: Structured Output (Schema)

Three ways to specify schema (mutually exclusive):

1. --schema '{json}' — Direct JSON string on command line

2. --schema-file path.json — JSON file

3. --schema-yaml path.yaml — YAML file (converted to JSON)

Schema is passed to OpenAI as response_format parameter. OpenAI enforces schema compliance server-side — ask does not validate the response. YAML is superset of JSON so JSON files can be passed to --schema-yaml.

Component: Document References

Default mode:

• Wraps each file in ===== filename ===== headers

• Prefixes message: 'I am providing you with N document(s) below'

• Post-question instruction: 'reference them by filename'

With --no-document-references:

• No headers, no preamble, no referencing instruction

• Content and question are concatenated directly

• Used by jobs workflow for clean structured JSON output

main() as Library Function

main() accepts all parameters as keyword arguments and returns the response string. Can be imported and called programmatically — not limited to CLI usage. Progress messages go to stderr; only the response goes to stdout.

version 3