1. User invokes ask_with_files_structured.py with files, question, and optional system prompt/schema
2. Script extracts text from all input files using format-specific parsers
3. Prepares formatted message with file contents, system prompt, and question
4. Sends to OpenAI API (via openai Python library)
5. If schema provided, OpenAI enforces it server-side via response_format (not validated locally)
6. Outputs response to stdout or file
Each supported file type has a dedicated extraction function:
• extract_text_from_txt() — Plain text, code, log, CSV, JSON, XML
• extract_text_from_docx() — Microsoft Word documents (via python-docx)
• extract_text_from_pdf() — PDF files (tries pdfplumber, pypdf, PyPDF2 in order)
• extract_text_from_odt() — OpenDocument text (via odfpy)
• extract_text_from_html() — HTML parsing with tag/script/style removal (via BeautifulSoup)
• extract_text_from_markdown() — Markdown → HTML → text conversion (via markdown + BeautifulSoup)
Non-text files (images, audio) return None; upstream provides metadata placeholder to LLM. Duplicate filenames are rejected (ValueError) since LLM references by filename.
API Key retrieval (in priority order):
1. OPENAI_API_KEY environment variable
2. ~/.env_data GDBM file via gdata library (keys: 'api_key' or 'OPENAI_API_KEY')
Note: gdata is a custom GDBM wrapper library (from the gdata-server project, not Google gdata).
Library: openai Python package
Models: Configurable via -m flag, default gpt-4o-mini
Request format: Chat completion with system + user messages
Structured output: Optional response_format using JSON schema (enforced by OpenAI server-side)
Three ways to specify schema (mutually exclusive):
1. --schema '{json}' — Direct JSON string on command line
2. --schema-file path.json — JSON file
3. --schema-yaml path.yaml — YAML file (converted to JSON)
Schema is passed to OpenAI as response_format parameter. OpenAI enforces schema compliance server-side — ask does not validate the response. YAML is superset of JSON so JSON files can be passed to --schema-yaml.
Default mode:
• Wraps each file in ===== filename ===== headers
• Prefixes message: 'I am providing you with N document(s) below'
• Post-question instruction: 'reference them by filename'
With --no-document-references:
• No headers, no preamble, no referencing instruction
• Content and question are concatenated directly
• Used by jobs workflow for clean structured JSON output
main() accepts all parameters as keyword arguments and returns the response string. Can be imported and called programmatically — not limited to CLI usage. Progress messages go to stderr; only the response goes to stdout.