```mermaid
flowchart TD
A[User Prompt + flags] --> B[shinygen CLI / API]
B --> C[Pre-flight checks\nDocker + API key]
C --> D[Load framework defaults,\ndata files, and skills]
D --> E[Iteration loop]
subgraph F[Fresh Docker sandbox per iteration]
direction TB
G[Stage Docker context\n+ Inspect AI task]
H[LLM Agent]
I[Generate app.py / app.R]
J{--screenshot?}
K[Run app on :8000\ninside sandbox]
L[screenshot_helper.py\nwaits 7s and captures screenshot.png]
M[Agent reviews screenshot\nand refines in sandbox]
N[Scorer copies project files\nto results volume]
O[Eval log + usage rows]
E --> G --> H --> I --> J
J -- Yes --> K --> L --> M --> N
J -- No --> N
N --> O
end
O --> P[Extract code from results volume\nor eval log]
O --> Q[Copy agent_last_screenshot.png\nfrom results or eval attachment]
P --> R{--judge-model?}
R -- Yes --> S[Run extracted app on host temp dir]
S --> T[Host Playwright screenshot]
T --> U[External LLM judge]
U --> V{Score ≥ threshold?}
V -- No --> W[Refinement prompt includes\njudge feedback + previous code]
W --> E
V -- Yes --> X[Output directory]
R -- No --> X
Q --> X
style A fill:#6c5ce7,color:#fff,stroke:none
style B fill:#0984e3,color:#fff,stroke:none
style C fill:#74b9ff,color:#fff,stroke:none
style D fill:#636e72,color:#fff,stroke:none
style F fill:#f8f9fa,stroke:#dee2e6,stroke-width:2px
style H fill:#00b894,color:#fff,stroke:none
style I fill:#fdcb6e,color:#2d3436,stroke:none
style Q fill:#00cec9,color:#fff,stroke:none
style U fill:#e17055,color:#fff,stroke:none
style X fill:#00b894,color:#fff,stroke:none
```
# shinygen

> Generate, evaluate, and refine Shiny apps using LLM agents in Docker sandboxes

shinygen turns a text prompt into a production-quality Shiny dashboard — generated, screenshot-tested, and quality-scored — all inside an isolated Docker sandbox.
## Quick Install

```shell
git clone https://github.com/karangathe/shinygen.git
cd shinygen
pip install -e .
```

For screenshot-based evaluation, install the optional extra:

```shell
pip install -e ".[screenshot]"
```

Docker images with Shiny, Playwright, and Chromium are pulled automatically on first run.
## Architecture

The current loop is: generate inside a fresh sandbox → optionally self-evaluate visually → optionally judge externally → refine.
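The control flow above can be sketched in a few lines. This is a minimal illustration under assumed names — `run_loop`, `generate`, `judge`, and `refine` are hypothetical placeholders, not shinygen's actual internals:

```python
# Sketch of the generate -> judge -> refine loop.
# All names here are illustrative placeholders, not shinygen's API.

def run_loop(prompt, generate, judge=None, refine=None,
             threshold=7.0, max_iterations=3):
    """Generate, score, and refine until the threshold is met or budget runs out."""
    for i in range(1, max_iterations + 1):
        app_code = generate(prompt)            # fresh Docker sandbox each round
        if judge is None:                      # no --judge-model: single shot
            return app_code, None, i
        score, feedback = judge(app_code)      # external LLM judge
        if score >= threshold or i == max_iterations:
            return app_code, score, i
        # The next round sees the previous code and the judge's feedback.
        prompt = refine(prompt, app_code, feedback, i)
```

The key design point this captures: refinement is driven entirely by the prompt, so each iteration can still run in a completely fresh sandbox.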
## How it works

- **You run a command** — provide a prompt, pick a model, optionally enable `--screenshot` and `--judge-model`.
- **Pre-flight + setup** — shinygen verifies Docker/API access, resolves the framework and model, and loads bundled skills, custom skills, and any data files.
- **Fresh sandbox generation** — each iteration stages a fresh Docker sandbox, installs agent skills in the agent's native home, and asks Claude Code or Codex CLI to generate `app.py` or `app.R`.
- **Visual self-evaluation** — when screenshot mode is enabled, the agent runs the app inside the sandbox, uses `screenshot_helper.py`, waits 7 seconds before capture, and refines visually before submission.
- **Extraction + optional external judge** — shinygen extracts the app from the results volume or eval log, preserves the final in-agent screenshot as `agent_last_screenshot.png`, and, if `--judge-model` is set, runs the extracted app on the host for a separate Playwright screenshot and external LLM score. Failed scores feed a refinement prompt that includes both judge feedback and the previous code.
- **Final output** — the final app, screenshots, eval logs, and a structured run summary are written to your output directory.
With `max_iterations > 1`, each round runs in a fresh sandbox. The refinement prompt carries forward judge feedback and the previous code, so the agent can improve incrementally rather than starting from scratch.
**Original prompt (iteration 1):**

```text
Create a Shiny dashboard for clinical trials with filters, a trend chart, and a summary table.
```

**Refinement prompt (iteration 2):**

```text
Create a Shiny dashboard for clinical trials with filters, a trend chart, and a summary table.

--- PREVIOUS CODE ---
<the full app.py from the previous iteration>

--- REFINEMENT (iteration 1) ---
A previous version of this app was evaluated and received the following scores.
Please improve the app to address the feedback:

- Functionality: 7.0/10 — Filters work as expected.
- Design: 5.5/10 — Layout is cramped and spacing is inconsistent.
- Code Quality: 6.0/10 — Repeated logic in multiple places.
- UX: 6.2/10 — Table readability is weak.

Focus on improving the lowest-scoring areas while maintaining what already works well.
Produce the complete, improved app file.
```
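A refinement prompt like the one above can be assembled mechanically from the judge's per-category scores. A minimal sketch — the function name, `scores` shape, and exact wording are assumptions modeled on the example, not shinygen's real format:

```python
def build_refinement_prompt(original_prompt, previous_code, iteration, scores):
    """Assemble an iteration-N prompt from per-category judge scores.

    `scores` maps category -> (score, feedback), e.g.
    {"Design": (5.5, "Layout is cramped.")}.  Hypothetical schema.
    """
    lines = [
        original_prompt,
        "--- PREVIOUS CODE ---",
        previous_code,
        f"--- REFINEMENT (iteration {iteration}) ---",
        "A previous version of this app was evaluated and received the following scores.",
        "Please improve the app to address the feedback:",
    ]
    for category, (score, note) in scores.items():
        lines.append(f"- {category}: {score:.1f}/10 — {note}")
    lines += [
        "Focus on improving the lowest-scoring areas while maintaining what already works well.",
        "Produce the complete, improved app file.",
    ]
    return "\n".join(lines)
```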
## What You Get

```text
my-dashboard/
├── app.py                       # Or app.R for Shiny for R
├── data.csv                     # Your data file (if provided)
├── screenshot.png               # Host-side screenshot used by the external judge
├── agent_last_screenshot.png
├── eval_logs/
│   └── *.eval
└── run_summary.json             # Structured score, usage, and artifact metadata
```
The CLI prints a results summary:

```text
Score:  8.25 / 10.00 (after 2 iterations)
Time:   45.2s total (38.1s generate, 7.1s judge)
Tokens: 12,340 input / 3,210 output
Cost:   $0.1842
```
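The same numbers land in `run_summary.json` for programmatic use. A hedged sketch of reading it — the field names here are assumptions, not a documented schema:

```python
import json

# Hypothetical schema: these field names are assumptions, not shinygen's
# documented run_summary.json format.
raw = '''{
  "score": 8.25,
  "iterations": 2,
  "tokens": {"input": 12340, "output": 3210},
  "cost_usd": 0.1842
}'''

# In practice: summary = json.load(open("my-dashboard/run_summary.json"))
summary = json.loads(raw)
line = f"Score: {summary['score']:.2f} / 10.00 (after {summary['iterations']} iterations)"
print(line)  # Score: 8.25 / 10.00 (after 2 iterations)
```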
## Features

### Multiple LLM Agents

Claude Code (Anthropic) and Codex CLI (OpenAI) — pick the model that fits your budget and quality needs.

### Docker Sandboxes

Every generation runs in an isolated container via Inspect AI. No side effects on your host.

### R and Python

Generate Shiny for Python (`app.py`) or Shiny for R (`app.R`) from the same CLI.

### Visual Self-Evaluation

The agent takes Playwright screenshots inside the sandbox, reviews them, and self-corrects layout/styling issues.

### External LLM Judge

A separate model scores apps on functionality, design, code quality, and UX — then triggers refinement if needed.

### Iterative Refinement

Automatically re-generate until the quality threshold is met, up to `--max-iterations`.

### Skills Injection

Pass custom skill files into the agent sandbox to control coding style, component choices, and best practices.

### Web Fetch

Enabled by default. Allows the agent to search the web during generation for up-to-date API docs and examples. Use `--no-web-fetch` to disable.

### Cost & Time Tracking

Token usage, dollar costs, and timing breakdowns are reported per run.
## Quick Start

### CLI

```shell
# Generate a Shiny for Python app with Claude Sonnet
shinygen generate \
  --prompt "Create a sales dashboard with filters by region and product category" \
  --model claude-sonnet \
  --output ./my-dashboard
```

```shell
# Screenshot-based quality evaluation and iteration
shinygen generate \
  --prompt "Create a clinical trials dashboard" \
  --model claude-opus \
  --output ./trials-app \
  --screenshot \
  --judge-model claude-sonnet \
  --max-iterations 5
```

```shell
# Generate a Shiny for R app
shinygen generate \
  --prompt "Create an interactive data explorer" \
  --model claude-sonnet \
  --framework shiny-r \
  --output ./r-app
```

```shell
# Use custom skills and a CSV data file
shinygen generate \
  --prompt "Build a dashboard for this dataset" \
  --model gpt54 \
  --output ./my-app \
  --skills-dir ./my-skills/ \
  --csv-file ./sales.csv
```

`web_fetch` is enabled by default. Use `--no-web-fetch` to disable web search.
### Python API

```python
import shinygen

result = shinygen.generate(
    prompt="Create a sales dashboard with regional filters",
    model="claude-sonnet",
    output_dir="./my-dashboard",
    framework="shiny-python",
    data_csv="./sales.csv",
    screenshot=True,
    judge_model="claude-sonnet",
    max_iterations=5,
)

print(result.app_dir)     # ./my-dashboard
print(result.score)       # 4.5
print(result.iterations)  # 2
```

## Batch Generation
Run the same prompt across multiple models (or different prompts with different settings) in a single command. Each job gets its own output directory.
Create a JSON config file. Each object is one job — same keys as the Python API. You can mix models, frameworks, and prompts freely.
`batch.json`:

```json
[
  {
    "prompt": "Create a clinical trials dashboard with enrollment trends and status filters",
    "model": "claude-sonnet",
    "output_dir": "./batch/sonnet",
    "csv_file": "./test_data_csv_files/trials_short_50.csv",
    "screenshot": true,
    "judge_model": "claude-sonnet",
    "max_iterations": 3
  },
  { "model": "gpt54", "output_dir": "./batch/gpt54", "...": "same prompt/settings" },
  { "model": "claude-opus", "output_dir": "./batch/opus", "...": "same prompt/settings" }
]
```

Then run:
```shell
shinygen batch --config batch.json
```

Output:

```text
Starting batch run with 3 job(s)...
Batch complete: 3 succeeded, 0 failed
  [job 1] ./batch/sonnet  score=8.50 iterations=2 passed=True
  [job 2] ./batch/gpt54   score=7.25 iterations=3 passed=True
  [job 3] ./batch/opus    score=9.00 iterations=1 passed=True
```
The same batch can be run from the Python API:

```python
import shinygen

results = shinygen.batch([
    {
        "prompt": "Clinical trials dashboard with filters and charts",
        "model": "claude-sonnet",
        "output_dir": "./batch/sonnet",
        "data_csv": "./test_data_csv_files/trials_short_50.csv",
        "screenshot": True,
        "judge_model": "claude-sonnet",
    },
    # Add more jobs with different models/prompts...
])

print(f"{results.succeeded} succeeded, {results.failed} failed")
for r in results.results:
    print(f"  {r.app_dir} score={r.score:.2f}")
```

- Jobs run sequentially (no Docker resource contention).
- A failed job doesn't stop the rest.
- Use a different `output_dir` per job.
- Relative paths resolve from the config file's directory.
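The failure-isolation behavior described above can be sketched as a simple sequential runner. This is an illustration only — `run_batch` and `run_one` are hypothetical names, not shinygen's implementation:

```python
def run_batch(jobs, run_one):
    """Run jobs one at a time; record failures instead of aborting the batch."""
    succeeded, failed, results = 0, 0, []
    for job in jobs:
        try:
            results.append(run_one(job))   # e.g. shinygen.generate(**job)
            succeeded += 1
        except Exception as exc:           # a failed job doesn't stop the rest
            results.append(exc)
            failed += 1
    return succeeded, failed, results
```

Running sequentially keeps only one Docker sandbox alive at a time, which avoids contention for container resources on the host.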
## Supported Models

| Alias | Agent | Model ID | Provider |
|---|---|---|---|
| `claude-opus` | Claude Code | `claude-opus-4-6` | Anthropic |
| `claude-sonnet` | Claude Code | `claude-sonnet-4-6` | Anthropic |
| `gpt54` | Codex CLI | `gpt-5.4` | OpenAI |
| `gpt54-mini` | Codex CLI | `gpt-5.4-mini-2026-03-17` | OpenAI |
| `codex-gpt53` | Codex CLI | `gpt-5.3-codex` | OpenAI |
## Cost Comparison

Per 1M tokens. Actual cost per dashboard is typically $0.01–$0.30 depending on model, prompt complexity, and iterations.

| Model | Input ($/1M tokens) | Output ($/1M tokens) |
|---|---|---|
| `claude-opus` | $15.00 | $75.00 |
| `claude-sonnet` | $3.00 | $15.00 |
| `gpt54` | $2.50 | $10.00 |
| `gpt54-mini` | $0.30 | $1.20 |
| `codex-gpt53` | $2.50 | $10.00 |
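Per-run cost is a straight linear combination of the rates above. A small sketch (prices copied from the table; `estimate_cost` is an illustrative helper, not part of shinygen):

```python
# Prices from the table above, USD per 1M tokens: (input_rate, output_rate).
PRICES = {
    "claude-opus":   (15.00, 75.00),
    "claude-sonnet": (3.00, 15.00),
    "gpt54":         (2.50, 10.00),
    "gpt54-mini":    (0.30, 1.20),
    "codex-gpt53":   (2.50, 10.00),
}

def estimate_cost(model, input_tokens, output_tokens):
    """Estimated USD cost for one run's token usage."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# e.g. a run using 12,340 input / 3,210 output tokens on claude-sonnet:
print(f"${estimate_cost('claude-sonnet', 12_340, 3_210):.4f}")  # $0.0852
```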
## Data Inputs

| Option | CLI | Python API | Notes |
|---|---|---|---|
| Single CSV | `--csv-file` | `data_csv` | Convenience shorthand |
| Multiple files | `--data-file` (repeatable) | `data_files` | Any file type |

If both provide the same filename, `--csv-file` takes precedence.
## Requirements

| Requirement | Details |
|---|---|
| Python | 3.10+ |
| Docker | Running daemon |
| API Key | `ANTHROPIC_API_KEY` or `OPENAI_API_KEY` |

```shell
export ANTHROPIC_API_KEY='sk-ant-...'
# or
export OPENAI_API_KEY='sk-...'
```