```mermaid
flowchart TD
A[User Prompt + flags] --> B[shinygen CLI / API]
B --> C[Pre-flight checks\nDocker + API key]
C --> D[Load framework defaults,\ndata files, and skills]
D --> E[Iteration loop]
subgraph F[Fresh Docker sandbox per iteration]
direction TB
G[Stage Docker context\n+ Inspect AI task]
H[LLM Agent]
I[Generate app.py / app.R]
J{--screenshot?}
K[Run app on :8000\ninside sandbox]
L[screenshot_helper.py\nwaits 7s and captures screenshot.png]
M[Agent reviews screenshot\nand refines in sandbox]
N[Scorer copies project files\nto results volume]
O[Eval log + usage rows]
E --> G --> H --> I --> J
J -- Yes --> K --> L --> M --> N
J -- No --> N
N --> O
end
O --> P[Extract code from results volume\nor eval log]
O --> Q[Copy agent_last_screenshot.png\nfrom results or eval attachment]
P --> R{--judge-model?}
R -- Yes --> S[Run extracted app on host temp dir]
S --> T[Host Playwright screenshot]
T --> U[External LLM judge]
U --> V{Score ≥ threshold?}
V -- No --> W[Refinement prompt includes\njudge feedback + previous code]
W --> E
V -- Yes --> X[Output directory]
R -- No --> X
Q --> X
style A fill:#6c5ce7,color:#fff,stroke:none
style B fill:#0984e3,color:#fff,stroke:none
style C fill:#74b9ff,color:#fff,stroke:none
style D fill:#636e72,color:#fff,stroke:none
style F fill:#f8f9fa,stroke:#dee2e6,stroke-width:2px
style H fill:#00b894,color:#fff,stroke:none
style I fill:#fdcb6e,color:#2d3436,stroke:none
style Q fill:#00cec9,color:#fff,stroke:none
style U fill:#e17055,color:#fff,stroke:none
style X fill:#00b894,color:#fff,stroke:none
```
# shinygen

> Generate, evaluate, and refine Shiny apps using LLM agents in Docker sandboxes

shinygen turns a text prompt into a production-quality Shiny dashboard — generated, screenshot-tested, and quality-scored — all inside an isolated Docker sandbox.
## Quick Install

```shell
git clone https://github.com/karangathe/shinygen.git
cd shinygen
pip install -e .
```

For screenshot-based evaluation, install the optional extra:

```shell
pip install -e ".[screenshot]"
```

Docker images with Shiny, Playwright, and Chromium are pulled automatically on first run.
## Architecture

The current loop is: generate inside a fresh sandbox → optionally self-evaluate visually → optionally judge externally → refine.
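The control flow above can be sketched in a few lines. This is a minimal illustration under assumed names — `run_loop`, `generate`, `judge`, and `refine` are hypothetical placeholders, not shinygen's actual internals:

```python
# Sketch of the generate -> judge -> refine loop.
# All names here are illustrative placeholders, not shinygen's API.

def run_loop(prompt, generate, judge=None, refine=None,
             threshold=7.0, max_iterations=3):
    """Generate, score, and refine until the threshold is met or budget runs out."""
    for i in range(1, max_iterations + 1):
        app_code = generate(prompt)            # fresh Docker sandbox each round
        if judge is None:                      # no --judge-model: single shot
            return app_code, None, i
        score, feedback = judge(app_code)      # external LLM judge
        if score >= threshold or i == max_iterations:
            return app_code, score, i
        # The next round sees the previous code and the judge's feedback.
        prompt = refine(prompt, app_code, feedback, i)
```

The key design point this captures: refinement is driven entirely by the prompt, so each iteration can still run in a completely fresh sandbox.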
## How it works

- **You run a command** — provide a prompt, pick a model, optionally enable `--screenshot` and `--judge-model`.
- **Pre-flight + setup** — shinygen verifies Docker/API access, resolves the framework and model, and loads bundled skills, custom skills, and any data files.
- **Fresh sandbox generation** — each iteration stages a fresh Docker sandbox, installs agent skills in the agent's native home, and asks Claude Code or Codex CLI to generate `app.py` or `app.R`.
- **Visual self-evaluation** — when screenshot mode is enabled, the agent runs the app inside the sandbox, uses `screenshot_helper.py`, waits 7 seconds before capture, and refines visually before submission.
- **Extraction + optional external judge** — shinygen extracts the app from the results volume or eval log, preserves the final in-agent screenshot as `agent_last_screenshot.png`, and, if `--judge-model` is set, runs the extracted app on the host for a separate Playwright screenshot and external LLM score. Failed scores feed a refinement prompt that includes both judge feedback and the previous code.
- **Final output** — the final app, screenshots, eval logs, and a structured run summary are written to your output directory.
With `max_iterations > 1`, each round runs in a fresh sandbox. The refinement prompt carries forward judge feedback and the previous code, so the agent can improve incrementally rather than starting from scratch.
**Original prompt (iteration 1):**

```text
Create a Shiny dashboard for clinical trials with filters, a trend chart, and a summary table.
```

**Refinement prompt (iteration 2):**

```text
Create a Shiny dashboard for clinical trials with filters, a trend chart, and a summary table.

--- PREVIOUS CODE ---
<the full app.py from the previous iteration>

--- REFINEMENT (iteration 1) ---
A previous version of this app was evaluated and received the following scores.
Please improve the app to address the feedback:

- Functionality: 7.0/10 — Filters work as expected.
- Design: 5.5/10 — Layout is cramped and spacing is inconsistent.
- Code Quality: 6.0/10 — Repeated logic in multiple places.
- UX: 6.2/10 — Table readability is weak.

Focus on improving the lowest-scoring areas while maintaining what already works well.
Produce the complete, improved app file.
```
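A refinement prompt like the one above can be assembled mechanically from the judge's per-category scores. A minimal sketch — the function name, `scores` shape, and exact wording are assumptions modeled on the example, not shinygen's real format:

```python
def build_refinement_prompt(original_prompt, previous_code, iteration, scores):
    """Assemble an iteration-N prompt from per-category judge scores.

    `scores` maps category -> (score, feedback), e.g.
    {"Design": (5.5, "Layout is cramped.")}.  Hypothetical schema.
    """
    lines = [
        original_prompt,
        "--- PREVIOUS CODE ---",
        previous_code,
        f"--- REFINEMENT (iteration {iteration}) ---",
        "A previous version of this app was evaluated and received the following scores.",
        "Please improve the app to address the feedback:",
    ]
    for category, (score, note) in scores.items():
        lines.append(f"- {category}: {score:.1f}/10 — {note}")
    lines += [
        "Focus on improving the lowest-scoring areas while maintaining what already works well.",
        "Produce the complete, improved app file.",
    ]
    return "\n".join(lines)
```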
## What You Get

```text
my-dashboard/
├── app.py                       # Or app.R for Shiny for R
├── data.csv                     # Your data file (if provided)
├── screenshot.png               # Host-side screenshot used by the external judge
├── agent_last_screenshot.png
├── eval_logs/
│   └── *.eval
└── run_summary.json             # Structured score, usage, and artifact metadata
```
The CLI prints a results summary:

```text
Score:  8.25 / 10.00 (after 2 iterations)
Time:   45.2s total (38.1s generate, 7.1s judge)
Tokens: 12,340 input / 3,210 output
Cost:   $0.1842
```
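The same numbers land in `run_summary.json` for programmatic use. A hedged sketch of reading it — the field names here are assumptions, not a documented schema:

```python
import json

# Hypothetical schema: these field names are assumptions, not shinygen's
# documented run_summary.json format.
raw = '''{
  "score": 8.25,
  "iterations": 2,
  "tokens": {"input": 12340, "output": 3210},
  "cost_usd": 0.1842
}'''

# In practice: summary = json.load(open("my-dashboard/run_summary.json"))
summary = json.loads(raw)
line = f"Score: {summary['score']:.2f} / 10.00 (after {summary['iterations']} iterations)"
print(line)  # Score: 8.25 / 10.00 (after 2 iterations)
```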
## Features

### Multiple LLM Agents

Claude Code (Anthropic) and Codex CLI (OpenAI) — pick the model that fits your budget and quality needs.

### Docker Sandboxes

Every generation runs in an isolated container via Inspect AI. No side effects on your host.

### R and Python

Generate Shiny for Python (`app.py`) or Shiny for R (`app.R`) from the same CLI.

### Visual Self-Evaluation

The agent takes Playwright screenshots inside the sandbox, reviews them, and self-corrects layout/styling issues.

### External LLM Judge

A separate model scores apps on functionality, design, code quality, and UX — then triggers refinement if needed.

### Iterative Refinement

Automatically re-generate until the quality threshold is met, up to `--max-iterations`.

### Skills Injection

Pass custom skill files into the agent sandbox to control coding style, component choices, and best practices.

### Web Fetch

Enabled by default. Allows the agent to search the web during generation for up-to-date API docs and examples. Use `--no-web-fetch` to disable.

### Cost & Time Tracking

Token usage, dollar costs, and timing breakdowns are reported per run.
## Quick Start

### CLI

```shell
# Generate a Shiny for Python app with Claude Sonnet
shinygen generate \
  --prompt "Create a sales dashboard with filters by region and product category" \
  --model claude-sonnet \
  --output ./my-dashboard
```

```shell
# Screenshot-based quality evaluation and iteration
shinygen generate \
  --prompt "Create a clinical trials dashboard" \
  --model claude-opus \
  --output ./trials-app \
  --screenshot \
  --judge-model claude-sonnet \
  --max-iterations 5
```

```shell
# Generate a Shiny for R app
shinygen generate \
  --prompt "Create an interactive data explorer" \
  --model claude-sonnet \
  --framework shiny-r \
  --output ./r-app
```

```shell
# Use custom skills and a CSV data file
shinygen generate \
  --prompt "Build a dashboard for this dataset" \
  --model gpt54 \
  --output ./my-app \
  --skills-dir ./my-skills/ \
  --csv-file ./sales.csv
```

`web_fetch` is enabled by default. Use `--no-web-fetch` to disable web search.
### Python API

```python
import shinygen

result = shinygen.generate(
    prompt="Create a sales dashboard with regional filters",
    model="claude-sonnet",
    output_dir="./my-dashboard",
    framework="shiny-python",
    data_csv="./sales.csv",
    screenshot=True,
    judge_model="claude-sonnet",
    max_iterations=5,
)

print(result.app_dir)     # ./my-dashboard
print(result.score)       # 4.5
print(result.iterations)  # 2
```

## Batch Generation
Run the same prompt across multiple models (or different prompts with different settings) in a single command. Each job gets its own output directory.
Create a JSON config file. Each object is one job — same keys as the Python API. You can mix models, frameworks, and prompts freely.
`batch.json`:

```json
[
  {
    "prompt": "Create a clinical trials dashboard with enrollment trends and status filters",
    "model": "claude-sonnet",
    "output_dir": "./batch/sonnet",
    "csv_file": "./test_data_csv_files/trials_short_50.csv",
    "screenshot": true,
    "judge_model": "claude-sonnet",
    "max_iterations": 3
  },
  { "model": "gpt54", "output_dir": "./batch/gpt54", "...": "same prompt/settings" },
  { "model": "claude-opus", "output_dir": "./batch/opus", "...": "same prompt/settings" }
]
```

Then run:
```shell
shinygen batch --config batch.json
```

Output:

```text
Starting batch run with 3 job(s)...
Batch complete: 3 succeeded, 0 failed
  [job 1] ./batch/sonnet  score=8.50 iterations=2 passed=True
  [job 2] ./batch/gpt54   score=7.25 iterations=3 passed=True
  [job 3] ./batch/opus    score=9.00 iterations=1 passed=True
```
The same batch can be run from the Python API:

```python
import shinygen

results = shinygen.batch([
    {
        "prompt": "Clinical trials dashboard with filters and charts",
        "model": "claude-sonnet",
        "output_dir": "./batch/sonnet",
        "data_csv": "./test_data_csv_files/trials_short_50.csv",
        "screenshot": True,
        "judge_model": "claude-sonnet",
    },
    # Add more jobs with different models/prompts...
])

print(f"{results.succeeded} succeeded, {results.failed} failed")
for r in results.results:
    print(f"  {r.app_dir} score={r.score:.2f}")
```

- Jobs run sequentially (no Docker resource contention).
- A failed job doesn't stop the rest.
- Use a different `output_dir` per job.
- Relative paths resolve from the config file's directory.
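The failure-isolation behavior described above can be sketched as a simple sequential runner. This is an illustration only — `run_batch` and `run_one` are hypothetical names, not shinygen's implementation:

```python
def run_batch(jobs, run_one):
    """Run jobs one at a time; record failures instead of aborting the batch."""
    succeeded, failed, results = 0, 0, []
    for job in jobs:
        try:
            results.append(run_one(job))   # e.g. shinygen.generate(**job)
            succeeded += 1
        except Exception as exc:           # a failed job doesn't stop the rest
            results.append(exc)
            failed += 1
    return succeeded, failed, results
```

Running sequentially keeps only one Docker sandbox alive at a time, which avoids contention for container resources on the host.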
## Supported Models

| Alias | Agent | Model ID | Provider |
|---|---|---|---|
| `claude-opus` | Claude Code | `claude-opus-4-6` | Anthropic |
| `claude-sonnet` | Claude Code | `claude-sonnet-4-6` | Anthropic |
| `gpt54` | Codex CLI | `gpt-5.4` | OpenAI |
| `gpt54-mini` | Codex CLI | `gpt-5.4-mini-2026-03-17` | OpenAI |
| `codex-gpt53` | Codex CLI | `gpt-5.3-codex` | OpenAI |
## Cost Comparison

Per 1M tokens. Actual cost per dashboard is typically $0.01–$0.30 depending on model, prompt complexity, and iterations.

| Model | Input ($/1M tokens) | Output ($/1M tokens) |
|---|---|---|
| `claude-opus` | $15.00 | $75.00 |
| `claude-sonnet` | $3.00 | $15.00 |
| `gpt54` | $2.50 | $10.00 |
| `gpt54-mini` | $0.30 | $1.20 |
| `codex-gpt53` | $2.50 | $10.00 |
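Per-run cost is a straight linear combination of the rates above. A small sketch (prices copied from the table; `estimate_cost` is an illustrative helper, not part of shinygen):

```python
# Prices from the table above, USD per 1M tokens: (input_rate, output_rate).
PRICES = {
    "claude-opus":   (15.00, 75.00),
    "claude-sonnet": (3.00, 15.00),
    "gpt54":         (2.50, 10.00),
    "gpt54-mini":    (0.30, 1.20),
    "codex-gpt53":   (2.50, 10.00),
}

def estimate_cost(model, input_tokens, output_tokens):
    """Estimated USD cost for one run's token usage."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# e.g. a run using 12,340 input / 3,210 output tokens on claude-sonnet:
print(f"${estimate_cost('claude-sonnet', 12_340, 3_210):.4f}")  # $0.0852
```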
## Data Inputs

| Option | CLI | Python API | Notes |
|---|---|---|---|
| Single CSV | `--csv-file` | `data_csv` | Convenience shorthand |
| Multiple files | `--data-file` (repeatable) | `data_files` | Any file type |

If both provide the same filename, `--csv-file` takes precedence.
## Requirements

| Requirement | Details |
|---|---|
| Python | 3.10+ |
| Docker | Running daemon |
| API Key | `ANTHROPIC_API_KEY` or `OPENAI_API_KEY` |

```shell
export ANTHROPIC_API_KEY='sk-ant-...'
# or
export OPENAI_API_KEY='sk-...'
```