Web Framework Evaluation

Generating tip calculator apps across 4 Python web frameworks with a single LLM prompt

Prompt & Model

Generator model: claude-sonnet-4-6 via chatlas ChatAnthropic with tool_web_search() enabled

Evaluator model: claude-sonnet-4-6 via Inspect AI

Prompt given to the LLM to generate each app:

Create an app using {framework} that is a basic tip calculator which features a clean interface with input fields for the bill amount and tip percentage (with 15%, 18%, or 20% presets). It must instantly calculate and display the tip amount and the total bill (bill + tip), with an optional, useful feature to split the total among multiple people. Do not use extra packages or CSS or JavaScript other than what {framework} already provides.

The {framework} placeholder was filled with each of: Streamlit, Plotly Dash, Panel, and Shiny for Python.
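
As a rough illustration of the substitution step, the sketch below fills the `{framework}` placeholder with `str.format`. The template is an abbreviated stand-in for the full prompt, and `PROMPT_TEMPLATE`, `FRAMEWORKS`, and `build_prompts` are hypothetical names, not taken from the actual generation script.

```python
# Framework names as listed in this write-up.
FRAMEWORKS = ["Streamlit", "Plotly Dash", "Panel", "Shiny for Python"]

# Abbreviated stand-in for the full prompt; only the placeholder matters here.
PROMPT_TEMPLATE = (
    "Create an app using {framework} that is a basic tip calculator. "
    "Do not use extra packages or CSS or JavaScript other than what "
    "{framework} already provides."
)


def build_prompts(template: str, frameworks: list[str]) -> dict[str, str]:
    """Return one filled-in prompt per framework name (hypothetical helper)."""
    return {name: template.format(framework=name) for name in frameworks}


prompts = build_prompts(PROMPT_TEMPLATE, FRAMEWORKS)
```

Because the placeholder appears twice in the prompt, each framework name is substituted in both positions.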

Frameworks Generated

Framework	Directory	Run Command
Streamlit	`streamlit/`	`streamlit run streamlit/app.py`
Plotly Dash	`dash/`	`python dash/app.py`
Panel	`panel/`	`panel serve panel/app.py`
Shiny for Python	`shiny/`	`shiny run shiny/app.py`

Test Scenario

Each app was started and then given the same input: a bill of $85.50 with a 20% tip, split among 3 people.
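
The values a correct app should display for this scenario can be worked out independently of any framework. The sketch below uses `decimal.Decimal` to avoid floating-point noise; the function name `expected_values` and the half-up rounding to cents are assumptions, since the write-up does not say how the generated apps round.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def expected_values(bill: str, tip_pct: str, num_people: int) -> dict[str, Decimal]:
    """Compute tip, total, and per-person share, rounded to cents."""
    bill_d = Decimal(bill)
    tip = (bill_d * Decimal(tip_pct) / Decimal(100)).quantize(CENT, rounding=ROUND_HALF_UP)
    total = bill_d + tip
    per_person = (total / num_people).quantize(CENT, rounding=ROUND_HALF_UP)
    return {"tip": tip, "total": total, "per_person": per_person}


expected = expected_values("85.50", "20", 3)
# tip = 17.10, total = 102.60, per_person = 34.20
```

These are the figures the evaluator checks against in the after screenshots.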

Before & After Screenshots

Streamlit

Plotly Dash

Panel

Shiny for Python

Evaluation with Inspect AI

Each generated app was scored by claude-sonnet-4-6 (via Inspect AI) on three criteria (1–10 scale). The scorer received the full app source code plus before/after screenshots:

Maintainability – Is the code well-structured, modular, and easy to extend?
Readability – Is it intuitive to read and idiomatic for the framework?
Requirement Adherence – Does it meet every specification and do the screenshots confirm it works?

Score Summary

Framework	Maintainability	Readability	Adherence
Streamlit	8	9	7
Plotly Dash	8	8	10
Panel	7	7	9
Shiny for Python	9	9	10
Mean	8.00	8.25	9.00
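
The Mean row can be reproduced from the four framework rows. This is a minimal sketch that re-enters the table values by hand; the variable names are illustrative only.

```python
from statistics import mean

# Scores copied from the summary table above.
scores = {
    "Streamlit": {"Maintainability": 8, "Readability": 9, "Adherence": 7},
    "Plotly Dash": {"Maintainability": 8, "Readability": 8, "Adherence": 10},
    "Panel": {"Maintainability": 7, "Readability": 7, "Adherence": 9},
    "Shiny for Python": {"Maintainability": 9, "Readability": 9, "Adherence": 10},
}

criteria = ["Maintainability", "Readability", "Adherence"]

# Average each criterion across the four frameworks.
means = {c: mean(row[c] for row in scores.values()) for c in criteria}

for criterion in criteria:
    print(f"{criterion}: {means[criterion]:.2f}")
```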

Streamlit – Evaluation

I’ll analyze the app code and screenshots across the three criteria.

Criterion 1: Maintainability (Score: 8/10)

Strengths:

Code is well-organized with clear section comments using visual separators (# ──)
Logical flow: config → inputs → calculations → outputs → split feature
Session state is used appropriately for the tip percentage persistence
Calculations are isolated into simple, clear formulas
Good use of columns for layout consistency

Weaknesses:

The summary caption at the bottom references num_people which could cause a NameError if split_on is True but the variable is somehow not defined (minor scoping risk)
No functions/helpers — everything is inline, which makes future refactoring slightly harder
The “Splitting … among 2 people” bug (visible in after screenshot) indicates a display logic issue

Criterion 2: Readability (Score: 9/10)

Strengths:

Excellent inline comments with clear section headers
Descriptive variable names (bill_amount, tip_pct, per_person_total, etc.)
Follows idiomatic Streamlit patterns (session state, columns, metrics)
Help text on inputs adds to self-documentation
Logical top-to-bottom flow mirrors the UI layout

Weaknesses:

Minor: the session state sync pattern is slightly confusing. The preset buttons set the value before the slider is created, the slider reads it, and the value is written back after the slider. This is understandable but not immediately obvious.

Criterion 3: Requirement Adherence (Score: 7/10)

Strengths:

✅ Input field for bill amount
✅ 15%, 18%, 20% preset buttons are present
✅ Custom tip slider also available
✅ Instant calculation of tip amount and total bill displayed
✅ Bill splitting feature implemented
✅ No extra packages, CSS, or JavaScript used
✅ Before screenshot confirms clean default state

Weaknesses:

❌ Bug confirmed in after screenshot: “Splitting $102.60 among 2 people” is shown even though 3 people are entered (the number input shows 3 but the text says 2 — likely a Streamlit re-run/state timing issue where num_people lags one interaction)
The split toggle adds a small UX friction — could default to showing the split section more prominently
The “20% — Great” button in the after screenshot appears slightly highlighted but there’s no clear visual indicator of the selected preset

Plotly Dash – Evaluation

I’ll evaluate this Plotly Dash tip calculator app across the three criteria.

Criterion 1: Maintainability

Strengths:

Style constants are centralized at the top using named color variables (PRIMARY, CARD_BG, etc.), making global theme changes easy
The two callbacks are well-separated by concern (one for syncing presets, one for calculations)
The layout is logically organized into clearly named sections (Bill Amount, Tip Percentage, Results, Split)
Style dictionaries use composition ({**preset_btn_base, ...}) to reduce duplication
Good use of constants to avoid magic numbers/strings
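
The style composition mentioned above can be sketched with plain dictionaries. Only the names `PRIMARY`, `preset_btn_base`, and `btn_style` appear in the evaluation; the colour value, the style keys, and the two derived dictionaries are hypothetical.

```python
# Hypothetical colour value; the evaluation only names the constant.
PRIMARY = "#2563eb"

# Shared properties for every preset button (keys are assumed).
preset_btn_base = {
    "padding": "8px 16px",
    "borderRadius": "6px",
    "border": f"1px solid {PRIMARY}",
    "cursor": "pointer",
}

# Each variant unpacks the base dict and adds only what differs.
preset_btn_inactive = {**preset_btn_base, "backgroundColor": "white", "color": PRIMARY}
preset_btn_active = {**preset_btn_base, "backgroundColor": PRIMARY, "color": "white"}


def btn_style(is_active: bool) -> dict[str, str]:
    """Pick the style dict for a preset button based on its state."""
    return preset_btn_active if is_active else preset_btn_inactive
```

Unpacking with `**` copies the base entries, so a change to `preset_btn_base` made before the variants are built carries into both without being repeated.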

Weaknesses:

The large number of style dictionaries (15+) at the module level creates a long file that’s somewhat monolithic — a larger project would benefit from separating styles
The sync_preset callback imports ctx inside the function rather than at module level
Some mild logic duplication in the btn_style inner function

Score: 8/10

Criterion 2: Readability

Strengths:

Excellent use of section comments (# ── Bill Amount ──...) that visually separate layout blocks
Variable names are descriptive and self-documenting (bill_err, tip_amount, per_person_str)
The preset_map dictionary elegantly maps button IDs to values
The file-level comment “Tip Calculator — Plotly Dash App” and section headers guide readers well
Callback logic is clean and follows a clear validate-then-compute pattern

Weaknesses:

The density of style dictionaries near the top is visually overwhelming and may slow down reading the actual app logic
Minor: from dash import ctx inside the callback is unconventional and could confuse readers expecting all imports at the top

Score: 8/10

Criterion 3: Requirement Adherence

Strengths:

✅ Input field for bill amount with proper number input
✅ Input field for tip percentage (custom)
✅ All three preset buttons (15%, 18%, 20%) present and functional
✅ Active preset button highlights correctly (20% shown filled/blue in “after” screenshot)
✅ Instant calculation — tip amount ($17.10) and total bill ($102.60) shown correctly for $85.50 @ 20%
✅ Split among multiple people works correctly ($102.60 ÷ 3 = $34.20 confirmed in screenshot)
✅ No external CSS files, JS files, or extra packages used — only Dash’s built-in inline styles
✅ Input validation with error messages for edge cases
✅ Clean, professional interface

Weaknesses:

Essentially none — all requirements are fully met and verified by screenshots

Score: 10/10

Panel – Evaluation

I’ll evaluate this Panel tip calculator app on three criteria.

Criterion 1: Maintainability (Score: 7/10)

Strengths:

The code is well-organized into logical sections (color palette, widgets, callbacks, result pane, layout) with clear section comments using ── separators
The _result_card function centralizes all calculation logic cleanly
The _update_presets helper avoids repetition across the three preset callbacks
The row() inner function inside _result_card is a nice reusable pattern for result rows
Color constants are defined at the top, making theme changes easy

Weaknesses:

The large color palette block (12+ constants) at the top is somewhat verbose and could be a dict or dataclass
Inline HTML strings with embedded f-strings for styling make future modifications tedious
section_label as a lambda is slightly unconventional (lambdas aren’t ideal for reusable UI components)
The layout nesting is fairly deep, which could complicate future restructuring
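
One way to act on the palette suggestion is to group the constants in a frozen dataclass. This is only a sketch: the field names, the colour values, and the `label_html` helper are hypothetical and do not come from the generated Panel app.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Palette:
    """Theme colours grouped in one immutable object (values are assumed)."""
    primary: str = "#2e7d32"
    card_bg: str = "#ffffff"
    text: str = "#1f2937"
    muted: str = "#6b7280"


PALETTE = Palette()


def label_html(text: str, palette: Palette = PALETTE) -> str:
    """Build a small inline-styled label from the palette."""
    return f'<span style="color:{palette.muted}">{text}</span>'
```

Grouping the colours this way also makes an unused entry easier to spot, since every reference goes through `PALETTE`.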

Criterion 2: Readability (Score: 7/10)

Strengths:

Section headers with ── delimiters make it easy to navigate
Function names are descriptive (on_preset_15, _update_presets, _result_card)
The pn.bind reactive pattern is idiomatic Panel
Logic flow is easy to follow top-to-bottom

Weaknesses:

No docstrings or comments explaining the calculation logic (though it’s simple math)
Heavy use of inline HTML/CSS strings within Python reduces readability
The _result_card function is fairly long and mixes layout construction with business logic
Color constant names like CLR_PRESET_TXT_ON and CLR_PRESET_TXT_OFF are defined but never actually used in the code

Criterion 3: Requirement Adherence (Score: 9/10)

Strengths:

✅ Input field for bill amount (FloatInput)
✅ Tip percentage input with slider
✅ All three presets (15%, 18%, 20%) implemented as buttons with visual highlighting
✅ Instant calculation via pn.bind — tip amount and total bill displayed reactively
✅ Bill splitting feature works correctly (÷3 shown in after screenshot)
✅ Screenshots confirm correct math: $85.50 × 20% = $17.10 tip, $102.60 total, $34.20 per person
✅ No extra packages beyond Panel used
✅ Preset button highlighting updates correctly (20% shown as green in after screenshot)

Minor issues:

Uses custom CSS via styles= dict and inline HTML for styling — this is borderline given the “no extra CSS” requirement, though these are Panel-native mechanisms rather than external stylesheets
Four unused color constants (CLR_PRESET_ON, CLR_PRESET_OFF, CLR_PRESET_TXT_ON, CLR_PRESET_TXT_OFF) are defined but never referenced

Shiny for Python – Evaluation

I’ll evaluate this Shiny for Python tip calculator app on the three criteria.

Criterion 1: Maintainability (Score: 9/10)

Strengths:

Well-structured separation between UI and server logic with clear section comments
Single calc() reactive that centralizes all computation — easy to modify formulas in one place
Logical grouping of UI elements into cards (inputs, results, split)
Preset button handlers are clean and consistent (three separate @reactive.effect handlers)
Good use of max(int(...), 1) guard to prevent division by zero
Named outputs are clearly mapped to their display elements

Minor concerns:

The three preset handlers are slightly repetitive; a helper function could reduce boilerplate, but at this scale it’s acceptable
The calc() dictionary approach works but a dataclass/namedtuple would be more maintainable at larger scale

Criterion 2: Readability (Score: 9/10)

Strengths:

Excellent use of section comments (# ── Header ──, # ── Server ──, etc.) for visual scanning
Variable names (bill, tip_pct, num_people, per_person_tip) are highly descriptive
Follows idiomatic Shiny for Python patterns (@reactive.calc, @render.text, @reactive.effect)
The UI structure mirrors what the user sees, making it easy to map code to visual output
Arithmetic is straightforward and easy to follow

Minor concerns:

_apply_15, _apply_18, _apply_20 private naming convention is fine but the underscore prefix may confuse some readers
A brief comment explaining the or 0.0 / or 1 fallback pattern would help
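
The fallback and guard patterns mentioned in this evaluation can be illustrated without Shiny. The sketch assumes an empty numeric input arrives as `None`; the function names are hypothetical and the generated app may structure this differently.

```python
def safe_bill(raw):
    """Treat an empty input as a zero bill.

    `None or 0.0` evaluates to 0.0, so a missing value never reaches
    the arithmetic.
    """
    return float(raw or 0.0)


def safe_people(raw):
    """Return a head count of at least 1.

    `raw or 1` covers both None and 0; max(..., 1) also catches
    negative numbers, so the later division cannot divide by zero.
    """
    return max(int(raw or 1), 1)


def per_person(total, raw_people):
    """Split a total across the guarded head count."""
    return total / safe_people(raw_people)
```

With these guards, an empty or zero "number of people" field yields the full total rather than an error.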

Criterion 3: Requirement Adherence (Score: 10/10)

Strengths:

✅ Input field for bill amount (numeric, default $50, step $0.01)
✅ Input field for tip percentage (numeric, editable)
✅ All three presets present: 15%, 18%, 20% as action buttons
✅ Instant calculation (reactive, no submit button needed)
✅ Displays tip amount ($17.10 for $85.50 × 20% ✓)
✅ Displays total bill ($102.60 = $85.50 + $17.10 ✓)
✅ Bill splitting among N people ($34.20 = $102.60 ÷ 3 ✓)
✅ No external CSS, JavaScript, or extra packages used
✅ Screenshots confirm correct calculations for the test case
The bonus per-person “split tip only” figure is a nice addition

All calculations verified against the after screenshot are mathematically correct.