Web Framework Evaluation
Generating tip calculator apps across 4 Python web frameworks with a single LLM prompt
Prompt & Model
Generator model: claude-sonnet-4-6 via chatlas ChatAnthropic with tool_web_search() enabled
Evaluator model: claude-sonnet-4-6 via Inspect AI
Prompt given to the LLM to generate each app:
Create an app using {framework} that is a basic tip calculator which features a clean interface with input fields for the bill amount and tip percentage (with 15%, 18%, or 20% presets). It must instantly calculate and display the tip amount and the total bill (bill + tip), with an optional, useful feature to split the total among multiple people. Do not use extra packages or CSS or JavaScript other than what {framework} already provides.
The {framework} placeholder was filled with each of: Streamlit, Plotly Dash, Panel, and Shiny for Python.
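The source does not show how the placeholder was filled. A minimal sketch, assuming plain `str.format` substitution (the names `PROMPT_TEMPLATE`, `FRAMEWORKS`, and `build_prompts` are hypothetical):

```python
# Sketch only: the report does not say how the placeholder was substituted.
# PROMPT_TEMPLATE, FRAMEWORKS, and build_prompts are illustrative names.
PROMPT_TEMPLATE = (
    "Create an app using {framework} that is a basic tip calculator which "
    "features a clean interface with input fields for the bill amount and "
    "tip percentage (with 15%, 18%, or 20% presets). It must instantly "
    "calculate and display the tip amount and the total bill (bill + tip), "
    "with an optional, useful feature to split the total among multiple "
    "people. Do not use extra packages or CSS or JavaScript other than what "
    "{framework} already provides."
)

FRAMEWORKS = ["Streamlit", "Plotly Dash", "Panel", "Shiny for Python"]


def build_prompts(template, frameworks):
    """Return one fully substituted prompt per framework name."""
    return {name: template.format(framework=name) for name in frameworks}


prompts = build_prompts(PROMPT_TEMPLATE, FRAMEWORKS)
```

Both occurrences of `{framework}` are replaced in each prompt, so the "no extra packages" clause always names the same framework as the opening sentence.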
Frameworks Generated
| Framework | Directory | Run Command |
|---|---|---|
| Streamlit | `streamlit/` | `streamlit run streamlit/app.py` |
| Plotly Dash | `dash/` | `python dash/app.py` |
| Panel | `panel/` | `panel serve panel/app.py` |
| Shiny for Python | `shiny/` | `shiny run shiny/app.py` |
Test Scenario
Each app was started and then given the same input: a bill of $85.50 with a 20% tip, split among 3 people.
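The expected values for this scenario can be worked out independently of any app. The sketch below uses `decimal` to avoid floating-point noise; the function name `tip_breakdown` and the half-up rounding are assumptions, not something taken from the generated apps.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def tip_breakdown(bill, tip_pct, people):
    """Return (tip, total, per_person) rounded to cents.

    Hypothetical helper for checking the test scenario by hand.
    """
    bill = Decimal(bill)
    tip = (bill * Decimal(tip_pct) / 100).quantize(CENT, rounding=ROUND_HALF_UP)
    total = bill + tip
    per_person = (total / people).quantize(CENT, rounding=ROUND_HALF_UP)
    return tip, total, per_person


tip, total, per_person = tip_breakdown("85.50", "20", 3)
print(tip, total, per_person)  # 17.10 102.60 34.20
```

These are the figures a correct app should show in its "after" screenshot: a $17.10 tip, a $102.60 total, and $34.20 per person.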
Before & After Screenshots
Streamlit
Plotly Dash
Panel
Shiny for Python
Evaluation with Inspect AI
Each generated app was scored by claude-sonnet-4-6 (via Inspect AI) on three criteria (1–10 scale). The scorer received the full app source code plus before/after screenshots:
- Maintainability – Is the code well-structured, modular, and easy to extend?
- Readability – Is it intuitive to read and idiomatic for the framework?
- Requirement Adherence – Does it meet every specification and do the screenshots confirm it works?
Score Summary
| Framework | Maintainability | Readability | Adherence |
|---|---|---|---|
| Streamlit | 8 | 9 | 7 |
| Plotly Dash | 8 | 8 | 10 |
| Panel | 7 | 7 | 9 |
| Shiny for Python | 9 | 9 | 10 |
| Mean | 8.00 | 8.25 | 9.00 |
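The Mean row can be reproduced from the four per-framework rows. A small sketch, with the scores copied from the table above (the variable names are illustrative):

```python
from statistics import mean

# Scores copied from the summary table: (Maintainability, Readability, Adherence)
scores = {
    "Streamlit": (8, 9, 7),
    "Plotly Dash": (8, 8, 10),
    "Panel": (7, 7, 9),
    "Shiny for Python": (9, 9, 10),
}
criteria = ("Maintainability", "Readability", "Adherence")

# Average each column across the four frameworks.
means = {
    name: mean(row[i] for row in scores.values())
    for i, name in enumerate(criteria)
}

for name, value in means.items():
    print(f"{name}: {value:.2f}")
```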
Streamlit – Evaluation
I’ll analyze the app code and screenshots across the three criteria.
Criterion 1: Maintainability (Score: 8/10)
Strengths:
- Code is well-organized with clear section comments using visual separators (`# ──`)
- Logical flow: config → inputs → calculations → outputs → split feature
- Session state is used appropriately for the tip percentage persistence
- Calculations are isolated into simple, clear formulas
- Good use of columns for layout consistency
Weaknesses:
- The summary caption at the bottom references `num_people`, which could cause a `NameError` if `split_on` is True but the variable is somehow not defined (minor scoping risk)
- No functions/helpers: everything is inline, which makes future refactoring slightly harder
- The “Splitting … among 2 people” bug (visible in after screenshot) indicates a display logic issue
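One way to address the scoping and display concerns above would be to build the caption in a small pure function that receives `num_people` explicitly. This is a hedged sketch, not the generated app's code: `split_caption` is a hypothetical name, and the caption wording is modeled on the text quoted from the screenshot.

```python
from typing import Optional


def split_caption(total: float, split_on: bool, num_people: int = 1) -> Optional[str]:
    """Return the split caption, or None when splitting is switched off.

    Hypothetical helper: every value the caption needs is passed in, so it
    cannot reference a variable that was never defined on this rerun.
    """
    if not split_on:
        return None
    people = max(int(num_people), 1)  # never report fewer than one person
    return f"Splitting ${total:,.2f} among {people} people"


print(split_caption(102.60, True, 3))  # Splitting $102.60 among 3 people
```

Whether this would also remove the stale "2 people" text depends on where the app reads the widget value, which the evaluation does not show.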
Criterion 2: Readability (Score: 9/10)
Strengths:
- Excellent inline comments with clear section headers
- Descriptive variable names (`bill_amount`, `tip_pct`, `per_person_total`, etc.)
- Follows idiomatic Streamlit patterns (session state, columns, metrics)
- Help text on inputs adds to self-documentation
- Logical top-to-bottom flow mirrors the UI layout
Weaknesses:
- Minor: the session state sync pattern (setting after slider) is slightly confusing since the buttons set it before and the slider reads it — understandable but not immediately obvious
Criterion 3: Requirement Adherence (Score: 8/10)
Strengths:
- ✅ Input field for bill amount
- ✅ 15%, 18%, 20% preset buttons are present
- ✅ Custom tip slider also available
- ✅ Instant calculation of tip amount and total bill displayed
- ✅ Bill splitting feature implemented
- ✅ No extra packages, CSS, or JavaScript used
- ✅ Before screenshot confirms clean default state
Weaknesses:
- ❌ Bug confirmed in after screenshot: “Splitting $102.60 among 2 people” is shown even though 3 people are entered (the number input shows 3 but the text says 2; likely a Streamlit re-run/state timing issue where `num_people` lags one interaction)
- The split toggle adds a small UX friction; it could default to showing the split section more prominently
- The “20% — Great” button in the after screenshot appears slightly highlighted but there’s no clear visual indicator of the selected preset
Plotly Dash – Evaluation
I’ll evaluate this Plotly Dash tip calculator app across the three criteria.
Criterion 1: Maintainability
Strengths:
- Style constants are centralized at the top using named color variables (`PRIMARY`, `CARD_BG`, etc.), making global theme changes easy
- The two callbacks are well-separated by concern (one for syncing presets, one for calculations)
- The layout is logically organized into clearly named sections (Bill Amount, Tip Percentage, Results, Split)
- Style dictionaries use composition (`{**preset_btn_base, ...}`) to reduce duplication
- Good use of constants to avoid magic numbers/strings
Weaknesses:
- The large number of style dictionaries (15+) at the module level creates a long file that’s somewhat monolithic — a larger project would benefit from separating styles
- The `sync_preset` callback imports `ctx` inside the function rather than at module level
- Some mild logic duplication in the `btn_style` inner function
Score: 8/10
Criterion 2: Readability
Strengths:
- Excellent use of section comments (`# ── Bill Amount ──...`) that visually separate layout blocks
- Variable names are descriptive and self-documenting (`bill_err`, `tip_amount`, `per_person_str`)
- The `preset_map` dictionary elegantly maps button IDs to values
- The file-level comment “Tip Calculator — Plotly Dash App” and section headers guide readers well
- Callback logic is clean and follows a clear validate-then-compute pattern
Weaknesses:
- The density of style dictionaries near the top is visually overwhelming and may slow down reading the actual app logic
- Minor: `from dash import ctx` inside the callback is unconventional and could confuse readers expecting all imports at the top
Score: 8/10
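The preset-map and validate-then-compute patterns praised above can be illustrated without Dash itself. The sketch below is an assumption-laden stand-in, not the generated app: the button IDs, the error messages, and the function names `tip_from_trigger` and `validate_then_compute` are all hypothetical.

```python
# Hypothetical button IDs mapped to tip percentages, mirroring the idea of
# the app's preset_map (the real IDs are not shown in the evaluation).
PRESET_MAP = {"preset-15": 15, "preset-18": 18, "preset-20": 20}


def tip_from_trigger(triggered_id, current_pct):
    """Return the preset for the clicked button, else keep the current value."""
    return PRESET_MAP.get(triggered_id, current_pct)


def validate_then_compute(bill, tip_pct):
    """Validate inputs first; compute only when both are usable.

    Returns (result, error): exactly one of the two is None.
    """
    if bill is None or bill < 0:
        return None, "Enter a bill amount of 0 or more."
    if tip_pct is None or tip_pct < 0:
        return None, "Enter a tip percentage of 0 or more."
    tip = round(bill * tip_pct / 100, 2)
    return {"tip": tip, "total": round(bill + tip, 2)}, None


pct = tip_from_trigger("preset-20", current_pct=15)
result, error = validate_then_compute(85.50, pct)
```

In a Dash callback, the triggered ID would typically come from the callback context; here it is passed in directly so the logic can be read and tested on its own.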
Criterion 3: Requirement Adherence
Strengths:
- ✅ Input field for bill amount with proper number input
- ✅ Input field for tip percentage (custom)
- ✅ All three preset buttons (15%, 18%, 20%) present and functional
- ✅ Active preset button highlights correctly (20% shown filled/blue in “after” screenshot)
- ✅ Instant calculation — tip amount ($17.10) and total bill ($102.60) shown correctly for $85.50 @ 20%
- ✅ Split among multiple people works correctly ($102.60 ÷ 3 = $34.20 confirmed in screenshot)
- ✅ No external CSS files, JS files, or extra packages used — only Dash’s built-in inline styles
- ✅ Input validation with error messages for edge cases
- ✅ Clean, professional interface
Weaknesses:
- Essentially none — all requirements are fully met and verified by screenshots
Score: 10/10
Panel – Evaluation
I’ll evaluate this Panel tip calculator app on three criteria.
Criterion 1: Maintainability (Score: 7/10)
Strengths:
- The code is well-organized into logical sections (color palette, widgets, callbacks, result pane, layout) with clear section comments using `──` separators
- The `_result_card` function centralizes all calculation logic cleanly
- The `_update_presets` helper avoids repetition across the three preset callbacks
- The `row()` inner function inside `_result_card` is a nice reusable pattern for result rows
- Color constants are defined at the top, making theme changes easy
Weaknesses:
- The large color palette block (12+ constants) at the top is somewhat verbose and could be a dict or dataclass
- Inline HTML strings with embedded f-strings for styling make future modifications tedious
- `section_label` as a lambda is slightly unconventional (lambdas aren’t ideal for reusable UI components)
- The layout nesting is fairly deep, which could complicate future restructuring
Criterion 2: Readability (Score: 7/10)
Strengths:
- Section headers with `──` delimiters make it easy to navigate
- Function names are descriptive (`on_preset_15`, `_update_presets`, `_result_card`)
- The `pn.bind` reactive pattern is idiomatic Panel
- Logic flow is easy to follow top-to-bottom
Weaknesses:
- No docstrings or comments explaining the calculation logic (though it’s simple math)
- Heavy use of inline HTML/CSS strings within Python reduces readability
- The `_result_card` function is fairly long and mixes layout construction with business logic
- Color constant names like `CLR_PRESET_TXT_ON` and `CLR_PRESET_TXT_OFF` are defined but never actually used in the code
Criterion 3: Requirement Adherence (Score: 9/10)
Strengths:
- ✅ Input field for bill amount (FloatInput)
- ✅ Tip percentage input with slider
- ✅ All three presets (15%, 18%, 20%) implemented as buttons with visual highlighting
- ✅ Instant calculation via `pn.bind`: tip amount and total bill displayed reactively
- ✅ Bill splitting feature works correctly (÷3 shown in after screenshot)
- ✅ Screenshots confirm correct math: $85.50 × 20% = $17.10 tip, $102.60 total, $34.20 per person
- ✅ No extra packages beyond Panel used
- ✅ Preset button highlighting updates correctly (20% shown as green in after screenshot)
Minor issues:
- Uses custom CSS via `styles=` dict and inline HTML for styling; this is borderline given the “no extra CSS” requirement, though these are Panel-native mechanisms rather than external stylesheets
- Two unused color constants (`CLR_PRESET_ON`, `CLR_PRESET_OFF`, `CLR_PRESET_TXT_ON`, `CLR_PRESET_TXT_OFF`) defined but not referenced
Shiny for Python – Evaluation
I’ll evaluate this Shiny for Python tip calculator app on the three criteria.
Criterion 1: Maintainability (Score: 9/10)
Strengths:
- Well-structured separation between UI and server logic with clear section comments
- Single `calc()` reactive that centralizes all computation, making formulas easy to modify in one place
- Logical grouping of UI elements into cards (inputs, results, split)
- Preset button handlers are clean and consistent (three separate `@reactive.effect` handlers)
- Good use of `max(int(...), 1)` guard to prevent division by zero
- Named outputs are clearly mapped to their display elements
Minor concerns:
- The three preset handlers are slightly repetitive; a helper function could reduce boilerplate, but at this scale it’s acceptable
- The `calc()` dictionary approach works but a dataclass/namedtuple would be more maintainable at larger scale
Criterion 2: Readability (Score: 9/10)
Strengths:
- Excellent use of section comments (`# ── Header ──`, `# ── Server ──`, etc.) for visual scanning
- Variable names (`bill`, `tip_pct`, `num_people`, `per_person_tip`) are highly descriptive
- Follows idiomatic Shiny for Python patterns (`@reactive.calc`, `@render.text`, `@reactive.effect`)
- The UI structure mirrors what the user sees, making it easy to map code to visual output
- Arithmetic is straightforward and easy to follow
Minor concerns:
- `_apply_15`, `_apply_18`, `_apply_20` private naming convention is fine but the underscore prefix may confuse some readers
- A brief comment explaining the `or 0.0` / `or 1` fallback pattern would help
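The fallback and guard patterns mentioned in this evaluation, together with the suggested named-tuple return type, can be sketched in plain Python. This is an illustration under stated assumptions, not the generated app: it assumes a cleared numeric input arrives as `None`, and the `Split` type and this standalone `calc` signature are hypothetical.

```python
from typing import NamedTuple, Optional


class Split(NamedTuple):
    """Hypothetical typed result, in place of a plain dictionary."""
    tip: float
    total: float
    per_person_total: float
    per_person_tip: float


def calc(bill: Optional[float], tip_pct: Optional[float],
         num_people: Optional[int]) -> Split:
    # `x or default` swaps in the default when x is None (assumed to mean
    # an empty input) and also when x is 0, which is harmless here.
    bill = bill or 0.0
    tip_pct = tip_pct or 0.0
    # Guard against dividing by zero or a negative head count.
    people = max(int(num_people or 1), 1)

    tip = bill * tip_pct / 100
    total = bill + tip
    return Split(tip, total, total / people, tip / people)


result = calc(85.50, 20, 3)
print(f"{result.total:.2f} {result.per_person_total:.2f}")  # 102.60 34.20
```

Named fields such as `result.per_person_tip` fail loudly on a typo, whereas a misspelled dictionary key only surfaces at lookup time in whichever output happens to use it.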
Criterion 3: Requirement Adherence (Score: 10/10)
Strengths:
- ✅ Input field for bill amount (numeric, default $50, step $0.01)
- ✅ Input field for tip percentage (numeric, editable)
- ✅ All three presets present: 15%, 18%, 20% as action buttons
- ✅ Instant calculation (reactive, no submit button needed)
- ✅ Displays tip amount ($17.10 for $85.50 × 20% ✓)
- ✅ Displays total bill ($102.60 = $85.50 + $17.10 ✓)
- ✅ Bill splitting among N people ($34.20 = $102.60 ÷ 3 ✓)
- ✅ No external CSS, JavaScript, or extra packages used
- ✅ Screenshots confirm correct calculations for the test case
- The bonus “split tip only” per person is a nice addition
All calculations verified against the after screenshot are mathematically correct.