Web Framework Evaluation

Generating tip calculator apps across 4 Python web frameworks with a single LLM prompt

Prompt & Model

Generator model: claude-sonnet-4-6 via chatlas ChatAnthropic with tool_web_search() enabled

Evaluator model: claude-sonnet-4-6 via Inspect AI

Prompt given to the LLM to generate each app:

Create an app using {framework} that is a basic tip calculator which features a clean interface with input fields for the bill amount and tip percentage (with 15%, 18%, or 20% presets). It must instantly calculate and display the tip amount and the total bill (bill + tip), with an optional, useful feature to split the total among multiple people. Do not use extra packages or CSS or JavaScript other than what {framework} already provides.

The {framework} placeholder was filled with each of: Streamlit, Plotly Dash, Panel, and Shiny for Python.
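
The substitution step can be sketched in a few lines of Python. This is only an illustration of how one template yields four prompts, not the script used for the evaluation; the names PROMPT_TEMPLATE, FRAMEWORKS, and build_prompts are assumptions introduced here.

```python
# Hypothetical names; the prompt text is copied from the section above.
PROMPT_TEMPLATE = (
    "Create an app using {framework} that is a basic tip calculator which "
    "features a clean interface with input fields for the bill amount and "
    "tip percentage (with 15%, 18%, or 20% presets). It must instantly "
    "calculate and display the tip amount and the total bill (bill + tip), "
    "with an optional, useful feature to split the total among multiple "
    "people. Do not use extra packages or CSS or JavaScript other than what "
    "{framework} already provides."
)

FRAMEWORKS = ["Streamlit", "Plotly Dash", "Panel", "Shiny for Python"]


def build_prompts(template: str = PROMPT_TEMPLATE) -> dict[str, str]:
    """Return one fully substituted prompt per framework."""
    return {name: template.format(framework=name) for name in FRAMEWORKS}


prompts = build_prompts()
```

Because the template names the framework twice, each generated prompt mentions its framework in both the opening instruction and the closing restriction.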

Frameworks Generated

Framework         Directory   Run Command
Streamlit         streamlit/  streamlit run streamlit/app.py
Plotly Dash       dash/       python dash/app.py
Panel             panel/      panel serve panel/app.py
Shiny for Python  shiny/      shiny run shiny/app.py

Test Scenario

Each app was started and given the same input: a bill of $85.50 with a 20% tip, split among 3 people.
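
The reference values for this scenario can be worked out independently of any app. The sketch below uses Decimal to avoid floating-point noise; the function name expected_values and the half-up rounding to whole cents are assumptions, since the prompt does not specify a rounding rule.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def expected_values(bill: str, tip_pct: int, people: int) -> dict[str, Decimal]:
    """Compute reference values for a scenario, rounded to whole cents."""
    bill_amount = Decimal(bill)
    tip = (bill_amount * tip_pct / 100).quantize(CENT, rounding=ROUND_HALF_UP)
    total = bill_amount + tip
    per_person = (total / people).quantize(CENT, rounding=ROUND_HALF_UP)
    return {"tip": tip, "total": total, "per_person": per_person}


scenario = expected_values("85.50", 20, 3)
print(scenario)
```

For the scenario above this gives a tip of 17.10, a total of 102.60, and 34.20 per person, which are the figures the evaluations below check against the screenshots.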

Before & After Screenshots

Streamlit

Before

After

Plotly Dash

Before

After

Panel

Before

After

Shiny for Python

Before

After

Evaluation with Inspect AI

Each generated app was scored by claude-sonnet-4-6 (via Inspect AI) on three criteria (1–10 scale). The scorer received the full app source code plus before/after screenshots:

  • Maintainability – Is the code well-structured, modular, and easy to extend?
  • Readability – Is it intuitive to read and idiomatic for the framework?
  • Requirement Adherence – Does it meet every specification and do the screenshots confirm it works?

Score Summary

Framework         Maintainability  Readability  Adherence
Streamlit         8                9            7
Plotly Dash       8                8            10
Panel             7                7            9
Shiny for Python  9                9            10
Mean              8.00             8.25         9.00
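
The Mean row can be reproduced from the four framework rows. The snippet below is a small check using only the scores in the table; the variable names are illustrative.

```python
from statistics import mean

# Scores copied from the summary table: (maintainability, readability, adherence).
scores = {
    "Streamlit": (8, 9, 7),
    "Plotly Dash": (8, 8, 10),
    "Panel": (7, 7, 9),
    "Shiny for Python": (9, 9, 10),
}

criteria = ("Maintainability", "Readability", "Adherence")
# zip(*...) transposes the per-framework rows into per-criterion columns.
means = {name: mean(column) for name, column in zip(criteria, zip(*scores.values()))}

for name, value in means.items():
    print(f"{name}: {value:.2f}")
```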

Streamlit – Evaluation

I’ll analyze the app code and screenshots across the three criteria.

Criterion 1: Maintainability (Score: 8/10)

Strengths:

  • Code is well-organized with clear section comments using visual separators (# ──)
  • Logical flow: config → inputs → calculations → outputs → split feature
  • Session state is used appropriately for the tip percentage persistence
  • Calculations are isolated into simple, clear formulas
  • Good use of columns for layout consistency

Weaknesses:

  • The summary caption at the bottom references num_people which could cause a NameError if split_on is True but the variable is somehow not defined (minor scoping risk)
  • No functions/helpers — everything is inline, which makes future refactoring slightly harder
  • The “Splitting … among 2 people” bug (visible in after screenshot) indicates a display logic issue

Criterion 2: Readability (Score: 9/10)

Strengths:

  • Excellent inline comments with clear section headers
  • Descriptive variable names (bill_amount, tip_pct, per_person_total, etc.)
  • Follows idiomatic Streamlit patterns (session state, columns, metrics)
  • Help text on inputs adds to self-documentation
  • Logical top-to-bottom flow mirrors the UI layout

Weaknesses:

  • Minor: the session-state sync pattern is slightly confusing. The preset buttons set the value before the slider is created, the slider reads it, and the value is set again after the slider. This is understandable but not immediately obvious

Criterion 3: Requirement Adherence (Score: 8/10)

Strengths:

  • ✅ Input field for bill amount
  • ✅ 15%, 18%, 20% preset buttons are present
  • ✅ Custom tip slider also available
  • ✅ Instant calculation of tip amount and total bill displayed
  • ✅ Bill splitting feature implemented
  • ✅ No extra packages, CSS, or JavaScript used
  • ✅ Before screenshot confirms clean default state

Weaknesses:

  • Bug confirmed in after screenshot: “Splitting $102.60 among 2 people” is shown even though 3 people are entered (the number input shows 3 but the text says 2 — likely a Streamlit re-run/state timing issue where num_people lags one interaction)
  • The split toggle adds a small UX friction — could default to showing the split section more prominently
  • The “20% — Great” button in the after screenshot appears slightly highlighted but there’s no clear visual indicator of the selected preset

Plotly Dash – Evaluation

I’ll evaluate this Plotly Dash tip calculator app across the three criteria.


Criterion 1: Maintainability

Strengths:

  • Style constants are centralized at the top using named color variables (PRIMARY, CARD_BG, etc.), making global theme changes easy
  • The two callbacks are well-separated by concern (one for syncing presets, one for calculations)
  • The layout is logically organized into clearly named sections (Bill Amount, Tip Percentage, Results, Split)
  • Style dictionaries use composition ({**preset_btn_base, ...}) to reduce duplication
  • Good use of constants to avoid magic numbers/strings
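
The dictionary-composition pattern mentioned above can be illustrated without Dash itself. In this sketch the names PRIMARY, CARD_BG, and preset_btn_base come from the evaluation text, while the hex values, the style keys, and the two variant names are assumptions, not the generated app's actual styles.

```python
# Hypothetical style values; only the composition pattern is the point.
PRIMARY = "#2563eb"
CARD_BG = "#ffffff"

preset_btn_base = {
    "padding": "8px 16px",
    "borderRadius": "6px",
    "border": f"1px solid {PRIMARY}",
    "cursor": "pointer",
}

# Each variant unpacks the shared base, then overrides only what differs.
preset_btn_inactive = {**preset_btn_base, "backgroundColor": CARD_BG, "color": PRIMARY}
preset_btn_active = {**preset_btn_base, "backgroundColor": PRIMARY, "color": CARD_BG}
```

Unpacking creates a new dictionary each time, so the base style is never mutated and a change to it propagates to every variant.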

Weaknesses:

  • The large number of style dictionaries (15+) at the module level creates a long file that’s somewhat monolithic — a larger project would benefit from separating styles
  • The sync_preset callback imports ctx inside the function rather than at module level
  • Some mild logic duplication in the btn_style inner function

Score: 8/10


Criterion 2: Readability

Strengths:

  • Excellent use of section comments (# ── Bill Amount ──...) that visually separate layout blocks
  • Variable names are descriptive and self-documenting (bill_err, tip_amount, per_person_str)
  • The preset_map dictionary elegantly maps button IDs to values
  • The file-level comment “Tip Calculator — Plotly Dash App” and section headers guide readers well
  • Callback logic is clean and follows a clear validate-then-compute pattern
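
A framework-free sketch of the two ideas praised above, the preset_map lookup and the validate-then-compute flow, might look as follows. The button IDs, function names, and error messages are assumptions made for illustration; the generated app's callbacks are not reproduced here.

```python
# Hypothetical button IDs; the generated app's actual IDs are not shown here.
preset_map = {"btn-15": 15, "btn-18": 18, "btn-20": 20}


def resolve_tip_pct(triggered_id, current_pct):
    """Return the preset value for a clicked button, else keep the current value."""
    return preset_map.get(triggered_id, current_pct)


def compute(bill, tip_pct):
    """Validate inputs first; compute only when both are usable."""
    if bill is None or bill < 0:
        return "Enter a bill amount of 0 or more.", None
    if tip_pct is None or tip_pct < 0:
        return "Enter a tip percentage of 0 or more.", None
    tip_amount = round(bill * tip_pct / 100, 2)
    return None, (tip_amount, round(bill + tip_amount, 2))
```

Returning an (error, result) pair keeps the validation branches flat, so the arithmetic only ever runs on inputs that have already been checked.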

Weaknesses:

  • The density of style dictionaries near the top is visually overwhelming and may slow down reading the actual app logic
  • Minor: from dash import ctx inside the callback is unconventional and could confuse readers expecting all imports at the top

Score: 8/10


Criterion 3: Requirement Adherence

Strengths:

  • ✅ Input field for bill amount with proper number input
  • ✅ Input field for tip percentage (custom)
  • ✅ All three preset buttons (15%, 18%, 20%) present and functional
  • ✅ Active preset button highlights correctly (20% shown filled/blue in “after” screenshot)
  • ✅ Instant calculation — tip amount ($17.10) and total bill ($102.60) shown correctly for $85.50 @ 20%
  • ✅ Split among multiple people works correctly ($102.60 ÷ 3 = $34.20 confirmed in screenshot)
  • ✅ No external CSS files, JS files, or extra packages used — only Dash’s built-in inline styles
  • ✅ Input validation with error messages for edge cases
  • ✅ Clean, professional interface

Weaknesses:

  • Essentially none — all requirements are fully met and verified by screenshots

Score: 10/10


Panel – Evaluation

I’ll evaluate this Panel tip calculator app on three criteria.

Criterion 1: Maintainability (Score: 7/10)

Strengths:

  • The code is well-organized into logical sections (color palette, widgets, callbacks, result pane, layout) with clear section comments using ── separators
  • The _result_card function centralizes all calculation logic cleanly
  • The _update_presets helper avoids repetition across the three preset callbacks
  • The row() inner function inside _result_card is a nice reusable pattern for result rows
  • Color constants are defined at the top, making theme changes easy

Weaknesses:

  • The large color palette block (12+ constants) at the top is somewhat verbose and could be a dict or dataclass
  • Inline HTML strings with embedded f-strings for styling make future modifications tedious
  • section_label as a lambda is slightly unconventional (lambdas aren’t ideal for reusable UI components)
  • The layout nesting is fairly deep, which could complicate future restructuring
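
Two of the suggestions above, grouping the palette and replacing the lambda with a named function, can be sketched in plain Python. Everything here is hypothetical: the Palette fields, the hex values, and the HTML that section_label returns are assumptions, since the generated source is not shown in this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Palette:
    """Hypothetical theme container; field names and hex values are illustrative."""
    background: str = "#f8fafc"
    card: str = "#ffffff"
    accent: str = "#16a34a"
    text: str = "#0f172a"


PALETTE = Palette()


def section_label(text: str, palette: Palette = PALETTE) -> str:
    """Named function in place of a lambda; returns the label's HTML string."""
    return f'<span style="color:{palette.text};font-weight:600">{text}</span>'
```

A frozen dataclass gives attribute access and prevents accidental reassignment, and a named function can carry a docstring and type hints that a lambda cannot.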

Criterion 2: Readability (Score: 7/10)

Strengths:

  • Section headers with ── delimiters make it easy to navigate
  • Function names are descriptive (on_preset_15, _update_presets, _result_card)
  • The pn.bind reactive pattern is idiomatic Panel
  • Logic flow is easy to follow top-to-bottom

Weaknesses:

  • No docstrings or comments explaining the calculation logic (though it’s simple math)
  • Heavy use of inline HTML/CSS strings within Python reduces readability
  • The _result_card function is fairly long and mixes layout construction with business logic
  • Color constant names like CLR_PRESET_TXT_ON and CLR_PRESET_TXT_OFF are defined but never actually used in the code

Criterion 3: Requirement Adherence (Score: 9/10)

Strengths:

  • ✅ Input field for bill amount (FloatInput)
  • ✅ Tip percentage input with slider
  • ✅ All three presets (15%, 18%, 20%) implemented as buttons with visual highlighting
  • ✅ Instant calculation via pn.bind — tip amount and total bill displayed reactively
  • ✅ Bill splitting feature works correctly (÷3 shown in after screenshot)
  • ✅ Screenshots confirm correct math: $85.50 × 20% = $17.10 tip, $102.60 total, $34.20 per person
  • ✅ No extra packages beyond Panel used
  • ✅ Preset button highlighting updates correctly (20% shown as green in after screenshot)

Minor issues:

  • Uses custom CSS via styles= dict and inline HTML for styling — this is borderline given the “no extra CSS” requirement, though these are Panel-native mechanisms rather than external stylesheets
  • Four unused color constants (CLR_PRESET_ON, CLR_PRESET_OFF, CLR_PRESET_TXT_ON, CLR_PRESET_TXT_OFF) defined but not referenced

Shiny for Python – Evaluation

I’ll evaluate this Shiny for Python tip calculator app on the three criteria.

Criterion 1: Maintainability (Score: 9/10)

Strengths:

  • Well-structured separation between UI and server logic with clear section comments
  • Single calc() reactive that centralizes all computation — easy to modify formulas in one place
  • Logical grouping of UI elements into cards (inputs, results, split)
  • Preset button handlers are clean and consistent (three separate @reactive.effect handlers)
  • Good use of max(int(...), 1) guard to prevent division by zero
  • Named outputs are clearly mapped to their display elements

Minor concerns:

  • The three preset handlers are slightly repetitive; a helper function could reduce boilerplate, but at this scale it’s acceptable
  • The calc() dictionary approach works but a dataclass/namedtuple would be more maintainable at larger scale
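
The division-by-zero guard and the suggested move from a dictionary to a typed result can be combined in one small sketch. It is written as a plain function rather than a Shiny reactive, and the TipResult type and its field names are assumptions loosely based on the variable names mentioned in this evaluation.

```python
from typing import NamedTuple


class TipResult(NamedTuple):
    """Hypothetical typed replacement for a results dictionary."""
    tip: float
    total: float
    per_person_total: float
    per_person_tip: float


def calc(bill: float, tip_pct: float, num_people: float) -> TipResult:
    # Clamp the head count to at least 1 so the divisions below cannot hit zero.
    people = max(int(num_people), 1)
    tip = bill * tip_pct / 100
    total = bill + tip
    return TipResult(tip, total, total / people, tip / people)
```

With a NamedTuple, a misspelled field fails loudly as an AttributeError and is visible to type checkers, whereas a misspelled dictionary key only surfaces at runtime as a KeyError.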

Criterion 2: Readability (Score: 9/10)

Strengths:

  • Excellent use of section comments (# ── Header ──, # ── Server ──, etc.) for visual scanning
  • Variable names (bill, tip_pct, num_people, per_person_tip) are highly descriptive
  • Follows idiomatic Shiny for Python patterns (@reactive.calc, @render.text, @reactive.effect)
  • The UI structure mirrors what the user sees, making it easy to map code to visual output
  • Arithmetic is straightforward and easy to follow

Minor concerns:

  • _apply_15, _apply_18, _apply_20 private naming convention is fine but the underscore prefix may confuse some readers
  • A brief comment explaining the or 0.0 / or 1 fallback pattern would help
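
The fallback pattern mentioned in the last point could be documented along these lines. This sketch assumes that an empty numeric input arrives as None; the function and parameter names are hypothetical.

```python
def coerce_inputs(raw_bill, raw_people):
    """Replace empty (None) inputs with safe defaults before any arithmetic."""
    bill = raw_bill or 0.0    # None or 0 both become 0.0
    people = raw_people or 1  # None or 0 both become 1
    return bill, people
```

Note that `or` treats every falsy value as missing, so an explicit 0 people is also replaced by 1. That is convenient here, but it is exactly the kind of behavior a short comment should spell out.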

Criterion 3: Requirement Adherence (Score: 10/10)

Strengths:

  • ✅ Input field for bill amount (numeric, default $50, step $0.01)
  • ✅ Input field for tip percentage (numeric, editable)
  • ✅ All three presets present: 15%, 18%, 20% as action buttons
  • ✅ Instant calculation (reactive, no submit button needed)
  • ✅ Displays tip amount ($17.10 for $85.50 × 20% ✓)
  • ✅ Displays total bill ($102.60 = $85.50 + $17.10 ✓)
  • ✅ Bill splitting among N people ($34.20 = $102.60 ÷ 3 ✓)
  • ✅ No external CSS, JavaScript, or extra packages used
  • ✅ Screenshots confirm correct calculations for the test case
  • The bonus “split tip only” per person is a nice addition

All calculations verified against the after screenshot are mathematically correct.