Project 05 / 05 · American Airlines · 2022–Present

A 30-year refund mainframe, retired with zero downtime

American's passenger refund system had run on the same IBM mainframe for thirty years. It was the system of record for every fare reversal, tax adjustment, weather-cancellation refund and partial-use coupon the airline processed. Two converging deadlines forced its retirement at once: a Department of Transportation mandate to automate passenger refunds, and IBM's exit from its cloud business, which was retiring the platform underneath. The hard part wasn't shipping new code; it was shipping it under a regulatory clock while the mainframe kept processing live refunds throughout the migration window. We rebuilt the whole platform as twenty event-driven services on Azure with a rules engine the business team can maintain, ran both systems in parallel until a reconciliation layer showed sustained zero divergence, and hit 93% refund automation on day one, ahead of the internal plan.

Role: Senior Full-Stack Engineer
Dates: Nov 2022 – Present
Location: Fort Worth, TX
Domain: Airline / regulated refunds
  • React
  • Redux Toolkit
  • TanStack Query
  • React Hook Form
  • Drools
  • Azure
At a glance
Mainframe retired
30 yrs

Airline's passenger refund system, end-of-life on a federal regulatory deadline.

Event-driven
20 svcs

Independent services on Azure Service Bus — topics for fan-out, queues for guaranteed delivery.

Cutover
Zero

Downtime + refund discrepancies through the migration window. Parallel-run with reconciliation.

Day-one automation
93%

Rule packages tuned against historical refund data in the final weeks before launch.

The interface

Illustrative recreation · operable

An interactive recreation of the review console — fictional record locators and fictional refund data, built to convey the UI and the engineering decisions, not the shipped source. Watch a flagged request move through review: a quality rule fails, the reviewer corrects the field, the refund releases, and downstream post-processing fires.

param.dev/american-airlines/refund-audit

Real-time status across the review queue

Refund status updates stream over the message bus so the reviewer's view stays current as requests move through correction, release and downstream post-processing — no refresh.

Server state, audit state, and form state kept apart

TanStack Query owns server state (refund records, live status). Redux Toolkit owns client-side audit state for the review in progress. React Hook Form owns the multi-field correction form. Each tool does one job.

Rules engine the business team owns

Channel-specific rule packages and an 11-rule quality controller — defined as decision tables, not code — so refund policy changes ship through a business workflow, not a release train.

Recreation · faithful to the original UI behavior, not the shipped source.

The cutover

Two systems, one window

IBM mainframe · 30 years · system of record
  • Batch-oriented refund processing on a single monolith
  • Any policy change meant a mainframe release cycle
  • End-of-life: the platform retiring under a federal automation deadline

Parallel run · Both systems processed live refunds at once until a reconciliation layer held divergence at zero, and only then did we cut over. Zero downtime, zero discrepancies.

The problem

Retiring a 30-year-old mainframe on a federal deadline

The passenger refund system was the system of record for every refund the airline processed — fare reversals, taxes, baggage charges, partial-use coupons, weather cancellations, involuntary cancellations during irregular operations. It ran on an IBM mainframe that had been in production for three decades. Refunds are a regulated financial workflow: access controls, document retention, and reconciliation are all audit-scoped.

The deadline was non-negotiable on two axes. A federal regulatory mandate required automated passenger refunds by October 2025. At the same time, IBM was exiting its cloud business and retiring the platform underneath the existing system. The migration window could not include a single minute of downtime — the mainframe was processing live customer refunds throughout, and any visible drop in availability would have been a regulator finding.

The rebuild is a twenty-service event-driven platform on Azure with a business-team-maintainable rules engine, document-store persistence with full event history, end-to-end observability, and a parallel-run cutover that reconciled outputs against the mainframe in real time before takeover. The review console in the demo above is one piece of the surface: it's where reviewers land for the requests the rule engine deliberately stops on.

The mainframe was processing live refunds throughout the migration window. There was no slot where refund processing could stop.
The constraint that defined the build

Key decisions

Three architectural calls behind the rebuild

Migrating off a 30-year mainframe under a federal deadline with zero allowed downtime gives every architectural decision the same test: does this hold up while the old system is still running, and can the business team keep it running after we ship?

01 · Decision

Event-driven on Azure Service Bus, not synchronous orchestration

Twenty independent services, each owning one step in the refund lifecycle — request validation, eligibility orchestration, rule-based determination, persistence, post-processing, quality control, payment handoff. Topics and subscriptions fan out the non-blocking events across consumers; queues carry the guaranteed-delivery handoffs (payment instructions, regulator-facing audit writes) where a lost message would be a compliance finding. Native dead-letter queues, duplicate detection, and replay are in the substrate, not bolted on.

The catch · Twenty deployable services is more operational surface than a single application. In exchange, services scale, redeploy and recover independently, so one slow consumer never blocks the rest of the pipeline.
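
To make the topic-versus-queue split concrete, here is a minimal in-memory sketch of the two delivery shapes. It is not the Azure Service Bus SDK and not the shipped code: `InMemoryTopic`, `InMemoryQueue`, the retry count and the message shape are all hypothetical names chosen for illustration. It only models the behaviors named above: fan-out to every subscription, duplicate detection by message id, and dead-lettering after repeated failure.

```typescript
// In-memory model of the two delivery shapes. This is NOT the Azure Service
// Bus SDK; every class and field name below is hypothetical.
type BusMessage = { id: string; body: string };
type Handler = (msg: BusMessage) => void;
type SendResult = "delivered" | "duplicate" | "dead-lettered";

// Topic: every subscription gets its own copy of each event (fan-out).
class InMemoryTopic {
  private subscriptions = new Map<string, Handler>();

  subscribe(name: string, handler: Handler): void {
    this.subscriptions.set(name, handler);
  }

  publish(msg: BusMessage): void {
    this.subscriptions.forEach((handler) => handler(msg));
  }
}

// Queue: a single consumer with duplicate detection by message id and
// bounded retries; a message that keeps failing is parked, not dropped.
class InMemoryQueue {
  readonly deadLetters: BusMessage[] = [];
  private seen = new Set<string>();
  private handler: Handler;
  private maxAttempts: number;

  constructor(handler: Handler, maxAttempts = 3) {
    this.handler = handler;
    this.maxAttempts = maxAttempts;
  }

  send(msg: BusMessage): SendResult {
    if (this.seen.has(msg.id)) return "duplicate";
    this.seen.add(msg.id);
    for (let attempt = 1; attempt <= this.maxAttempts; attempt++) {
      try {
        this.handler(msg);
        return "delivered";
      } catch {
        // Swallow the failure and retry until maxAttempts is reached.
      }
    }
    this.deadLetters.push(msg);
    return "dead-lettered";
  }
}
```

In this model a payment instruction would go through the queue, where a failure ends up in `deadLetters` for inspection, while a status notification would go through the topic, where each subscriber handles its own copy.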

02 · Decision

Rules engine the business team owns, with BDD specs gating change

Refund policy changes monthly with regulatory and business needs — fare structures, channel-specific endorsement restrictions, weather-cancellation overrides. The eligibility determiner runs on a Java-based rules engine with channel-specific rule packages (web, mobile, agency) defined as decision tables and rule files — auditable artifacts the business team can adjust without an engineering deploy. The business team co-authored Cucumber scenarios before any rule shipped, so every rule change is regression-tested against business intent.

The catch · A rules engine to host, version and explain to a non-engineering audience. In exchange, monthly policy changes ship through a business workflow, not a release train, and the rules themselves become first-class compliance artifacts.
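
The idea of policy as a decision table rather than code can be sketched in a few lines. This is an illustration in TypeScript, not the Java rules engine the platform runs on, and the rows below are invented for the example; they are not American's refund policy. The assumption is a first-match table per channel, where a request that matches no row stops for a human.

```typescript
type Channel = "web" | "mobile" | "agency";
type Outcome = "AUTO_REFUND" | "MANUAL_REVIEW" | "DENY";

// Hypothetical facts a rule can test; the real enrichment is much richer.
interface Conditions {
  fareRefundable: boolean;
  involuntaryCancellation: boolean;
}

interface RefundRequest extends Conditions {
  channel: Channel;
}

// One row of a decision table: omitted conditions act as wildcards.
interface DecisionRow {
  when: Partial<Conditions>;
  then: Outcome;
}

// Hypothetical rule packages, one table per channel, evaluated top-down.
const rulePackages: Record<Channel, DecisionRow[]> = {
  web: [
    { when: { involuntaryCancellation: true }, then: "AUTO_REFUND" },
    { when: { fareRefundable: true }, then: "AUTO_REFUND" },
    { when: { fareRefundable: false }, then: "DENY" },
  ],
  mobile: [
    { when: { involuntaryCancellation: true }, then: "AUTO_REFUND" },
    { when: { fareRefundable: true }, then: "AUTO_REFUND" },
  ],
  agency: [{ when: { involuntaryCancellation: true }, then: "AUTO_REFUND" }],
};

function decide(request: RefundRequest): Outcome {
  for (const row of rulePackages[request.channel]) {
    const keys = Object.keys(row.when) as (keyof Conditions)[];
    if (keys.every((key) => row.when[key] === request[key])) return row.then;
  }
  return "MANUAL_REVIEW"; // No row matched: stop for a human.
}
```

Because the table is data, changing policy in this sketch means editing a row rather than editing `decide`, which is the property that lets a business workflow own the change.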

03 · Decision

Parallel-run cutover with output reconciliation

Both the mainframe and the new platform processed live refunds simultaneously through the migration window. A reconciliation layer compared outputs in real time and surfaced any divergence for engineering review before either system released downstream. We didn't flip the switch until reconciliation showed sustained zero divergence on production traffic — slow in the moment, but the only safe play for an active regulated financial system.

The catch · A longer migration window and the cost of running two systems in parallel. In exchange: zero downtime, zero refund discrepancies, and a defensible audit trail showing the new system was validated against the old before takeover.
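
A reconciliation pass of this kind can be sketched as a comparison keyed by request id. This is a simplified illustration, not the shipped layer: the `RefundOutput` fields, the divergence reasons and the batch-style comparison are assumptions, and the real layer compared outputs in real time.

```typescript
// Hypothetical output shape; the real records carry far more fields.
interface RefundOutput {
  requestId: string;
  amountCents: number;
  status: string;
}

interface Divergence {
  requestId: string;
  reason: "missing-in-new" | "missing-in-legacy" | "amount" | "status";
}

// Compare what the legacy and new systems produced for the same requests.
// An empty result means zero divergence for this batch.
function reconcile(legacy: RefundOutput[], modern: RefundOutput[]): Divergence[] {
  const modernById = new Map<string, RefundOutput>();
  modern.forEach((output) => modernById.set(output.requestId, output));

  const legacyIds = new Set<string>();
  const divergences: Divergence[] = [];

  for (const old of legacy) {
    legacyIds.add(old.requestId);
    const next = modernById.get(old.requestId);
    if (!next) {
      divergences.push({ requestId: old.requestId, reason: "missing-in-new" });
    } else if (next.amountCents !== old.amountCents) {
      divergences.push({ requestId: old.requestId, reason: "amount" });
    } else if (next.status !== old.status) {
      divergences.push({ requestId: old.requestId, reason: "status" });
    }
  }

  for (const next of modern) {
    if (!legacyIds.has(next.requestId)) {
      divergences.push({ requestId: next.requestId, reason: "missing-in-legacy" });
    }
  }
  return divergences;
}
```

Amounts are held as integer cents in the sketch so that equality checks are exact; comparing floating-point currency values would risk false divergences.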

What I built

The rule engine, the review console, and the cutover

I own the rule engine end to end: the validation service that checks request shape and integrity; the eligibility orchestrator that gathers enrichment (fare rules, endorsement restrictions, agency codes) before forwarding to the determiner; the determiner itself, with channel-specific rule packages for web, mobile and agency flows, defined as decision tables the business team maintains; and the eleven-rule quality controller that gates auto-clear versus the manual review queue, with checks like whether the refund amount exceeds the original sale, whether it's being issued via paper check, and whether it crosses thresholds requiring secondary review.
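
The gate can be illustrated with the three checks named above. This is a sketch, not the shipped controller: the rule ids, the `Refund` shape and the threshold value are hypothetical, and only three of the eleven rules are shown.

```typescript
// Hypothetical refund shape, limited to the fields these three checks need.
interface Refund {
  refundAmountCents: number;
  originalSaleCents: number;
  paymentMethod: "card" | "paper-check" | "voucher";
}

interface QualityRule {
  id: string;
  fails: (refund: Refund) => boolean;
}

interface QualityResult {
  route: "AUTO_CLEAR" | "MANUAL_REVIEW";
  failedRules: string[];
}

// Illustrative value only; the real threshold is not stated in this write-up.
const SECONDARY_REVIEW_THRESHOLD_CENTS = 100000;

// Three of the eleven checks, with made-up ids.
const qualityRules: QualityRule[] = [
  {
    id: "amount-exceeds-original-sale",
    fails: (r) => r.refundAmountCents > r.originalSaleCents,
  },
  {
    id: "issued-by-paper-check",
    fails: (r) => r.paymentMethod === "paper-check",
  },
  {
    id: "above-secondary-review-threshold",
    fails: (r) => r.refundAmountCents > SECONDARY_REVIEW_THRESHOLD_CENTS,
  },
];

// Any failed rule sends the refund to the manual review queue, along with
// the ids of the rules that stopped it.
function runQualityControl(refund: Refund, rules: QualityRule[] = qualityRules): QualityResult {
  const failedRules = rules.filter((rule) => rule.fails(refund)).map((rule) => rule.id);
  return {
    route: failedRules.length === 0 ? "AUTO_CLEAR" : "MANUAL_REVIEW",
    failedRules,
  };
}
```

Returning the failed rule ids, rather than a bare pass or fail, is what would let a review console show the reviewer which field needs correcting.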

On the front end I built and currently maintain the React review console for the requests automation deliberately stops on. Redux Toolkit owns client-side audit state, TanStack Query owns server state, TanStack Router gives each view a typed route, React Hook Form owns the multi-field correction forms. Real-time updates stream over the message bus so the reviewer's view stays current as requests move through release and downstream post-processing without a refresh. The customer-facing Angular refund workflow UI ships alongside (standalone components, RxJS, OnPush change detection).
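
One way to keep a live queue view consistent as status events stream in is a pure update function that ignores stale or duplicate events. The sketch below assumes each event carries a per-request `sequence` number, which is my assumption and not something the original system is documented to do; the status names and types are likewise hypothetical, and no React or state library is involved.

```typescript
// Hypothetical lifecycle states, loosely following the flow described above.
type RefundStatus = "FLAGGED" | "IN_CORRECTION" | "RELEASED" | "POST_PROCESSED";

interface StatusEvent {
  requestId: string;
  status: RefundStatus;
  sequence: number; // Assumed monotonically increasing per request.
}

interface QueueRow {
  status: RefundStatus;
  sequence: number;
}

type QueueView = Readonly<Record<string, QueueRow>>;

// Returns a new view with the event applied, or the same view object when
// the event is older than (or a repeat of) what is already displayed.
function applyStatusEvent(view: QueueView, event: StatusEvent): QueueView {
  const current = view[event.requestId];
  if (current && current.sequence >= event.sequence) return view;
  return {
    ...view,
    [event.requestId]: { status: event.status, sequence: event.sequence },
  };
}
```

Returning the same object for ignored events means a UI layer that compares by reference can skip re-rendering when nothing changed.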

Tests were BDD-first with Cucumber on the rule engine, with scenarios co-authored by the business team before any rule shipped. The observability stack the team still relies on for on-call covers throughput and percentile-latency dashboards, distributed tracing across services, and log aggregation. The cutover ran in parallel, with a reconciliation layer comparing outputs in real time; we didn't switch over until divergence held at zero on production traffic. Zero downtime, zero refund discrepancies through the migration window.
