Case Study — Software Testing & Quality Engineering

Tartan
Smart Home

Java JUnit JaCoCo SpotBugs PMD ErrorProne SonarQube GitHub Actions MySQL Docker

A production-grade testing pipeline

Over five milestones in CMPUT 402 (Software Quality), our team hardened the open-source Tartan Smart Home system, a rules-based home automation platform, through unit testing, coverage analysis, live A/B experimentation, static analysis, and technical debt auditing. I was responsible for test authoring, coverage instrumentation, bug discovery and fix documentation, and the A/B experiment infrastructure.

Role
Test Engineer; unit tests, coverage analysis, bug fixes, CI integration
Team
4 developers
Deliverables
Testing Tartan Smart Home; Java/Dropwizard rules engine for IoT home automation
79%
Line Coverage achieved

Starting from untested code, our black-box and white-box test suite reached 79% line coverage and 66% branch coverage on the core rules evaluator.

1
Logic bug discovered & fixed

White-box analysis via JaCoCo uncovered a dead else if branch in the lighting rule, a latent bug that would have silently misfired in production.

3→5
Days of tech debt discovered

SonarQube tracked how our feature additions grew technical debt from 3d 6h (baseline) to 5d 1h, quantifying the cost of moving fast.

A/B
Live experiment shipped

Built and ran a real A/B experiment in production Docker containers showing that cost-framed reports reduced light usage more than time-framed ones.

From zero coverage to 79%

The system enforces five home-automation rules. My job was to break them, then prove they were fixed. Each rule got a black-box test suite first (what should happen), then white-box analysis via JaCoCo to find the branches our tests couldn't reach.

Coverage results
  • Instruction Coverage: 77%
  • Branch Coverage: 66%
  • Line Coverage: 79%

Rules tested & bugs found

Each rule was tested with a named suite (R1A, R1B…). JaCoCo reports drove targeted additions: tests weren't added to inflate numbers, but to hit real uncovered branches.

R1: Vacancy & Lights

Bug found: lights couldn't be turned off at home

The system auto-forced lights on when occupied, preventing residents from switching them off at night. Fix: lights only auto-on when occupant first enters (door-open event), not any time the house is occupied.
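The before/after behaviour can be sketched in miniature. This is an illustrative reconstruction, not the actual Tartan code: the names (HouseState, Event, LightRule) and the event model are assumptions made to show the shape of the fix.

```java
// Hypothetical sketch of the R1 fix — class and event names are illustrative.
enum Event { DOOR_OPEN, LIGHT_OFF_REQUEST, NONE }

class HouseState {
    boolean occupied;
    boolean lightOn;
}

class LightRule {
    // Before the fix: lights were forced on whenever the house was occupied,
    // so a manual light-off request was immediately overridden.
    static void applyBuggy(HouseState s, Event e) {
        if (e == Event.LIGHT_OFF_REQUEST) s.lightOn = false;
        if (s.occupied) s.lightOn = true; // clobbers the manual off
    }

    // After the fix: lights auto-on only on the door-open (entry) event,
    // so residents can switch them off while the house stays occupied.
    static void applyFixed(HouseState s, Event e) {
        if (e == Event.DOOR_OPEN) {
            s.occupied = true;
            s.lightOn = true;
        } else if (e == Event.LIGHT_OFF_REQUEST) {
            s.lightOn = false; // respected even while occupied
        }
    }
}
```

The key change is triggering on the entry event rather than the occupied state, which is exactly the distinction the black-box suite could then assert.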

R2: Alarm & Door

Edge case: alarm with occupant still present

JaCoCo flagged an uncovered branch where an alarm triggered while the occupant was still preparing to leave. Added R2E to cover the "door open, alarm enabled, house still occupied" state.

R3: Vacancy & Door

Dead code found in door-close logic

Two branches flagged: the system attempted to close an already-closed door, and the away-timer deactivation path was unreachable. Added R3D and R3E to expose both states.

R1: White Box

Dead else if

JaCoCo showed that else if (lightState) never fired: it was logically equivalent to the preceding if, making the second block unreachable. Replaced it with a plain else. Without coverage tooling, this would have shipped.
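The pattern looks like this. A minimal sketch of the dead-branch shape, with assumed method names, not the repository's actual code:

```java
// Illustrative reconstruction of the dead else-if pattern JaCoCo exposed.
class LightToggle {
    static String buggyToggle(boolean lightState) {
        if (lightState) {
            return "turn off";
        } else if (lightState) { // dead: only reached when lightState is false
            return "unreachable";
        }
        return "turn on";
    }

    // Fix: a plain else, which coverage then reports as fully exercised.
    static String fixedToggle(boolean lightState) {
        if (lightState) {
            return "turn off";
        } else {
            return "turn on";
        }
    }
}
```

The compiler accepts the buggy version without complaint; only branch coverage makes the unreachable block visible.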

Testing approach by milestone

G1

Unit Tests

  • Wrote named JUnit tests for each system rule using only the public interface.
  • Set up GitHub Actions CI so every branch ran the test suite automatically.
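The CI step can be sketched as a minimal GitHub Actions workflow. The file layout, Java version, and Gradle wrapper here are assumptions, not the project's actual configuration:

```yaml
# Illustrative workflow sketch — the real project's build tool may differ.
name: test
on: [push]
jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: '17'
      - name: Run JUnit suite with coverage
        run: ./gradlew test jacocoTestReport
```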
G2 Coverage
  • Instrumented the build with JaCoCo to generate branch-level coverage reports.
  • Used white-box analysis to discover and fix the dead else if branch in the lighting rule.
G3 In Production
  • Implemented usage tracking (light minutes, HVAC uptime) and a group-based reporting system.
  • A/B experiment shipped in live Docker containers
G4 Static Analysis
  • SpotBugs + PMD + ErrorProne — 280+ issues surfaced
  • Ran three static analysers and cross-referenced their output to identify overlaps, unique catches, and false positives.
  • Identified the 10 highest-priority issues by category and actionability.
G5 Tech Debt
  • Ran SonarQube on both the baseline and our modified branch.
  • Quantified the codebase's technical debt growth
  • Proposed a concrete CI pipeline upgrade

Does framing change behaviour?

We instrumented the running system to track real usage, then split houses into groups receiving different feedback. A genuine A/B test, not a simulation. The question: does seeing your energy bill change how you use the smart home?

Control Group

Neutral baseline report

Houses in Group 1 received general energy usage summaries, total minutes the lights and HVAC ran, without cost framing or behavioural prompts.

Experimental Group

Cost-projected feedback

Houses in Group 2 received the same data translated into estimated electricity costs. Same underlying numbers, different framing, designed to create personal stakes by quantifying usage in dollars.
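The translation itself is simple arithmetic. A minimal sketch, where the bulb wattage and tariff are illustrative assumptions, not the experiment's actual constants:

```java
// Sketch of turning raw light-on minutes into a cost-framed figure.
class CostReport {
    static final double LIGHT_WATTS = 60.0;  // assumed bulb draw
    static final double RATE_PER_KWH = 0.15; // assumed $/kWh tariff

    static double estimatedCost(long lightMinutes) {
        double kwh = (LIGHT_WATTS / 1000.0) * (lightMinutes / 60.0);
        return kwh * RATE_PER_KWH;
    }
}
```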

Infrastructure

The experiment was controlled via a groupExperiment flag set randomly per house in config.docker.yml. Every interaction was group-tagged. Switching a house's treatment required a one-line config change and no code deploy, which simplified rollbacks.
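The flag's shape in config.docker.yml would look something like this. A hypothetical fragment, since the surrounding keys and the exact value encoding aren't shown here:

```yaml
# Hypothetical per-house entry in config.docker.yml.
groupExperiment: 2   # 1 = control (neutral report), 2 = cost-framed report
```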

Results were queried directly from MySQL using time buckets, then cleaned in Excel for charting. The data showed cost framing reduced average light usage, though HVAC usage was less responsive to either report type, likely because HVAC behaviour is more habitual and less in-the-moment.
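The time-bucket idea is just grouping event timestamps into fixed windows and summing usage per window. A sketch in Java with assumed names — the real aggregation ran as SQL against MySQL:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Map;
import java.util.TreeMap;

// Illustrative time-bucketing: sum per-event minutes into hourly buckets.
class UsageBuckets {
    static Map<Instant, Long> bucketByHour(Map<Instant, Long> minutesByEvent) {
        Map<Instant, Long> buckets = new TreeMap<>();
        minutesByEvent.forEach((ts, minutes) ->
            buckets.merge(ts.truncatedTo(ChronoUnit.HOURS), minutes, Long::sum));
        return buckets;
    }
}
```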

Light usage - cost group

Cost-framed feedback correlated with measurably lower light-on minutes compared to the neutral baseline group.

HVAC usage - both groups

HVAC behaviour was similar across groups, suggesting habitual use is less responsive to feedback framing than discretionary actions like lighting.

Three tools, 280+ issues

Running SpotBugs, PMD, and ErrorProne against the same codebase revealed how different tools catch different things and how their outputs often overlap, conflict, or mislead without a centralised view.

SpotBugs
42
warnings, 0 errors
Bad Practice (15) Malicious Code (14) Dodgy Code (4) Performance (3) Correctness (2)
PMD
170
errors, 0 warnings
ControlStatementBraces (20) NamingConventions EmptyCatchBlock LooseCoupling UnusedLocalVariable
ErrorProne
69
1 error, 68 warnings
JdkObsolete SynchronizeOnNonFinalField UnusedVariable EmptyCatch EqualsHashCode
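The overlap between tools is easiest to see on a concrete pattern: an empty catch block is flagged by both PMD (EmptyCatchBlock) and ErrorProne (EmptyCatch). An illustrative instance, not code from the Tartan repo:

```java
// The pattern both PMD and ErrorProne flag, and one deliberate fix.
class SensorParser {
    // Flagged: the exception is swallowed, so bad input fails silently.
    static int readingBad(String raw) {
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
        }
        return -1;
    }

    // Fixed: the failure is handled explicitly and the fallback is deliberate.
    static int readingFixed(String raw, int fallback) {
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            System.err.println("unparseable sensor reading: " + raw);
            return fallback;
        }
    }
}
```

Cross-referencing hits like this one was how we separated genuine overlap from each tool's unique catches.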

Technical Debt - Baseline vs. Our Branch

SonarQube gave us what the individual tools couldn't: a single number for the cost of our decisions. Adding features to the system grew technical debt by nearly two full working days.

Baseline (original Tartan code)
3d 6h
185 maintainability issues · SonarQube quality gate: ✓ pass
Main branch (with our features)
5d 1h
313 issues (+128) · 1 blocker added · Gate: ✓ still passes

Proposed pipeline improvements

The main takeaway: three separate tools with separate outputs is harder to act on than one consolidated view. Our recommendation was to integrate SonarQube directly into GitHub Actions, retire PMD (too much overlap, harder to parse), keep SpotBugs and ErrorProne for build-time feedback, and set strict security gate thresholds so debt accumulation is caught before merge, not after.
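The proposed gate could look something like the job below added to the existing GitHub Actions setup. Action versions and secret names are assumptions, not the team's actual configuration:

```yaml
# Sketch of the proposed SonarQube gate in CI.
name: quality-gate
on: [pull_request]
jobs:
  sonarqube:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history for accurate debt attribution
      - uses: SonarSource/sonarqube-scan-action@v4
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
          SONAR_HOST_URL: ${{ secrets.SONAR_HOST_URL }}
      - name: Fail the build if the quality gate fails
        uses: SonarSource/sonarqube-quality-gate-action@master
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
```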

What I actually learned

This wasn't a toy testing exercise: it was work on a real codebase with real bugs, real coverage gaps, and a real A/B experiment running in production containers. The gap between "tests pass" and "software is correct" became very concrete.

Coverage numbers lie without interpretation

77% instruction coverage sounds good. But a 66% branch rate meant there were whole decision paths we couldn't verify. JaCoCo didn't just tell me what was covered; it told me where to look next.

Static analysis tools disagree on purpose

PMD found 170 issues. SpotBugs found 42. They overlap on some things and diverge wildly on others. Running all three and cross-referencing taught me that no single tool has the full picture, and false positives are part of the work.

Experiment infrastructure matters as much as the experiment

The A/B feature was only as useful as the config flag that controlled it. Designing the groupExperiment propagation so it touched the DB schema, the service layer, and the UI in one pass meant we could change treatments without redeploying.

Technical debt is a real unit of time

Seeing "1 day 19 hours of new debt" attached to our pull requests made abstract quality concerns concrete. SonarQube's debt estimation isn't perfect, but it's a far better conversation starter than "this code is messy."