Internship at the GoA Ministry of Environment and Protected Areas - Software and Data Engineering
01 — Overview
In support of Alberta's TIER regulation, I designed and maintained a Django-based data platform used by multiple applications across departments, enabling scalable data storage and streamlined workflows.
As a result, a fragmented, manual data ecosystem was transformed into a unified, automated platform. Data from 300+ sources was centralized into a single source of truth, workflows became significantly faster, and system reliability improved through robust validation and testing. The platform enabled scalable collaboration across departments while reducing operational overhead and manual intervention.
Reduced manual workflows by ~7.5x and accelerated data access by ~2x through automation, improved UI flows, and advanced Excel ingestion.
Built ETL pipelines to transform Excel data into structured JSON and TOML formats, standardizing over 300 profiles spanning 2018–2024 while ensuring future regulatory compatibility.
Developed a Leaflet-based mapping system to automate location visualization across ~600 sites, eliminating manual Google Earth workflows.
02 — Process
Environmental reporting data arrived as Excel files. Hundreds of sources, six-plus years of submissions, each with slight variations in schema, column names, and metadata. The real issue wasn't volume. It was the absence of structure. Manual cleaning didn't scale, and any one-off fix would break with the next regulatory change or form revision.
Older metrics lived in separate sheets, vulnerable to being overwritten or lost. Forms were clearly versioned, but the fragmented file-per-submission structure made cross-facility comparison and aggregation unreliable. All the data existed; it just wasn't in a shape that downstream systems could trust.
TIER compliance requires accurate, comparable emissions data across facilities. Inconsistent source formats were directly obstructing the government's ability to track trends, enforce compliance, and adapt reporting requirements over time.
I designed a flexible ETL pipeline in Python, using OpenPyXL, PyWin32, YAML, and VBA, built around one principle: schema-tolerance. Rather than hardcoding transformations for each form version, the pipeline adapts to minor structural variations without breaking.
Ingested heterogeneous Excel formats and normalized them into a consistent schema regardless of source version.
Dynamically identified and standardized metadata fields — facility IDs, reporting periods, units — across all form variants.
Invoked embedded VBA scripts (via PyWin32) so emissions calculations ran correctly before extraction, preserving formula logic without manual intervention.
Converted results into structured JSON/TOML, ready for SQL ingestion and downstream analysis pipelines.
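A minimal sketch of the schema-tolerance idea: an alias table maps the many header spellings seen across form versions onto one canonical schema, and unrecognized columns are preserved rather than rejected. The field names and aliases below are illustrative stand-ins, not the actual TIER form headers, and the OpenPyXL loading step is omitted.

```python
import json

# Hypothetical alias table: each header spelling seen across form versions
# maps onto one canonical field name. (Illustrative, not real TIER headers.)
FIELD_ALIASES = {
    "facility id": "facility_id",
    "facility_id": "facility_id",
    "reporting period": "reporting_period",
    "period": "reporting_period",
    "total co2e": "co2e_tonnes",
    "co2e (t)": "co2e_tonnes",
}

def normalize_row(raw_row):
    """Map one raw header->value dict onto the canonical schema.

    Unknown columns land under 'extras' instead of raising, so a new
    form revision degrades gracefully rather than breaking ingestion.
    """
    out, extras = {}, {}
    for header, value in raw_row.items():
        canonical = FIELD_ALIASES.get(header.strip().lower())
        if canonical:
            out[canonical] = value
        else:
            extras[header] = value
    if extras:
        out["extras"] = extras
    return out

# Two submissions with different header conventions normalize identically.
print(json.dumps(normalize_row({"Facility ID": "F-0042", "Period": "2019", "Total CO2e": 1280.5})))
print(json.dumps(normalize_row({"facility_id": "F-0042", "Reporting Period": "2023", "CO2e (t)": 1145.0})))
```

Because unknown headers are carried along rather than dropped, a future form revision surfaces as new `extras` keys to review instead of a pipeline failure.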
Manual preprocessing effort dropped significantly. Ingestion workflows are now future-proofed against new form versions, and data is consistently available for analysis, automation, and regulatory calculation — without a human in the loop for each new submission cycle.
Engineering teams were manually comparing KML files side-by-side to check coordinate consistency across ~600 facilities and ~5,700 files. The bottleneck wasn't access to the data — it was that the process couldn't run reliably or repeatably without a human in the loop. Any inconsistency required visual inspection rather than automated detection.
I built a KML-based geospatial visualization tool using Leaflet that replaced manual comparison with automated spatial auditing.
Dynamically rendered ~600 facilities on an interactive map from live database records.
Detected overlapping and inconsistent KML submissions across ~5,700 files without manual lookup.
Highlighted spatial conflicts and anomalies directly on the map — no cross-referencing spreadsheets.
Ran automated checks for malformed or structurally inconsistent geospatial data at scale.
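The automated conflict detection can be sketched with the standard library alone: parse each KML submission's Placemark coordinates and flag any facility whose files disagree on location. The KML snippets and facility names are illustrative, and the Leaflet rendering layer is omitted here.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

KML_NS = "{http://www.opengis.net/kml/2.2}"

def placemark_points(kml_text):
    """Yield (name, (lon, lat)) for each Placemark with Point coordinates."""
    root = ET.fromstring(kml_text)
    for pm in root.iter(KML_NS + "Placemark"):
        name = pm.findtext(KML_NS + "name", default="?")
        coords = pm.findtext(".//" + KML_NS + "coordinates")
        if coords:
            lon, lat = (float(v) for v in coords.strip().split(",")[:2])
            yield name, (round(lon, 5), round(lat, 5))

def find_conflicts(kml_files):
    """Flag any facility whose submissions disagree on location."""
    seen = defaultdict(set)
    for text in kml_files:
        for name, point in placemark_points(text):
            seen[name].add(point)
    return {name: pts for name, pts in seen.items() if len(pts) > 1}

# Two illustrative submissions: SiteA moves between files, SiteB does not.
KML_A = """<kml xmlns="http://www.opengis.net/kml/2.2"><Document>
  <Placemark><name>SiteA</name><Point><coordinates>-113.50000,53.50000,0</coordinates></Point></Placemark>
  <Placemark><name>SiteB</name><Point><coordinates>-114.10000,51.00000,0</coordinates></Point></Placemark>
</Document></kml>"""
KML_B = KML_A.replace("-113.50000,53.50000", "-113.40000,53.60000")

print(find_conflicts([KML_A, KML_B]))  # only SiteA is flagged
```

Rounding coordinates before comparison keeps the check tolerant of trivial precision differences while still catching genuine relocations.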
Spatial conflict identification dropped from hours to minutes. Geospatial data integrity improved across all datasets, with a repeatable process that runs without human intervention.
Critical regulatory data from external agencies required frequent manual updates, introducing lag, inconsistencies, and risk of outdated information reaching downstream systems. The process needed to run continuously and reliably without human intervention.
I developed automated web scraping pipelines using BeautifulSoup, built around two non-negotiables: idempotency and resilience.
Pulled structured data from multiple external agency sources on a continuous, automated schedule.
Standardized and integrated updates into internal databases regardless of source format differences.
Designed so repeated runs produced identical results — no duplicate records, no data corruption.
Handled missing fields and structural HTML changes gracefully, maintaining historical records for compatibility.
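The idempotency principle can be illustrated independently of any particular scraper: key each record by a stable identifier, fingerprint its content, and only write when the content has actually changed, preserving prior versions as history. The store layout and key format are assumptions for this sketch; the BeautifulSoup extraction step is omitted.

```python
import hashlib
import json

def fingerprint(record):
    """Stable content hash of a record, independent of key order."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

def upsert(store, key, record):
    """Idempotent write: re-running on identical data is a no-op, and a
    genuine change preserves the previous version as history."""
    existing = store.get(key)
    fp = fingerprint(record)
    if existing and existing["fingerprint"] == fp:
        return "unchanged"
    history = existing["history"] + [existing["data"]] if existing else []
    store[key] = {"data": record, "fingerprint": fp, "history": history}
    return "updated" if existing else "inserted"

store = {}
upsert(store, "agency-x/limits", {"limit": 100})   # -> "inserted"
upsert(store, "agency-x/limits", {"limit": 100})   # -> "unchanged"
upsert(store, "agency-x/limits", {"limit": 120})   # -> "updated", old value kept
```

With writes gated on the fingerprint, the scraper can run on any schedule without producing duplicates, and the history list keeps superseded values available for compatibility.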
Eliminated manual update workflows entirely. Data stays consistent and current across systems automatically — reducing operational overhead, human error, and the lag between external updates and internal visibility.
KPI calculations were manually performed across hundreds of submissions, producing inconsistent results and low confidence in reported metrics. There was no reliable way to verify outputs or compare performance consistently across facilities — a direct risk to compliance reporting credibility.
I built an automated KPI calculation and validation pipeline to standardize and verify metrics across all submissions.
Processed 300+ submissions using consistent metric definitions — no per-file manual work.
Ran automated checks against expected outputs to catch discrepancies before results were used downstream.
Pushed verified results directly into a centralised database — single source of truth for all metrics.
Maintained full transparency in calculation logic to support regulatory auditability and reproducibility.
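In outline, the validation step recomputes every reported KPI from a single shared definition and flags discrepancies before anything reaches the database. The emissions-intensity formula and field names below are illustrative stand-ins, not the actual TIER metric definitions.

```python
def emissions_intensity(co2e_tonnes, production):
    """Single shared metric definition applied to every submission."""
    if production <= 0:
        raise ValueError("production must be positive")
    return co2e_tonnes / production

def validate(submissions, tolerance=1e-6):
    """Recompute each reported KPI and flag discrepancies before loading."""
    failures = []
    for s in submissions:
        expected = emissions_intensity(s["co2e_tonnes"], s["production"])
        if abs(expected - s["reported_intensity"]) > tolerance:
            failures.append((s["facility_id"], s["reported_intensity"], expected))
    return failures

submissions = [
    {"facility_id": "F-1", "co2e_tonnes": 900.0, "production": 300.0, "reported_intensity": 3.0},
    {"facility_id": "F-2", "co2e_tonnes": 800.0, "production": 400.0, "reported_intensity": 2.5},
]
print(validate(submissions))  # F-2's reported value disagrees with the recomputation
```

Keeping the metric definition in one plain function is what makes the calculation logic transparent: an auditor can read, rerun, and verify it directly.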
Eliminated manual KPI workflows entirely. Metrics are now audit-ready by default — consistent definitions, automated validation, and a centralised record make compliance reporting defensible and scalable.
04 — Challenges & Reflection
One of the most challenging aspects of this work was aligning technical systems with evolving regulatory requirements. Unlike traditional software projects, the data structures and workflows were directly influenced by policy decisions, which required translating ambiguous regulatory language into precise, enforceable logic within the system.
Working closely with policy makers and domain experts, I had to develop a working understanding of emissions reporting standards, facility classifications, and the underlying environmental context. This meant going beyond implementation: researching policy documents, clarifying the intent behind requirements, and ensuring that the system reflected not just what was specified, but what was actually needed for accurate compliance tracking.
On the engineering side, collaborating across teams required balancing flexibility with consistency. Different systems and stakeholders had varying assumptions about data formats and workflows, so I focused on building resilient pipelines and validation layers that could adapt to change without breaking.
Building effective systems in a regulatory environment requires more than technical execution. It demands the ability to translate policy into code, design for ambiguity, and create systems that remain robust as both requirements and data evolve.
A major learning was the importance of domain context. Understanding the chemistry behind emissions data and how it translated into policy thresholds allowed me to design more meaningful validation checks and metrics. Instead of treating data as abstract inputs, I began to see how inaccuracies could directly impact compliance decisions and environmental outcomes.
This context also shaped product decisions. For example, prioritizing transparency and auditability in calculations wasn't just a technical choice; it was necessary for stakeholder trust and regulatory review. Similarly, designing tools that aligned with existing workflows (like Excel-based processes) ensured adoption across non-technical users.
Overall, this experience strengthened my ability to operate at the intersection of engineering, policy, and domain knowledge, delivering systems that are not only technically sound, but also usable, trustworthy, and aligned with real-world constraints.
Key skills: cross-functional collaboration (engineering + policy), translating regulatory requirements into system design, domain-driven data modeling, stakeholder communication, and building audit-ready, production-grade data systems.