Internship at the GoA Ministry of Environment and Protected Areas - Software and Data Engineering

Public-Sector
Tools

Python Django ETL pipelines Microsoft SQL Server JSON NoSQL Git

Built data systems with ETL and geospatial tools

Role
Software and Data Engineering
Stack
Python, Django, SQL, NoSQL, YAML, Excel, JavaScript, HTMX
Team
2 developers serving 4 teams
Context
Government of Alberta — Environmental data systems internship

In support of Alberta's TIER regulation, I designed and maintained a Django-based data platform supporting multiple applications across departments, enabling scalable data storage and streamlined workflows.

As a result, a fragmented, manual data ecosystem was transformed into a unified, automated platform. Data from 300+ sources was centralized into a single source of truth, workflows became significantly faster, and system reliability improved through robust validation and testing. The platform enabled scalable collaboration across departments while reducing operational overhead and manual intervention.

Efficiency gains:

reduced manual workflows by ~7.5x and accelerated data access by ~2x through automation, improved UI flows, and advanced Excel ingestion.

Data automation:

built ETL pipelines to transform Excel data into structured JSON and TOML formats, standardizing over 300 profiles spanning 2018–2024 while ensuring future regulatory compatibility.

Geospatial tooling:

developed a Leaflet-based mapping system to automate location visualization across ~600 sites, eliminating manual Google Earth workflows.

330+
sources
consolidated
2
regulatory agencies
scraped
600+
GIS-mapped facilities
~7.5x
faster
user workflows

02 — Process

ETL Pipeline

The Problem

Environmental reporting data arrived as Excel files. Hundreds of sources, six-plus years of submissions, each with slight variations in schema, column names, and metadata. The real issue wasn't volume. It was the absence of structure. Manual cleaning didn't scale, and any one-off fix would break with the next regulatory change or form revision.

Older metrics lived in separate sheets, at risk of being overwritten or lost. Forms were clearly versioned, but the file-per-submission fragmentation made cross-facility comparison and aggregation unreliable. All the data existed; it just wasn't in a shape that downstream systems could trust.

Why it mattered

TIER compliance requires accurate, comparable emissions data across facilities. Inconsistent source formats were directly obstructing the government's ability to track trends, enforce compliance, and adapt reporting requirements over time.

The System

I designed a flexible ETL pipeline in Python, using OpenPyXL, PyWin32, YAML, and VBA, built around one principle: schema-tolerance. Rather than hardcoding transformations for each form version, the pipeline adapts to minor structural variations without breaking.

01
Parse

Ingested heterogeneous Excel formats and normalised them into a consistent schema regardless of source version.

02
Extract

Dynamically identified and standardised metadata fields — facility IDs, reporting periods, units — across all form variants.

03
Calculate

Triggered embedded VBA macros (via PyWin32) so emissions calculations ran correctly before extraction, preserving formula logic without manual intervention.

04
Output

Converted results into structured JSON/TOML, ready for SQL ingestion and downstream analysis pipelines.
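The schema-tolerance principle behind the Parse and Extract steps can be sketched as follows. This is a minimal illustration, not the production code: the field names and alias map are hypothetical, and in practice the aliases lived in a YAML config so new form variants could be handled without code changes.

```python
import json

# Hypothetical alias map: each canonical field lists the header
# spellings observed across form versions (illustrative values).
FIELD_ALIASES = {
    "facility_id":     ["Facility ID", "FacilityID", "Fac. ID"],
    "reporting_year":  ["Reporting Year", "Year", "Rpt Year"],
    "total_emissions": ["Total Emissions (t CO2e)", "Emissions Total"],
}

def normalise_row(raw_row: dict) -> dict:
    """Map a raw spreadsheet row onto the canonical schema,
    tolerating header variations across form versions."""
    lookup = {alias: field
              for field, aliases in FIELD_ALIASES.items()
              for alias in aliases}
    out = {}
    for header, value in raw_row.items():
        field = lookup.get(header.strip())
        if field is not None:
            out[field] = value
    return out

# Two rows from different form versions normalise to the same schema:
v2019 = {"Facility ID": "AB-001", "Year": 2019,
         "Emissions Total": 1250.0}
v2023 = {"FacilityID": "AB-001", "Reporting Year": 2023,
         "Total Emissions (t CO2e)": 1180.0}

records = [normalise_row(r) for r in (v2019, v2023)]
print(json.dumps(records, indent=2))
```

Because transformations key off the alias table rather than hardcoded positions, a new form revision usually means adding one alias entry, not rewriting the pipeline.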

Impact
300+
Facility profiles standardised
6yrs
Of historical data unified
1
Source of truth for downstream systems
Outcome

Manual preprocessing effort dropped significantly. Ingestion workflows are now future-proofed against new form versions, and data is consistently available for analysis, automation, and regulatory calculation — without a human in the loop for each new submission cycle.

GIS Validation Tool

The Problem

Engineering teams were manually comparing KML files side-by-side to check coordinate consistency across ~600 facilities and ~5,700 files. The bottleneck wasn't access to the data — it was that the process couldn't run reliably or repeatably without a human in the loop. Any inconsistency required visual inspection rather than automated detection.

The System

I built a KML-based geospatial visualisation tool using Leaflet that replaced manual comparison with automated spatial auditing.

01
Plot

Dynamically rendered ~600 facilities on an interactive map from live database records.

02
Audit

Detected overlapping and inconsistent KML submissions across ~5,700 files without manual lookup.

03
Flag

Highlighted spatial conflicts and anomalies directly on the map — no cross-referencing spreadsheets.

04
Validate

Ran automated checks for malformed or structurally inconsistent geospatial data at scale.
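The Audit and Flag steps amount to extracting coordinates from each KML submission and comparing them across files. A minimal sketch of that check, using only the standard library (the coordinates and tolerance are illustrative; the real tool rendered conflicts on a Leaflet map rather than printing them):

```python
import xml.etree.ElementTree as ET

KML_COORDS = "{http://www.opengis.net/kml/2.2}coordinates"

def placemark_coords(kml_text: str) -> list[tuple[float, float]]:
    """Extract (lon, lat) pairs from every Placemark in a KML document."""
    root = ET.fromstring(kml_text)
    coords = []
    for node in root.iter(KML_COORDS):
        for triple in node.text.split():
            lon, lat, *_ = map(float, triple.split(","))
            coords.append((lon, lat))
    return coords

def conflicting(a, b, tol=1e-4):
    """Flag two submissions whose coordinates differ by more than
    roughly 10 m (1e-4 degrees of latitude)."""
    return any(abs(x1 - x2) > tol or abs(y1 - y2) > tol
               for (x1, y1), (x2, y2) in zip(a, b))

sub_a = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark><Point>
    <coordinates>-113.4938,53.5461,0</coordinates>
  </Point></Placemark>
</kml>"""
sub_b = sub_a.replace("53.5461", "53.5600")  # later submission drifted

print(conflicting(placemark_coords(sub_a), placemark_coords(sub_b)))
```

Running this pairwise over a facility's submissions replaces side-by-side visual inspection with a deterministic check that scales to thousands of files.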

Impact
600+
Facilities mapped
5.7k
Files validated
0
Manual comparisons needed
Outcome

Spatial conflict identification dropped from hours to minutes. Geospatial data integrity improved across all datasets, with a repeatable process that runs without human intervention.

Automated Data Ingestion

The Problem

Critical regulatory data from external agencies required frequent manual updates, introducing lag, inconsistencies, and risk of outdated information reaching downstream systems. The process needed to run continuously and reliably without human intervention.

The System

I developed automated web scraping pipelines using BeautifulSoup, built around two non-negotiables: idempotency and resilience.

01
Extract

Pulled structured data from multiple external agency sources on a continuous, automated schedule.

02
Normalise

Standardised and integrated updates into internal databases regardless of source format differences.

03
Idempotent

Designed so repeated runs produced identical results — no duplicate records, no data corruption.

04
Resilient

Handled missing fields and structural HTML changes gracefully, maintaining historical records for compatibility.
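The idempotency requirement is easiest to see in the load step. Below is a sketch under assumed names: the real pipeline scraped agency pages with BeautifulSoup and wrote to the internal SQL store, whereas here the "scraped" rows are hard-coded and an in-memory SQLite database stands in. The key idea is upserting on a stable natural key, so re-running a scrape can never duplicate or corrupt records:

```python
import sqlite3

def load(conn, rows):
    """Upsert rows keyed on a stable natural key. Repeated runs with
    the same input leave the table in an identical state."""
    conn.executemany(
        """INSERT INTO registry (agency_id, name, status)
           VALUES (:agency_id, :name, :status)
           ON CONFLICT(agency_id) DO UPDATE SET
               name = excluded.name,
               status = excluded.status""",
        rows,
    )

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE registry (
    agency_id TEXT PRIMARY KEY, name TEXT, status TEXT)""")

scraped = [
    {"agency_id": "F-100", "name": "Plant A", "status": "active"},
    {"agency_id": "F-101", "name": "Plant B", "status": "suspended"},
]
load(conn, scraped)
load(conn, scraped)  # second run is a no-op: same state, no duplicates

count = conn.execute("SELECT COUNT(*) FROM registry").fetchone()[0]
print(count)  # 2
```

With the write path idempotent, the scheduler can simply re-run the scraper after any failure; resilience then reduces to tolerating missing fields during extraction rather than guarding the database.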

Impact
Outcome

Eliminated manual update workflows entirely. Data stays consistent and current across systems automatically — reducing operational overhead, human error, and the lag between external updates and internal visibility.

Metric Automation System

The Problem

KPI calculations were manually performed across hundreds of submissions, producing inconsistent results and low confidence in reported metrics. There was no reliable way to verify outputs or compare performance consistently across facilities — a direct risk to compliance reporting credibility.

The System

I built an automated KPI calculation and validation pipeline to standardise and verify metrics across all submissions.

01
Calculate

Processed 300+ submissions using consistent metric definitions — no per-file manual work.

02
Validate

Ran automated checks against expected outputs to catch discrepancies before results were used downstream.

03
Integrate

Pushed verified results directly into a centralised database — single source of truth for all metrics.

04
Audit

Maintained full transparency in calculation logic to support regulatory auditability and reproducibility.
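The Calculate and Validate steps can be sketched as below. The KPI, field names, and tolerance are all hypothetical stand-ins (emissions intensity per unit production is used purely for illustration); the shape of the check is what matters: one shared metric definition, applied to every submission, with discrepancies surfaced before anything reaches the central database.

```python
def emissions_intensity(sub: dict) -> float:
    """One metric definition, applied identically to every submission."""
    return round(sub["total_emissions"] / sub["production"], 4)

def validate(sub: dict, computed: float, tol: float = 0.01) -> list[str]:
    """Return a list of discrepancies; an empty list means audit-ready."""
    issues = []
    if sub["production"] <= 0:
        issues.append("non-positive production volume")
    reported = sub.get("reported_intensity")
    if reported is not None and abs(reported - computed) > tol:
        issues.append(f"reported {reported} != computed {computed}")
    return issues

submissions = [
    {"facility_id": "AB-001", "total_emissions": 1250.0,
     "production": 5000.0, "reported_intensity": 0.25},
    {"facility_id": "AB-002", "total_emissions": 900.0,
     "production": 3000.0, "reported_intensity": 0.41},  # disagrees
]

for sub in submissions:
    kpi = emissions_intensity(sub)
    for issue in validate(sub, kpi):
        print(sub["facility_id"], issue)
```

Keeping the calculation in one plainly readable function is also what makes the Audit step possible: a reviewer can trace any stored metric back to a single definition.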

Impact
300+
Submissions standardised
Trust in reported metrics
Outcome

Eliminated manual KPI workflows entirely. Metrics are now audit-ready by default — consistent definitions, automated validation, and a centralised record make compliance reporting defensible and scalable.

What I learned

One of the most challenging aspects of this work was aligning technical systems with evolving regulatory requirements. Unlike traditional software projects, the data structures and workflows were directly influenced by policy decisions, which required translating ambiguous regulatory language into precise, enforceable logic within the system.

Working closely with policy makers and domain experts, I had to develop a working understanding of emissions reporting standards, facility classifications, and the underlying environmental context. This meant going beyond implementation, researching policy documents, clarifying intent behind requirements, and ensuring that the system reflected not just what was specified, but what was actually needed for accurate compliance tracking.

On the engineering side, collaborating across teams required balancing flexibility with consistency. Different systems and stakeholders had varying assumptions about data formats and workflows, so I focused on building resilient pipelines and validation layers that could adapt to change without breaking.

Key Takeaway

Building effective systems in a regulatory environment requires more than technical execution. It demands the ability to translate policy into code, design for ambiguity, and create systems that remain robust as both requirements and data evolve.

A major learning was the importance of domain context. Understanding the chemistry behind emissions data and how it translated into policy thresholds allowed me to design more meaningful validation checks and metrics. Instead of treating data as abstract inputs, I began to see how inaccuracies could directly impact compliance decisions and environmental outcomes.

This context also shaped product decisions. For example, prioritizing transparency and auditability in calculations wasn't just a technical choice; it was necessary for stakeholder trust and regulatory review. Similarly, designing tools that aligned with existing workflows (like Excel-based processes) ensured adoption across non-technical users.

Overall, this experience strengthened my ability to operate at the intersection of engineering, policy, and domain knowledge, delivering systems that are not only technically sound, but also usable, trustworthy, and aligned with real-world constraints.

Skills Built

Cross-functional collaboration (engineering + policy), translating regulatory requirements into system design, domain-driven data modeling, stakeholder communication, and building audit-ready, production-grade data systems.