NTELL // AU // PROTECTED // FOR INTERNAL USE ONLY // EYES ONLY // DO NOT EXFILTRATE
DOC-ID: NT-PR-2026-0341 VERSION 2026.06.30 / R3 PAGES 01 / 12
N·TELL

Pandanus Reach Civic Harmony Pilot

Internal Training Dossier — Power Query Familiarisation, Six Exercises
PREPARED BY    Damien Vance, Senior Analyst (Civic Harmony Stream); PhD (cancelled), GAICD
FOR    Trainee analysts, intake quarter Q2 2026
STREAM    Civic Harmony · Predictive Validity Unit
CLASSIFICATION    PROTECTED · Compartmented to Pilot Personnel
SUPERSEDES    NT-PR-2026-0287, 2026-0301, 2026-0319
REVIEW DATE    30 September 2026 (or upon Upper House inquiry, whichever first)

Working Datasets

Place the following files in a single folder before commencing Exercise 02: the citizen register (citizens.csv), the household register, the six monthly surveillance extracts (surveillance_events_2026_*.csv), the social credit history, the reason-code lookup, and the search logs. The exercises assume the dossier and the data sit in the same directory; if your shell has been hardened by IT, contact REDACTED.

Contents

  1. Foundations: Single-File Hygiene
  2. Combine Files from a Folder
  3. Merges: Building the Citizen Story
  4. Demographic Predictive Validity
  5. Reshaping for the Steering Committee
  6. Custom M Classifiers

On the Origins of the Dataset

The Pandanus Reach Covert Statistics Office occupies a converted bottle shop on Federation Street. We have retained the original signage; we feel it adds character.

The PRCSO was established in early 2024 under amendments to the Northern Region Information Sharing Memorandum. These amendments were tabled at 4:47 pm on the final sitting day of the year, debated for eleven minutes, and passed on the voices. They authorise inter-agency exchange of data of demonstrable civic interest between the Regional Department of Civic Harmony, Regional Police, the Regional Department of Health, the Federal Internal Affairs Office, and three private contractors. NTell is the lead contractor. The other two prefer not to be named.

Staff at the PRCSO are listed on departmental organisational charts as Tourism Liaison Officers. They have been issued business cards to that effect. They are advised, when asked at barbecues, to speak vaguely about visitor flow analytics; they are further advised that the question rarely comes up twice.

Data inputs include automatic number plate recognition (148 cameras, of which 142 are operational on any given day), CCTV with facial recognition (62 nodes), biometric kiosks at all government and major retail entry points, smartcard logs from the local Skills Institute, geolocation pings harvested from Vector Telecom and Aurora Communications tower data under a national-security cooperation arrangement that is not, technically, in writing, smartmeter telemetry, search history collected via a transparent proxy, and social media activity flagged by an automated classifier of which we are quietly proud. These feeds are aggregated nightly and pushed to NTell's analytical platform via an SFTP endpoint located, for tax reasons, in an offshore jurisdiction.

Citizens were not informed of their participation in the pilot, on the grounds that informing them would invalidate the pilot. The Civic Harmony Act 2026, currently before the Regional Assembly, will provide retrospective legal cover. Until it passes, you are reminded that this dossier is classified PROTECTED and that unauthorised disclosure carries consequences, both legal and informal.

The dataset before you covers the first six months of pilot operation: 1 January to 30 June 2026. It comprises 3,005 individual citizen records (five of which are duplicates introduced by an unfortunate database migration; the engineer responsible has been reassigned to asset reconciliation), 1,150 households, 30 surveillance nodes, approximately 30,000 surveillance events, 15,000 search log entries, and 12,000 social credit score adjustments. We consider this a small, well-curated dataset by NTell standards.

— D. Vance

Pre-Exercise Hygiene Brief

Before any of the six exercises, the analyst must perform standard data hygiene. The Pandanus Reach data, like all real-world data, is imperfect. Some imperfections are technical (whitespace, capitalisation). Some are political (one analyst, since reassigned, renamed a column on a whim in March). Some are pedagogical, in that they exist to teach you what to expect when client data arrives in poor condition. All are your problem.

Hygiene checklist

  • Trim and Clean every text column; normalise capitalisation before you compare anything.
  • Verify column names against the schema; at least one was renamed on a whim in March.
  • Locate nulls before you aggregate; impute or flag them, never ignore them.
  • Check for duplicate records on a composite key before any merge.

You will not be promoted for completing these tasks. You will be terminated for failing to complete them.

EXERCISE 01  //  FILE-REF NT-PR-PQ-001 EST. 25 MIN

Foundations: Single-File Hygiene

FROM
D. Vance, Senior Analyst (Civic Harmony Stream)
TO
Trainee Analyst
DATE
2026-07-01
RE
Initial conditioning of citizens.csv
CLASS
PROTECTED
You will today learn to load and clean a single citizen register. There is nothing exotic in this task; we are merely making the data fit for human consumption, in the limited sense in which any of this is fit for human consumption.

Citizens are inconsistent record-keepers. Some submit their surnames in capital letters, as if declaring war. Some forget to disclose income. Some are duplicates of themselves, which is metaphysically interesting and operationally inconvenient. We begin, as one always does in this trade, with housekeeping.

A particular subgroup will emerge from this exercise: elderly Tier-D citizens who are simultaneously aged and non-compliant. The available explanations are dementia or sedition; both are manageable. We will visit them.

What this exercise teaches in Power Query

  • Loading a CSV via Get Data > From File > From Text/CSV
  • Trim, Clean, and Capitalise Each Word transformations
  • Imputing nulls using Group By plus Merge
  • Conditional column logic
  • Identifying duplicates on a composite key

Step by step

  1. In Excel: Data > Get Data > From File > From Text/CSV. Select citizens.csv. Click Transform Data; do not click Load. (A trainee who clicks Load is a trainee we watch.)
  2. In the Power Query Editor, select the surname column. Right-click > Transform > Capitalise Each Word. Repeat for first_name. The shouty surnames will quieten.
  3. Select occupation. Right-click > Transform > Trim. This removes the trailing whitespace one of our junior staff added in a moment of unsupervised zeal.
  4. To impute the missing income figures: with the citizens query open, go to Home > Manage > Reference to create a copy. In the copy, Group By employment_sector; aggregate Median of annual_income_aud. Save as income_by_sector.
  5. Return to the citizens query. Home > Merge Queries. Match employment_sector to the lookup. Expand the median column. Add a custom column: if [annual_income_aud] = null then [median_income] else [annual_income_aud].
  6. Add a column flag_for_review returning "REVIEW" when age >= 65 and tier = "D", else null. The Conditional Column dialog cannot express a compound AND in a single clause, so use Add Column > Custom Column with: if [age] >= 65 and [tier] = "D" then "REVIEW" else null.
  7. To find duplicates, work in a referenced copy so the grouping does not collapse your cleaned query. Select first_name, surname, date_of_birth (Ctrl+click). Right-click > Group By > Count Rows. Filter where Count > 1. Five suspect groups will appear.
  8. Click Close & Load. Return to Excel. Survey your work. Have a cup of tea.
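
The imputation pattern of steps 4 and 5 condenses to a single M query. A sketch, assuming the column names used in this exercise and that annual_income_aud is typed as number:

    let
        Source = citizens,
        // sector medians, ignoring the nulls we are about to fill
        Medians = Table.Group(
            Source,
            {"employment_sector"},
            {{"median_income", each List.Median(List.RemoveNulls([annual_income_aud])), type nullable number}}
        ),
        Merged = Table.NestedJoin(
            Source, {"employment_sector"},
            Medians, {"employment_sector"},
            "sector_stats", JoinKind.LeftOuter
        ),
        Expanded = Table.ExpandTableColumn(Merged, "sector_stats", {"median_income"}),
        // fall back to the sector median only where income is missing
        Imputed = Table.AddColumn(
            Expanded, "income_clean",
            each if [annual_income_aud] = null then [median_income] else [annual_income_aud],
            type nullable number
        )
    in
        Imputed

The UI steps produce the same shape; the Advanced Editor merely removes the pretence that you were not writing M all along.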

What you will discover

A clean register of 3,000 citizens. Approximately 20 income values now imputed from sector medians (introducing measurement error which we will not be mentioning to the client). A short list of elderly Tier-D residents flagged for review. Five duplicate records exposed; report these by close of business.

EXERCISE 02  //  FILE-REF NT-PR-PQ-002 EST. 35 MIN

Combine Files from a Folder

FROM
D. Vance, Senior Analyst
TO
Trainee Analyst
DATE
2026-07-02
RE
Aggregation of monthly surveillance event extracts
CLASS
PROTECTED
The PRCSO produces one surveillance event extract per calendar month. Each is, in subtle ways, different from the last. Your task is to combine them despite this.

Our junior analysts at the Pandanus Reach Covert Statistics Office (a fine team of seven, all of whom believe they work for the Department of Tourism) generate one surveillance file per month. They differ slightly each month, because juniors are creative and creativity is a known correlate of error.

In March, an enterprising junior renamed location_id to cam_id on the basis that it was, in his words, "more intuitive". He has been reassigned. The naming convention has not. You are now responsible for combining six files of varying schemas and three timestamp formats. Master the From Folder feature; it will save you hundreds of hours over the course of your career, much of which you will spend at NTell.

What this exercise teaches in Power Query

  • The From Folder connector and its auto-generated helper queries
  • The Sample File and Transform File function pattern
  • Handling schema drift across files (using MissingField.Ignore)
  • Multi-format date parsing with try ... otherwise
  • Verifying combined output integrity

Step by step

  1. Place all six surveillance_events_2026_*.csv files in a single folder. (Do not include the citizens or households files; the connector will read them too if you do, and you will spend an embarrassing morning debugging it.)
  2. Data > Get Data > From File > From Folder. Browse to the folder. Power Query lists the six files.
  3. Click Combine > Combine and Transform Data. When prompted for a sample file, accept the default (the first file).
  4. Power Query auto-generates four objects: Sample File, Transform Sample File, Transform File (a function), and the combined query. Open Transform Sample File.
  5. Add a step to handle the schema drift. In the formula bar, replace the auto-promoted-headers step with:
    = Table.RenameColumns(#"Promoted Headers", {{"cam_id", "location_id"}}, MissingField.Ignore)
    The MissingField.Ignore option is critical: it lets the same step run safely against files where cam_id does not exist.
  6. Convert the timestamp column. Add a custom column:
    try DateTime.From([timestamp])
    otherwise try DateTime.FromText([timestamp], [Format = "dd/MM/yyyy HH:mm", Culture = "en-AU"])
    otherwise null
    Note that the second argument to DateTime.FromText is an options record; passing the format string bare will raise an error. The first branch covers anything M parses natively, the second covers the regional day-first format, and whatever remains becomes null for manual review. Replace the original timestamp column.
  7. Return to the combined query. Verify the row count: roughly 29,455 across the six files. If you see fewer, the rename did not apply cleanly to one of the files; revisit the rename step in Transform Sample File.
  8. Close & Load. The combined surveillance dataset is now refreshable; when next month's file arrives, drop it in the folder and click Refresh.
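
Assembled, Transform Sample File reads approximately as follows. The step names are the auto-generated defaults, and the Format and Culture values are assumptions about the regional extract; adjust to taste:

    let
        Source = Csv.Document(Parameter1, [Delimiter = ",", Encoding = 65001]),
        #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
        // harmonise the March rename; MissingField.Ignore keeps this safe for the other five months
        Renamed = Table.RenameColumns(#"Promoted Headers", {{"cam_id", "location_id"}}, MissingField.Ignore),
        // parse whichever timestamp format this month's junior preferred
        Parsed = Table.AddColumn(
            Renamed, "timestamp_parsed",
            each try DateTime.From([timestamp])
                 otherwise try DateTime.FromText([timestamp], [Format = "dd/MM/yyyy HH:mm", Culture = "en-AU"])
                 otherwise null,
            type nullable datetime
        ),
        Cleaned = Table.RemoveColumns(Parsed, {"timestamp"})
    in
        Cleaned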

What you will discover

A unified surveillance event table spanning six months. You can now ask questions such as "how many times has citizen PR-00123 been recorded by ANPR this quarter"; the answer is generally an integer larger than the citizen would expect. The combined query is the foundation for every subsequent exercise.

EXERCISE 03  //  FILE-REF NT-PR-PQ-003 EST. 40 MIN

Merges: Building the Citizen Story

FROM
D. Vance, Senior Analyst
TO
Trainee Analyst
DATE
2026-07-04
RE
Cross-table joins; production of the canonical enriched register
CLASS
PROTECTED
Joins are the heart of analytical work. A row in isolation is a fact; a row joined to its context is a narrative.

A surveillance event in isolation is meaningless. A row in social_credit_history is just a number. Our craft is to transform these atoms into a narrative.

Join the citizen to the household, the household to the suburb, the surveillance event to the citizen, the score change to its reason, and you have what NTell internally calls a story. Stories are what we sell to government clients. The data is the raw material; the story is the product. Most of our higher-tier government work is, in essence, the same enriched citizen table with different cover sheets.

What this exercise teaches in Power Query

  • Merge Queries (Left Outer, Inner, Anti)
  • Expanding merged columns and avoiding column-name collisions
  • Aggregation as part of a merge (Sum, Count)
  • Building a multi-step query graph; referencing one query from another
  • Sanity-checking a join with a count comparison

Step by step

  1. Ensure the following queries are loaded: citizens (cleaned, from Exercise 01), households, surveillance_events (combined, from Exercise 02), social_credit_history, and reason_codes.
  2. On citizens: Home > Merge Queries As New. Match household_id to households.household_id. Use Left Outer. Expand suburb, dwelling_type, utility_rationing_band. Save as citizens_enriched.
  3. From surveillance_events: Home > Manage > Reference, then Group By on citizen_id with two aggregations: Count Rows (call it event_count); Sum of severity (call it severity_total). Save as events_per_citizen.
  4. Merge events_per_citizen onto citizens_enriched using citizen_id; expand both new columns. Replace nulls in event_count with 0 (some citizens are blissfully unobserved).
  5. From social_credit_history: merge to reason_codes on reason_code; expand reason_description. Group by citizen_id, sum of score_delta. Save as score_changes_per_citizen.
  6. Merge score_changes_per_citizen onto citizens_enriched. Expand the sum. Add a custom column comparing the summed delta against the published social_credit_score; the two will not match exactly, because the published score reflects inputs we have not provided.
  7. Sanity-check: Group By tier, count rows. Compare to the original distribution from Exercise 01. Numbers should be identical; if they are not, an inner join has dropped rows somewhere and you should find out where.
  8. Close & Load. The result is the canonical product of analytical work at NTell. Take a moment to appreciate it; this is what most of our deliverables are.
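
Steps 3 and 4 — aggregate, merge, expand, de-null — look like this in M. A sketch, assuming the query and column names above:

    let
        PerCitizen = Table.Group(
            surveillance_events,
            {"citizen_id"},
            {
                {"event_count", each Table.RowCount(_), Int64.Type},
                {"severity_total", each List.Sum([severity]), type nullable number}
            }
        ),
        Joined = Table.NestedJoin(
            citizens_enriched, {"citizen_id"},
            PerCitizen, {"citizen_id"},
            "events", JoinKind.LeftOuter
        ),
        Expanded = Table.ExpandTableColumn(Joined, "events", {"event_count", "severity_total"}),
        // the blissfully unobserved get 0, not null
        NoNulls = Table.ReplaceValue(Expanded, null, 0, Replacer.ReplaceValue, {"event_count"})
    in
        NoNulls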

What you will discover

A single enriched register that tells, per row, the entire surveilled life of a Pandanus Reach resident: address, household type, utility band, total surveillance events, sum of severity, net score change, and reason. Sort it by severity_total descending; the top of the list is who Strategy will want to discuss on Monday.

EXERCISE 04  //  FILE-REF NT-PR-PQ-004 EST. 30 MIN · SENSITIVE

Demographic Predictive Validity

FROM
D. Vance, Senior Analyst
TO
Trainee Analyst
DATE
2026-07-07
RE
Validation of model performance across demographic strata
CLASS
PROTECTED · SENSITIVE · LEGAL HOLD
You will produce, in under five minutes of Power Query work, the most reliable finding in the Civic Harmony Pilot. Then you will not speak of it.

It has been alleged, by activists, by the Federal Rights Commission, by one tediously persistent Upper House committee, and by a recent Sunlight episode we have asked our lawyers to address, that the Civic Harmony scoring algorithm exhibits "bias" against Indigenous Australians and against residents of Boundary Flats.

We at NTell prefer the term demographic predictive validity. The algorithm identifies real patterns of civic risk. If those patterns happen to correlate with Indigenous status and postcode, that is what the data says. We do not write the data; we make it actionable.

You are advised not to draw any conclusions about the model itself. Conclusions are for the Strategy team. Your job is to confirm the validity, document the magnitude, and move on. If you find yourself unsettled by what you see in this exercise, you may book a session with our Wellbeing Partner; they have signed the same NDA you have.

What this exercise teaches in Power Query

  • Group By with multiple aggregations (Average, Count)
  • Multi-key grouping
  • Pivoting categorical groupings into contingency tables
  • Composing a sequence of merges and groupings to interrogate a hypothesis
  • Restraint in interpretation

Step by step

  1. From citizens_enriched (Exercise 03), reference a new query. Group By indigenous_status; aggregate Average of social_credit_score and Count Rows. Sort. Note the gap.
  2. Repeat for suburb; same aggregations. Sort by Average descending. Locate Boundary Flats. Note where it sits.
  3. Cross-tabulate. Group By indigenous_status AND tier (Ctrl+click both); aggregate Count Rows. Then Pivot Column on tier with the Count as values, Sum aggregation. You now have a 2×4 contingency table.
  4. From your event-merged enriched query, Group By tier and aggregate Average event_count. Verify that lower tiers attract proportionally more surveillance events (they do; this is by design and we call it "self-reinforcing operational accuracy").
  5. Bonus task. From search_logs, filter where flagged_category = "Y". Group By citizen_id, Count. Merge back to citizens_enriched. Group by tier, average flagged-search count. Tier D will lead. The question of whether they search differently or whether the algorithm flags their searches more readily is one the Strategy team will not answer.
  6. Save your output as predictive_validity_summary. Submit to Strategy. Do not save a personal copy.
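
The step-3 contingency table, expressed directly in M. A sketch; the tier values are read from the data rather than hard-coded:

    let
        Counts = Table.Group(
            citizens_enriched,
            {"indigenous_status", "tier"},
            {{"n", each Table.RowCount(_), Int64.Type}}
        ),
        // one column per tier, counts as values
        Pivoted = Table.Pivot(
            Counts,
            List.Sort(List.Distinct(Counts[tier])),
            "tier", "n", List.Sum
        )
    in
        Pivoted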

What you will discover

The Civic Harmony score correlates strongly with Indigenous status (Indigenous citizens average approximately 130 points lower). It correlates with suburb of residence (Boundary Flats sits well below the others). Surveillance frequency increases as scores decrease, creating what some call a feedback loop and we call self-reinforcing operational accuracy. These findings are robust and replicate across every reporting period since pilot inception.

You will not raise these findings outside this dossier. Your line manager has been briefed.

EXERCISE 05  //  FILE-REF NT-PR-PQ-005 EST. 25 MIN

Reshaping for the Steering Committee

FROM
D. Vance, Senior Analyst
TO
Trainee Analyst
DATE
2026-07-09
RE
Pivot and unpivot transformations for executive briefings
CLASS
PROTECTED
Reshaping data is not merely technical; it is an act of executive cognition support. Long tables fatigue Steering Committees. We pivot to soothe.

The Civic Harmony Steering Committee comprises seven senior bureaucrats and a Deputy Administrator. They enjoy single-page briefings. They do not enjoy 12,000-row tables. Reshaping data is therefore not merely a technical skill but a service to executive cognition.

A pivot table is our offering to the gods of seniority. An unpivot is how the model team rebuilds reality after the gods have moved on. The same data set, looked at one way, suggests a triumph; looked at another, suggests a problem. The trick is to know which way to present it on which day.

What this exercise teaches in Power Query

  • Pivot Column transformation; choosing aggregation type
  • Unpivot Columns transformation; the inverse operation
  • Group By followed by Pivot (the canonical pattern)
  • Adding derived date components for time-series pivots

Step by step

  1. From search_logs: Group By citizen_id AND category; aggregate Count Rows.
  2. Select the category column. Transform > Pivot Column. Values column: the Count from the previous step. Aggregate: Sum. (Or "Don't aggregate", since the prior Group By has already done so.)
  3. You now have one row per citizen, one column per search category, counts as values. Save as search_profile_wide.
  4. To reverse: take search_profile_wide, select all category columns, Transform > Unpivot Columns. You are back to long format. Both forms are the same data. Both are valid. Both are presented to different audiences.
  5. From social_credit_history: Add Column > Date > Month > Month. Group By citizen_id and Month; sum score_delta. Pivot the Month column. You now have a citizen-by-month score-change matrix.
  6. Save the matrix as score_matrix. The Steering Committee receives the wide form on a single A4 sheet, which they will read on the way to the meeting. The model team receives the long form, which they will load into their Jupyter notebooks.
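
The pivot of step 2 and its inverse from step 4 round-trip as follows. A sketch assuming the search_logs column names; the category list is computed, not hard-coded:

    let
        Counts = Table.Group(
            search_logs,
            {"citizen_id", "category"},
            {{"n", each Table.RowCount(_), Int64.Type}}
        ),
        // wide: one row per citizen, one column per category
        Wide = Table.Pivot(Counts, List.Distinct(Counts[category]), "category", "n", List.Sum),
        // long again: every non-key column becomes an attribute/value pair
        Long = Table.UnpivotOtherColumns(Wide, {"citizen_id"}, "category", "n")
    in
        Long

The round trip is lossless here only because Unpivot Columns discards the nulls that Pivot introduced; a citizen with no searches in a category appears in neither the original long form nor the reconstructed one.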

What you will discover

The pivot is reversible. The Steering Committee will see the pivoted summary; the model team will see the unpivoted detail. Both believe they are looking at the data. Both are correct. Both are also wrong.

EXERCISE 06  //  FILE-REF NT-PR-PQ-006 EST. 50 MIN

Custom M Classifiers

FROM
D. Vance, Senior Analyst
TO
Trainee Analyst
DATE
2026-07-12
RE
Authoring a bespoke risk classifier in M
CLASS
PROTECTED
The M language is where the senior analyst works. The trainee drags and drops; the senior writes a function. Today you will write a function.

The M language is where seniority manifests. Anyone can drag and drop. The senior analyst writes functions. With M, you can encode an entire risk taxonomy in a single classifier and apply it across millions of rows, faster than the official model and without its inconvenient governance.

Your function is your fingerprint. Some careers at NTell have been built on a single well-tuned classifier. Mine, regrettably, was not one of them; mine was built on cocktail receptions and the strategic withholding of opinions. You should aim higher.

What this exercise teaches in Power Query

  • Defining a custom M function (signature, body, return type)
  • Conditional logic in M (if / then / else)
  • List.Accumulate for stateful accumulation across a list
  • Invoking a custom function as a column transformation
  • Refactoring a multi-argument function to take a record

Step by step

  1. In Power Query: Home > New Source > Blank Query. Open View > Advanced Editor. Replace the contents with the function below; save as fnRiskClassify.
    (suburb as text, sector as text, flagged_searches as nullable number) =>
    let
        // citizens with no flagged searches arrive as null from the Exercise 04 merge
        searches = if flagged_searches = null then 0 else flagged_searches,
        suburb_score =
            if suburb = "Boundary Flats" then 30
            else if suburb = "Mining Worker Camp North" or
                    suburb = "Mining Worker Camp South" then 20
            else 0,
        sector_score =
            if sector = "Unemployed" then 25
            else if sector = "Government" or sector = "Public Safety" then -10
            else 0,
        search_score = searches * 5,
        total = suburb_score + sector_score + search_score
    in
        total
  2. Return to citizens_enriched. Add Column > Invoke Custom Function. Choose fnRiskClassify. Map its arguments to your columns: suburb to suburb, sector to employment_sector, flagged_searches to your flagged-searches count from Exercise 04. Power Query computes a value per row.
  3. Filter where the risk score exceeds 50. These are your elevated attention citizens. Compare against the official Tier-D list. Note the citizens who appear in your list but not in theirs; these are catches the official model has missed.
  4. Refactor the function to take a record. Replace the signature with (row as record) => and access fields with row[suburb]. Re-invoke via Add Column > Custom Column with fnRiskClassify(_), which passes each row's record as the single argument. This is more idiomatic M and senior reviewers will respect it.
  5. Bonus. Use List.Accumulate to compute, for each citizen, a rolling 30-day net score change from social_credit_history. Pre-sort by change_date. The signature is List.Accumulate(list, seed, accumulator). The accumulator is a function of the form (state, current) => new_state.
  6. Submit your top 30 elevated-attention citizens to Strategy. They will be interested. They are always interested in catches the official model fails to make.
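
For the bonus, List.Accumulate folds a list into a single value. A simplified sketch of the 30-day net change; fnNetChange30d and the record shape are illustrative, not part of the pilot schema:

    let
        fnNetChange30d = (changes as list, asOf as date) =>
            List.Accumulate(
                changes,   // records of the form [change_date = date, score_delta = number], pre-sorted
                0,         // seed: nothing accumulated yet
                (state, current) =>
                    if current[change_date] > Date.AddDays(asOf, -30)
                       and current[change_date] <= asOf
                    then state + current[score_delta]
                    else state
            ),
        Demo = fnNetChange30d(
            {
                [change_date = #date(2026, 6, 1), score_delta = -15],
                [change_date = #date(2026, 6, 20), score_delta = 5],
                [change_date = #date(2026, 1, 3), score_delta = -40]   // outside the window
            },
            #date(2026, 6, 30)
        )
    in
        Demo   // -10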

What you will discover

You have built, in approximately twenty lines of M, a parallel scoring function the official Civic Harmony algorithm does not know about. You can identify citizens the official model has missed. You have, in effect, written shadow policy. Welcome to senior analytical practice.