Lab II

Overview

This lab provides a practical overview of how satellite image embeddings, compact numerical representations of imagery, can be used as a general-purpose way of describing geographic space.

The lab begins with unsupervised analysis, where embeddings are used to explore and structure geographic space without predefined labels. This demonstrates how they support exploratory analysis and the identification of place-based patterns. The same embeddings are then reused in a predictive context, showing how a single representation can support multiple analytical tasks without rebuilding the feature pipeline.

The emphasis throughout is on reusability and judgement:

  • how embeddings enable faster and more flexible exploration,
  • how they integrate into standard analytical workflows, and
  • when they meaningfully improve predictive performance.

The lab is not about training complex models, but about understanding how and when embeddings add value in spatial analysis.

Learning outcomes

By the end of the lab, participants will be able to:

  • Interpret satellite image embeddings as representations of geographic space.
  • Develop and evaluate predictive models using embeddings.
  • Critically assess when embeddings improve analytical or predictive performance.

Data

Embedding data

  • Satellite image embeddings derived from Google imagery for London (2024).
  • Embeddings are aggregated to Lower-layer Super Output Area (LSOA) level using the mean value for each embedding dimension.
  • The final dataset contains 64 embedding variables (A00_mean to A63_mean).
  • Provided as a GeoPackage: uk_lsoa_london_embeds_2024.gpkg.
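
The aggregation step described above can be illustrated with a minimal sketch. It assumes a hypothetical table of per-pixel embeddings that already carries an LSOA code. The sample values and the two-dimension layout are invented for illustration and are not the processing pipeline used to build the lab dataset.

```python
import pandas as pd

# Hypothetical per-pixel embeddings already tagged with an LSOA code
# (two dimensions shown; the lab data has 64: A00 to A63)
pixels = pd.DataFrame({
    "LSOA21CD": ["E01000001", "E01000001", "E01000002", "E01000002"],
    "A00": [0.10, 0.30, -0.20, 0.00],
    "A01": [0.50, 0.70, 0.40, 0.20],
})

# Mean of each embedding dimension within each LSOA
lsoa_means = pixels.groupby("LSOA21CD").mean()

# Match the naming used in the lab dataset (A00_mean, A01_mean, ...)
lsoa_means = lsoa_means.add_suffix("_mean").reset_index()

print(lsoa_means)
```

Each LSOA ends up as a single row, with one averaged value per embedding dimension.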

Socio-demographic data

  • Index of Multiple Deprivation (IMD)
      – IMD 2025 provides a relative measure of deprivation across 33,755 LSOAs in England.
      – Values are reported in deciles (10% bands from most to least deprived).
      – Source: Ministry of Housing, Communities and Local Government GeoPortal: https://communitiesopendata-communities.hub.arcgis.com/
  • Socio-economic and population characteristics (Census 2021)
      – Percentage of population aged 16+ with no educational qualifications
      – Percentage reporting bad or very bad health
      – Population density (persons per km²)
      – Percentage of single-parent households
      – Source: NOMIS: https://www.nomisweb.co.uk/sources/census_2021_bulk
  • All socio-demographic variables have been linked at LSOA level and are provided in Socioeconomic.csv, located in the data folder.

Import Libraries

Before working with the data, a number of Python libraries are required to support data handling, spatial analysis, and modelling tasks in this lab.

These libraries provide functionality for:

  • reading and manipulating tabular and spatial data
  • performing numerical and statistical operations
  • visualising spatial patterns
  • applying clustering and predictive models

All code in this lab assumes that these libraries are available in the working environment and have been imported prior to the analytical steps.

Code
# -----------------------------------------
# Core libraries
# -----------------------------------------

# Tabular and spatial data handling
import pandas as pd              # Tabular data manipulation
import geopandas as gpd          # Vector GIS data and geometry handling
import json                      # JSON handling

# -----------------------------------------
# Local functions
# -----------------------------------------

# Import specific functions from local module
from tools import (
    filter_table,
    get_embedding_cols,
    kmeans_clustering,
    show_cluster_labels,
    plot_simple_map,
    parse_reference_points,
    make_webmap_general,
    closest_lsoas_to_cluster,
    map_closest_lsoas,
    plot_embedding_distances,
    run_rf_classifier,
    plot_feature_importance,
    classify_for_mapping,
    show_two_maps_side_by_side,
)

Load data

Import the embedding dataset into the working environment so it can be processed and analysed.

Code
# Load the embedding data from file
# (the Data section lists a .gpkg version; adjust the path to match the file supplied)
file_path = "uk_lsoa_london_embeds_2024.geojson"

# Read the GeoJSON file and convert it into a GeoDataFrame
with open(file_path, "r") as f:
    gdf = gpd.GeoDataFrame.from_features(json.load(f))

Display information about data

Check that the data has loaded successfully by viewing basic information and sample records.

Code
# Display coordinate reference system (CRS)
# Prints None when the data were built with from_features without a crs argument
print(f"The CRS for this dataset is: {gdf.crs}")

# Preview the data
# Display the first few rows to inspect structure and attributes
gdf.head()
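
If the CRS prints as None, one simple sanity check before assuming longitude/latitude (EPSG:4326) is to look at the coordinate bounds. The helper below is a hypothetical sketch and is not part of the tools module. The two bounding boxes are approximate, illustrative values for Greater London.

```python
def looks_like_lonlat(bounds):
    """Return True if (minx, miny, maxx, maxy) fits within valid lon/lat ranges."""
    minx, miny, maxx, maxy = bounds
    return -180 <= minx <= maxx <= 180 and -90 <= miny <= maxy <= 90

# Approximate bounding box of Greater London in degrees (illustrative values)
london_degrees = (-0.51, 51.28, 0.34, 51.70)

# Roughly the same area in projected metres (illustrative values)
london_metres = (503000, 155000, 562000, 201000)

print(looks_like_lonlat(london_degrees))
print(looks_like_lonlat(london_metres))
```

In the lab this could be called as looks_like_lonlat(gdf.total_bounds). Passing the check is necessary but not sufficient: it rules out obviously projected coordinates, but it does not prove the data are in EPSG:4326.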

Preprocessing

Reduce the dataset to the embedding variables and key identifiers (e.g. LSOA codes and names), while retaining geometry. This keeps the analysis focused on the embedding features and keeps the data suitable for spatial mapping.

Code
# Filter dataset to retain only relevant variables
# Keeps identifier fields, embedding features, and geometry
gdf = filter_table(gdf)

# Preview the filtered dataset
# Confirm that only the required columns remain
gdf.head()

The filtered dataset no longer includes unnecessary attributes (e.g. ‘dzcode’). This reduces the number of variables and ensures the analysis focuses on the embedding features and key identifiers.

Predictive modelling

Predictive modelling uses existing data to train a model that estimates or classifies an outcome for new locations, so that patterns learned from embeddings and other features can be applied beyond the observed data.

Load socioeconomic data

This dataset includes the Index of Multiple Deprivation (IMD) deciles, along with additional variables identified in the literature as influencing deprivation. IMD deciles group areas into ten categories, where lower deciles indicate higher levels of deprivation and higher deciles indicate lower levels of deprivation.

Code
# Load data
df = pd.read_csv('Socioeconomic.csv')

Display the data

Code
df
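
The link between deprivation ranks and the deciles in this table can be sketched as follows. The sketch assumes deciles are formed by ranking areas from most to least deprived and splitting the ranking into ten equal groups. The function name and the toy set of 20 areas are illustrative.

```python
import math

def rank_to_decile(rank, n_areas):
    """Map a rank (1 = most deprived) to a decile (1 = most deprived 10%)."""
    return math.ceil(rank * 10 / n_areas)

n_areas = 20  # toy example with 20 areas
deciles = [rank_to_decile(r, n_areas) for r in range(1, n_areas + 1)]
print(deciles)
```

With 20 areas, each decile contains two areas: the two most deprived fall in decile 1 and the two least deprived in decile 10.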

Add socioeconomic data to our embedding data

The two datasets are linked using the common identifier LSOA21CD, allowing the embedding variables to be combined with the socioeconomic variables for each LSOA.

Code
# Join embedding data with socioeconomic data
# Merge on the common LSOA identifier to combine both datasets
gdf2 = gdf.merge(
    df,
    on="LSOA21CD",   # Shared key between datasets
    how="left"       # Keep all rows from gdf (embeddings), add matching socioeconomic data
)
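
With a left join, any LSOA without a match in the socioeconomic table is kept but receives missing values, which can quietly affect the models later. The sketch below uses toy tables (the values are invented) to show one way of checking this with the validate and indicator options of the pandas merge method.

```python
import pandas as pd

# Toy stand-ins for the embedding and socioeconomic tables
left = pd.DataFrame({
    "LSOA21CD": ["E01000001", "E01000002", "E01000003"],
    "A00_mean": [0.1, 0.2, 0.3],
})
right = pd.DataFrame({
    "LSOA21CD": ["E01000001", "E01000002"],
    "IMD_decile": [7, 3],
})

merged = left.merge(
    right,
    on="LSOA21CD",
    how="left",
    validate="one_to_one",  # raise an error if either key column has duplicates
    indicator=True,         # add a _merge column recording the match status
)

# Rows present in the left table with no socioeconomic match
unmatched = merged[merged["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(merged)} rows have no socioeconomic match")
```

The same options could be added to the merge above. A non-zero count of unmatched rows would be worth investigating before modelling.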

Display updated data

Confirm that embedding and socioeconomic variables have been successfully merged.

Code
# Display updated data
gdf2

Display the IMD data

Visualising the IMD data on a map helps reveal the spatial distribution of deprivation across London and makes patterns easier to interpret than in tabular form.

Code
# -----------------------------------------
# Interactive map of IMD deciles
# -----------------------------------------

# Add reference points (Name, latitude, longitude)
# These provide spatial context and help orient the map
pois = """
"Westminster", 51.4975, -0.1357
"City of London", 51.5155, -0.0922
"Canary Wharf", 51.5054, -0.0235
"King's Cross", 51.5308, -0.1238
"Heathrow Airport", 51.4700, -0.4543
"WE ARE HERE :)", 51.4962, -0.1298
"""

# Ensure the GeoDataFrame has a coordinate reference system (CRS)
# Use EPSG:4326 only if the coordinates are already longitude/latitude
if gdf2.crs is None:
    gdf2 = gdf2.set_crs(epsg=4326)

# Create interactive map
# Visualises IMD deciles across LSOAs with tooltips and optional basemaps
m = make_webmap_general(
    focus_gdf=gdf2,
    focus_col="IMD_decile",
    focus_name="LSOA IMD Deciles",
    focus_tooltip_cols=("LSOA21CD", "LSOA21NM", "IMD_decile"),
    focus_categorical=True,
    focus_legend=True,
    focus_cmap="YlOrRd_r",
    focus_style_kwds={
        "fillOpacity": 0.6,
        "weight": 0.2,
        "color": "black",
    },
    context_gdf=None,
    pois=pois,
    fit_to="focus",
    zoom_start=10,
)

# Display map
m
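
The reference points above are passed as a plain multi-line string. How parse_reference_points and make_webmap_general handle that string is defined in the local tools module and is not shown here. The sketch below is one hypothetical way such a string could be parsed using the standard csv module.

```python
import csv
import io

pois = """
"Westminster", 51.4975, -0.1357
"King's Cross", 51.5308, -0.1238
"""

def parse_pois(text):
    """Parse lines of the form: "Name", latitude, longitude."""
    reader = csv.reader(io.StringIO(text.strip()), skipinitialspace=True)
    points = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        name, lat, lon = row
        points.append((name, float(lat), float(lon)))
    return points

print(parse_pois(pois))
```

Quoting the names keeps any commas or apostrophes inside a place name from being read as field separators.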

Create a model to predict IMD using ONLY socioeconomic variables

The model uses variables identified in previous studies as predictors of deprivation, including:

  • Percentage with no qualifications (age 16+)
  • Percentage reporting bad or very bad health
  • Population density (per km²)
  • Percentage of lone-parent households

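A Random Forest, an ensemble of decision trees, is used. It can capture non-linear relationships and interactions between variables, and it does not rely on assumptions such as the proportional-odds assumption of ordinal logistic regression. Note that a classifier treats the ten deciles as unordered categories, so a prediction that is one decile out counts as an error in the same way as one that is nine deciles out.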

Code
# -----------------------------------------
# Run Random Forest model (socioeconomic variables only)
# -----------------------------------------

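# Train a classifier to predict IMD decile using selected socioeconomic predictors
# Settings match the models below so that results are directly comparable
results = run_rf_classifier(
    data=gdf2,
    y_col="IMD_decile",  # Target variable (deprivation decile)
    x_cols=[
        "Percent no qualifications 16 and over",
        "Percent bad and very band health",
        "Population density per km",
        "Percent lone family household",
    ],  # Predictor variables
    test_size=0.2,           # Proportion of data used for testing
    random_state=42,         # Ensures reproducibility
    n_estimators=500,        # Number of trees in the forest
    class_weight="balanced", # Handle class imbalance
)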
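
run_rf_classifier comes from the local tools module, so its internals are not shown in this lab. As a rough guide to what such a helper might wrap, the sketch below fits a scikit-learn Random Forest to synthetic data. The data, variable names and settings are illustrative assumptions, not the lab's implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: 500 areas, 4 predictors, and a 10-class target
X = rng.normal(size=(500, 4))
score = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500)

# Cut the score at its deciles to obtain classes 1 to 10
cut_points = np.quantile(score, np.linspace(0.1, 0.9, 9))
y = np.digitize(score, cut_points) + 1

# Hold out 20% of areas, keeping class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
model.fit(X_train, y_train)

# Share of held-out areas whose class is predicted exactly
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")

# Importance of each predictor (the values sum to 1)
for name, value in zip(["x0", "x1", "x2", "x3"], model.feature_importances_):
    print(name, round(float(value), 3))
```

Evaluating on held-out areas, rather than on the areas used for training, is what makes the accuracy figure a fair indication of how the model behaves on unseen locations.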

Create a model to predict IMD using ONLY embedding variables

Code
# -----------------------------------------
# Run Random Forest model (embedding features only)
# -----------------------------------------

# Define embedding feature columns (A00_mean to A63_mean)
feature_cols = [f"A{i:02d}_mean" for i in range(64)]

# Train classifier to predict IMD decile using embedding features
results = run_rf_classifier(
    data=gdf2,
    y_col="IMD_decile",      # Target variable (deprivation decile)
    x_cols=feature_cols,     # Embedding predictors
    test_size=0.2,           # Proportion of data used for testing
    random_state=42,         # Ensures reproducibility
    n_estimators=500,        # Number of trees in the forest
    class_weight="balanced", # Handle class imbalance
)
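
The class_weight="balanced" setting follows the scikit-learn convention of weighting each class by n_samples / (n_classes * class_count). This assumes run_rf_classifier passes the option through to a scikit-learn estimator, which is not shown in this lab. The toy decile counts below are invented to show the effect.

```python
from collections import Counter

# Toy target: decile 1 is rare, decile 5 is common
y = [1] * 10 + [5] * 60 + [10] * 30

counts = Counter(y)
n_samples = len(y)
n_classes = len(counts)

# Rarer classes receive larger weights
weights = {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

for cls in sorted(weights):
    print(cls, round(weights[cls], 2))
```

The rare decile receives the largest weight, so misclassifying it costs more during training than misclassifying a common decile.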

Create a model to predict IMD using both socioeconomic and embedding variables

Code
# -----------------------------------------
# Run Random Forest model (embeddings + socioeconomic variables)
# -----------------------------------------

# Define predictor variables
# Combine embedding features with selected socioeconomic indicators
feature_cols = (
    [f"A{i:02d}_mean" for i in range(64)] +
    [
        "Percent no qualifications 16 and over",
        "Percent bad and very band health",
        "Population density per km",
        "Percent lone family household",
    ]
)

# Train classifier to predict IMD decile
results = run_rf_classifier(
    data=gdf2,
    y_col="IMD_decile",      # Target variable (deprivation decile)
    x_cols=feature_cols,     # Combined predictors
    test_size=0.2,           # Proportion used for testing
    random_state=42,         # Reproducibility
    n_estimators=500,        # Number of trees
    class_weight="balanced", # Handle class imbalance
)

Plotting the top 15 variables

Code
# -----------------------------------------
# Plot feature importance
# -----------------------------------------

# Visualise the most important predictors from the most recently fitted model
# (results is overwritten by each run, so this refers to the combined model)
plot_feature_importance(
    results["importances"],  # Feature importance scores from the model
    top_n=15,                # Display top 15 features
    title="Top 15 Features Predicting IMD Decile",
)
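
To pick out the most influential embedding variable for the exercise at the end of the lab, the importances can also be ranked directly. The structure of results["importances"] is set by the local tools module and is not shown here. The sketch assumes a simple mapping from feature name to score, with invented values.

```python
# Hypothetical importances: feature name -> score (values are invented)
importances = {
    "A12_mean": 0.04,
    "Percent no qualifications 16 and over": 0.21,
    "A47_mean": 0.09,
    "Population density per km": 0.06,
}

# Rank all features from most to least important
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)

# Keep only embedding features (named A00_mean to A63_mean in this lab)
embedding_ranked = [
    (name, score) for name, score in ranked
    if name.startswith("A") and name.endswith("_mean")
]

top_embedding = embedding_ranked[0][0]
print(top_embedding)
```

If the importances are stored in another structure, such as a pandas Series or DataFrame, the same idea applies: sort by score, then filter on the embedding column names.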

Compare variables - side by side

Code
# -----------------------------------------
# Compare two variables side by side on maps
# -----------------------------------------

# Visualise IMD deciles alongside a socioeconomic variable
# Both variables are displayed using 10 quantile-based classes for comparability
show_two_maps_side_by_side(
    gdf2,
    left_var="IMD_decile",                        # Deprivation (target variable)
    right_var="Percent no qualifications 16 and over",  # Socioeconomic predictor
    tooltip_cols=("LSOA21CD", "LSOA21NM"),        # Information shown on hover
    scheme="quantile",                            # Classification method
    k=10,                                         # Number of classes
)

Code
# -----------------------------------------
# Compare two variables using equal intervals
# -----------------------------------------

# Visualise IMD deciles alongside a socioeconomic variable
# Values are grouped into 10 equal-width intervals for comparison
show_two_maps_side_by_side(
    gdf2,
    left_var="IMD_decile",                         # Deprivation (target variable)
    right_var="Percent no qualifications 16 and over",  # Socioeconomic predictor
    scheme="equal",                                # Equal-interval classification
    k=10,                                          # Number of classes
    tooltip_cols=("LSOA21NM",),                    # Information shown on hover
)

Code
# -----------------------------------------
# Compare two variables using natural breaks
# -----------------------------------------

# Visualise IMD deciles alongside a socioeconomic variable
# Values are grouped using natural breaks (Jenks), which highlight
# inherent groupings in the data
show_two_maps_side_by_side(
    gdf2,
    left_var="IMD_decile",                         # Deprivation (target variable)
    right_var="Percent no qualifications 16 and over",  # Socioeconomic predictor
    scheme="natural",                              # Natural breaks (Jenks)
    k=10,                                          # Number of classes
    tooltip_cols=("LSOA21NM",),                    # Information shown on hover
)
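
The three pairs of maps differ only in how the class breaks are chosen. The sketch below computes quantile and equal-interval breaks with NumPy for a skewed toy variable (values invented). Natural breaks are left out because they need an optimisation routine. How show_two_maps_side_by_side computes its breaks is not shown in this lab, so this illustrates the general idea only.

```python
import numpy as np

# Skewed toy variable: most values are small, one is very large
values = np.array([1, 2, 2, 3, 3, 4, 5, 6, 8, 40], dtype=float)
k = 4  # number of classes (the lab maps use 10)

# Quantile breaks: each class holds roughly the same number of areas
quantile_breaks = np.quantile(values, np.linspace(0, 1, k + 1))

# Equal-interval breaks: each class spans the same range of values
equal_breaks = np.linspace(values.min(), values.max(), k + 1)

# Count how many values fall into each class
quantile_counts, _ = np.histogram(values, bins=quantile_breaks)
equal_counts, _ = np.histogram(values, bins=equal_breaks)

print("Quantile breaks:", quantile_breaks, "counts:", quantile_counts)
print("Equal breaks:   ", equal_breaks, "counts:", equal_counts)
```

For skewed data, equal intervals place nine of the ten values in the first class and leave two classes empty, while quantiles spread the values across all four classes. This is why the same variable can look very different on the maps above.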

YOUR TURN: Choose the most influential embedding variable from the feature importance plot and compare it visually with the IMD variable to explore their spatial relationship. Briefly interpret any patterns or contrasts you observe.