Technical Approach

From data collection to visualization: our comprehensive pipeline, interactive tools, and quality assurance framework


Data Processing Pipeline

A sophisticated multi-stage system for vacancy prediction and visualization

Figure: Building Stories technical architecture diagram

Pipeline Stages

Stage 1: Fetch

Gathers datasets from local authorities, GeoDirectory, and other geospatial repositories, handling format standardisation and keeping the data up to date.

Data Sources:

  • Local authorities across Ireland
  • GeoDirectory (national address database)
  • Census data and government repositories
  • Geospatial repositories and remote sensing data

Key Functions:

  • Multi-source data collection and integration
  • Format standardization and validation
  • Automated data refresh and versioning
  • Quality control and error handling
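As a rough sketch of this pattern (the source names, field maps, and fetcher functions below are hypothetical; real fetchers would call local-authority and GeoDirectory services), the fetch stage can be thought of as a registry of per-source fetchers whose output is mapped onto one shared schema:

```python
from typing import Callable

# Hypothetical fetchers standing in for real API calls or file downloads.
def fetch_geodirectory() -> list[dict]:
    return [{"Eircode": "D02X285", "BuildingUse": "residential"}]

def fetch_local_authority() -> list[dict]:
    return [{"eircode": "D02X285", "use": "residential"}]

# Map each source's field names onto one shared schema.
FIELD_MAPS = {
    "geodirectory": {"Eircode": "eircode", "BuildingUse": "use"},
    "local_authority": {"eircode": "eircode", "use": "use"},
}

SOURCES: dict[str, Callable[[], list[dict]]] = {
    "geodirectory": fetch_geodirectory,
    "local_authority": fetch_local_authority,
}

def fetch_all() -> list[dict]:
    """Collect records from every source and standardise field names."""
    records = []
    for name, fetch in SOURCES.items():
        mapping = FIELD_MAPS[name]
        for raw in fetch():
            row = {mapping[k]: v for k, v in raw.items() if k in mapping}
            row["source"] = name  # keep provenance for quality control
            records.append(row)
    return records
```

Keeping a `source` field on every record is what later lets quality control trace an error back to the authority that supplied it.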
Stage 2: Ingest

Cleans, joins, and standardises data into a shared coordinate system. Supports multiple aggregation functions and manages version-specific custom processing for different datasets.

Key Functions:

  • Data cleaning and quality validation
  • Data joining across multiple sources
  • Support for multiple aggregation functions
  • Dataset-specific preprocessing pipelines
  • Feature engineering and data enrichment
  • Temporal alignment and synchronization
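A minimal sketch of the join-and-aggregate idea, with toy records and a pluggable aggregation function (the field names and datasets are illustrative, not the project's actual schema):

```python
from statistics import mean

# Toy ingested records keyed by building ID (hypothetical schema).
energy = {"B1": [320.0, 310.0], "B2": [90.0]}       # meter readings
addresses = {"B1": {"eircode": "D02X285"}, "B2": {"eircode": "T12YF80"}}

# Pluggable aggregation functions, since the stage supports several.
AGGREGATIONS = {"mean": mean, "max": max, "min": min}

def ingest(agg: str = "mean") -> dict:
    """Join address and energy data, aggregating readings per building."""
    fn = AGGREGATIONS[agg]
    joined = {}
    for bid, attrs in addresses.items():
        joined[bid] = {**attrs, "energy": fn(energy.get(bid, [0.0]))}
    return joined
```

Registering aggregation functions in a dictionary keeps the stage open to dataset-specific choices (e.g. mean for readings, max for flags) without changing the join logic.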
Stage 3: Partitioned Database

PostgreSQL-based centralized storage with PostGIS extension, optimized for geospatial queries and machine learning operations.

Technology Stack:

  • Database: PostgreSQL with PostGIS extension for geospatial data
  • Bronze Store: Separate PostgreSQL database storing ingested data from the pipeline
  • Silver Store: Separate PostgreSQL database storing prediction results

Key Functions:

  • Geospatial data storage and indexing with PostGIS
  • Spatial partitioning for query optimization
  • Multi-dimensional indexing for high-performance retrieval
  • Automated periodic data backup processes
  • Scalable storage architecture
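The partitioning-plus-spatial-indexing idea can be sketched as generated DDL. The table name and the county-based partition key are assumptions, though SRID 2157 (Irish Transverse Mercator) is the standard projected coordinate system for Ireland:

```python
def partition_ddl(table: str, counties: list[str]) -> list[str]:
    """Generate DDL for a list-partitioned PostGIS table plus a
    spatial (GiST) index on each partition."""
    stmts = [
        f"CREATE TABLE {table} ("
        "id BIGINT, county TEXT, geom geometry(Point, 2157)"
        ") PARTITION BY LIST (county);"
    ]
    for c in counties:
        part = f"{table}_{c.lower()}"
        stmts.append(
            f"CREATE TABLE {part} PARTITION OF {table}"
            f" FOR VALUES IN ('{c}');"
        )
        # GiST is PostGIS's standard index type for geometry columns.
        stmts.append(f"CREATE INDEX ON {part} USING GIST (geom);")
    return stmts
```

Partition pruning then lets queries scoped to one county scan only that partition, while the per-partition GiST indexes accelerate spatial predicates within it.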
Stage 4: Vacancy Prediction Model

The Building Stories model, retrained monthly, predicts whether a building is vacant and produces a confidence score for each prediction. Predictions in the database are refreshed weekly as new data is ingested.

Model Architecture:

  • Algorithm: XGBoost (Extreme Gradient Boosting)
  • Input Features: Multi-modal data including census statistics, property records, geospatial attributes, and remote sensing data
  • Output: Binary vacancy classification with confidence scores (0-1)
  • Training: Monthly retraining with updated ground truth data
  • Inference: Weekly batch predictions for all buildings
  • Performance Metrics: Precision, recall, F1-score, and AUC-ROC
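The step from a model probability to a binary label with a confidence score in [0, 1] can be sketched as below; the 0.5 threshold is an assumption, and the pipeline's actual decision rule may differ:

```python
def classify(p_vacant: float, threshold: float = 0.5) -> tuple[str, float]:
    """Turn the model's vacancy probability into a binary label plus a
    confidence score for the chosen label."""
    if p_vacant >= threshold:
        return "vacant", p_vacant
    # Confidence in "occupied" is the complement of the vacancy probability.
    return "occupied", 1.0 - p_vacant
```

Reporting confidence in the chosen label (rather than the raw probability) means a score near 1.0 always reads as "the model is sure", whichever class was picked.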
Figure: Vacancy Predictions Map
Stage 5: Upload

Pulls relevant data from the partitioned database and orchestrates upload of this data to the visualization interface for end-user access.

Key Functions:

  • User-specific data filtering and access control
  • Data transformation for visualization requirements
  • API integration with visualization platforms
  • Real-time data synchronization
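A minimal sketch of role-based filtering and reshaping for the visualization layer (role names, dataset labels, and field names here are hypothetical):

```python
# Hypothetical role -> dataset access map; real credentials come from
# the visualization platform's role-based access control.
ACCESS = {
    "local_authority": {"vacancy", "energy"},
    "public": {"vacancy"},
}

PREDICTIONS = [
    {"building": "B1", "dataset": "vacancy", "score": 0.91},
    {"building": "B1", "dataset": "energy", "score": 0.40},
]

def upload_payload(role: str) -> list[dict]:
    """Keep only rows the role may see, reshaped for the map layer."""
    allowed = ACCESS.get(role, set())
    return [
        {"id": r["building"], "layer": r["dataset"], "value": r["score"]}
        for r in PREDICTIONS
        if r["dataset"] in allowed
    ]
```

Filtering server-side, before upload, keeps restricted datasets out of the payload entirely rather than relying on the front end to hide them.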

Visualization Tools

Interactive platform for exploring and analyzing building vacancy data

Figure: Building Stories visualization tool interface

Interactive Data Platform

Built with ArcGIS Experience Builder Developer Edition and ReactJS, the platform provides role-based access: users' credentials determine which datasets, and which vacancy prediction models trained on those datasets, they can see.

User Capabilities:

  • Create custom queries to filter and analyse data on the built environment
  • See the vacancy prediction score for each building
  • View integrated Mapillary street-view imagery
  • Access metadata detailing data quality and processing approaches
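The custom-query capability might look like the following sketch (the attributes and filter parameters are illustrative; the real platform queries feature layers through ArcGIS):

```python
from typing import Optional

# Hypothetical building attributes for the map layer.
BUILDINGS = [
    {"id": "B1", "county": "Dublin", "vacancy_score": 0.91},
    {"id": "B2", "county": "Cork", "vacancy_score": 0.12},
]

def query(county: Optional[str] = None,
          min_score: float = 0.0) -> list[dict]:
    """Apply the user's filters to the building layer."""
    return [
        b for b in BUILDINGS
        if (county is None or b["county"] == county)
        and b["vacancy_score"] >= min_score
    ]
```

For example, `query(min_score=0.5)` would surface only buildings the model considers likely vacant, which is the typical starting point for a vacancy survey.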

Technologies:

  • ArcGIS Experience Builder Developer Edition
  • ReactJS
  • Role-based Access
  • Mapillary Integration

Data Quality & Metadata

Comprehensive data governance and quality assurance throughout the pipeline

Data Profiling & Quality Assurance

Logging and data profiling take place at the start and end of each pipeline stage. Quality checks, reports, and records are stored in RDF (Resource Description Framework), a standard machine-readable format, and uploaded to a metadata repository.
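One way such a quality record could be serialized, hand-rolling Turtle with the W3C Data Quality Vocabulary (dqv). A real pipeline would likely use an RDF library such as rdflib, and the `ex:` namespace and choice of predicates here are assumptions:

```python
def quality_report_ttl(stage: str, completeness: float) -> str:
    """Serialize one quality measurement as a Turtle snippet, using the
    W3C Data Quality Vocabulary; ex: is an illustrative namespace."""
    return "\n".join([
        "@prefix dqv: <http://www.w3.org/ns/dqv#> .",
        "@prefix ex: <http://example.org/pipeline/> .",
        "",
        f"ex:{stage}_check a dqv:QualityMeasurement ;",
        f"    dqv:computedOn ex:{stage} ;",
        f"    dqv:value {completeness:.2f} .",
    ])
```

Because the output is plain RDF, any triple store or metadata repository can ingest these records and query them uniformly across pipeline stages.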

Automated Quality Checks

Continuous validation of data integrity, completeness, and consistency throughout the pipeline

RDF-based Metadata Storage

Quality checks, reports, and records stored in RDF format for standardized, machine-readable documentation

Version Control

Complete lineage tracking of data versions and processing stages

Data Governance

CKAN and Knowledge Graphs for transparent data management and accessibility

Figure: CKAN data catalog platform