Technical Approach

From data collection to visualization: our comprehensive pipeline, interactive tools, and quality assurance framework


Data Processing Pipeline

A sophisticated multi-stage system for vacancy prediction and visualization

Figure: Building Stories technical architecture diagram

Pipeline Stages

Stage 1: Fetch

Gathers datasets from local authorities, GeoDirectory, and other geospatial repositories, handling format standardisation and keeping the data up to date.

Data Sources:

  • Local authorities across Ireland
  • GeoDirectory (national address database)
  • Census data and government repositories
  • Geospatial repositories and remote sensing data

Key Functions:

  • Multi-source data collection and integration
  • Format standardization and validation
  • Automated data refresh and versioning
  • Quality control and error handling
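As a rough sketch of this pattern (the source names, field maps, and fetcher functions below are hypothetical; real fetchers would call local-authority and GeoDirectory services), the fetch stage can be thought of as a registry of per-source fetchers whose output is mapped onto one shared schema:

```python
from typing import Callable

# Hypothetical fetchers standing in for real API calls or file downloads.
def fetch_geodirectory() -> list[dict]:
    return [{"Eircode": "D02X285", "BuildingUse": "residential"}]

def fetch_local_authority() -> list[dict]:
    return [{"eircode": "D02X285", "use": "residential"}]

# Map each source's field names onto one shared schema.
FIELD_MAPS = {
    "geodirectory": {"Eircode": "eircode", "BuildingUse": "use"},
    "local_authority": {"eircode": "eircode", "use": "use"},
}

SOURCES: dict[str, Callable[[], list[dict]]] = {
    "geodirectory": fetch_geodirectory,
    "local_authority": fetch_local_authority,
}

def fetch_all() -> list[dict]:
    """Collect records from every source and standardise field names."""
    records = []
    for name, fetch in SOURCES.items():
        mapping = FIELD_MAPS[name]
        for raw in fetch():
            row = {mapping[k]: v for k, v in raw.items() if k in mapping}
            row["source"] = name  # keep provenance for quality control
            records.append(row)
    return records
```

Keeping a `source` field on every record is what later lets quality control trace an error back to the authority that supplied it.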
Stage 2: Ingest

Cleans, joins, and standardises data into a shared coordinate system. Supports multiple aggregation functions and manages version-specific custom processing for different datasets.

Key Functions:

  • Data cleaning and quality validation
  • Data joining across multiple sources
  • Support for multiple aggregation functions
  • Dataset-specific preprocessing pipelines
  • Feature engineering and data enrichment
  • Temporal alignment and synchronization
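A minimal sketch of the join-and-aggregate idea, with toy records and a pluggable aggregation function (the field names and datasets are illustrative, not the project's actual schema):

```python
from statistics import mean

# Toy ingested records keyed by building ID (hypothetical schema).
energy = {"B1": [320.0, 310.0], "B2": [90.0]}       # meter readings
addresses = {"B1": {"eircode": "D02X285"}, "B2": {"eircode": "T12YF80"}}

# Pluggable aggregation functions, since the stage supports several.
AGGREGATIONS = {"mean": mean, "max": max, "min": min}

def ingest(agg: str = "mean") -> dict:
    """Join address and energy data, aggregating readings per building."""
    fn = AGGREGATIONS[agg]
    joined = {}
    for bid, attrs in addresses.items():
        joined[bid] = {**attrs, "energy": fn(energy.get(bid, [0.0]))}
    return joined
```

Registering aggregation functions in a dictionary keeps the stage open to dataset-specific choices (e.g. mean for readings, max for flags) without changing the join logic.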
Stage 3: Partitioned Database

PostgreSQL-based centralized storage with PostGIS extension, optimized for geospatial queries and machine learning operations.

Technology Stack:

  • Database: PostgreSQL with PostGIS extension for geospatial data
  • Bronze Store: Separate PostgreSQL database storing ingested data from the pipeline
  • Silver Store: Separate PostgreSQL database storing prediction results

Key Functions:

  • Geospatial data storage and indexing with PostGIS
  • Spatial partitioning for query optimization
  • Multi-dimensional indexing for high-performance retrieval
  • Automated periodic data backup processes
  • Scalable storage architecture
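The partitioning-plus-spatial-indexing idea can be sketched as generated DDL. The table name and the county-based partition key are assumptions, though SRID 2157 (Irish Transverse Mercator) is the standard projected coordinate system for Ireland:

```python
def partition_ddl(table: str, counties: list[str]) -> list[str]:
    """Generate DDL for a list-partitioned PostGIS table plus a
    spatial (GiST) index on each partition."""
    stmts = [
        f"CREATE TABLE {table} ("
        "id BIGINT, county TEXT, geom geometry(Point, 2157)"
        ") PARTITION BY LIST (county);"
    ]
    for c in counties:
        part = f"{table}_{c.lower()}"
        stmts.append(
            f"CREATE TABLE {part} PARTITION OF {table}"
            f" FOR VALUES IN ('{c}');"
        )
        # GiST is PostGIS's standard index type for geometry columns.
        stmts.append(f"CREATE INDEX ON {part} USING GIST (geom);")
    return stmts
```

Partition pruning then lets queries scoped to one county scan only that partition, while the per-partition GiST indexes accelerate spatial predicates within it.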
Stage 4: Vacancy Prediction Model

The Building Stories model, retrained monthly, predicts whether a building is vacant and produces a confidence score for each prediction. Predictions in the database are refreshed weekly as new data is ingested.

Model Architecture:

  • Algorithm: XGBoost (Extreme Gradient Boosting)
  • Input Features: Multi-modal data including census statistics, property records, geospatial attributes, and remote sensing data
  • Output: Binary vacancy classification with confidence scores (0-1)
  • Training: Monthly retraining with updated ground truth data
  • Inference: Weekly batch predictions for all buildings
  • Performance Metrics: Precision, recall, F1-score, and AUC-ROC
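The step from a model probability to a binary label with a confidence score in [0, 1] can be sketched as below; the 0.5 threshold is an assumption, and the pipeline's actual decision rule may differ:

```python
def classify(p_vacant: float, threshold: float = 0.5) -> tuple[str, float]:
    """Turn the model's vacancy probability into a binary label plus a
    confidence score for the chosen label."""
    if p_vacant >= threshold:
        return "vacant", p_vacant
    # Confidence in "occupied" is the complement of the vacancy probability.
    return "occupied", 1.0 - p_vacant
```

Reporting confidence in the chosen label (rather than the raw probability) means a score near 1.0 always reads as "the model is sure", whichever class was picked.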
Figure: Vacancy Predictions Map
Stage 5: Upload

Pulls relevant data from the partitioned database and orchestrates upload of this data to the visualization interface for end-user access.

Key Functions:

  • User-specific data filtering and access control
  • Data transformation for visualization requirements
  • API integration with visualization platforms
  • Real-time data synchronization
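A minimal sketch of role-based filtering and reshaping for the visualization layer (role names, dataset labels, and field names here are hypothetical):

```python
# Hypothetical role -> dataset access map; real credentials come from
# the visualization platform's role-based access control.
ACCESS = {
    "local_authority": {"vacancy", "energy"},
    "public": {"vacancy"},
}

PREDICTIONS = [
    {"building": "B1", "dataset": "vacancy", "score": 0.91},
    {"building": "B1", "dataset": "energy", "score": 0.40},
]

def upload_payload(role: str) -> list[dict]:
    """Keep only rows the role may see, reshaped for the map layer."""
    allowed = ACCESS.get(role, set())
    return [
        {"id": r["building"], "layer": r["dataset"], "value": r["score"]}
        for r in PREDICTIONS
        if r["dataset"] in allowed
    ]
```

Filtering server-side, before upload, keeps restricted datasets out of the payload entirely rather than relying on the front end to hide them.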

Visualization Tools

Interactive platform for exploring and analyzing building vacancy data

Figure: Building Stories visualization tool interface

Interactive Data Platform

Built with ArcGIS Experience Builder Developer Edition and ReactJS, the platform provides role-based access: users' credentials determine which datasets, and which vacancy prediction models trained on those datasets, they can see.

User Capabilities:

  • Create custom queries to filter and analyse data on the built environment
  • See the vacancy prediction score for each building
  • View integrated Mapillary street-view imagery
  • Access metadata detailing data quality and processing approaches
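The custom-query capability might look like the following sketch (the attributes and filter parameters are illustrative; the real platform queries feature layers through ArcGIS):

```python
from typing import Optional

# Hypothetical building attributes for the map layer.
BUILDINGS = [
    {"id": "B1", "county": "Dublin", "vacancy_score": 0.91},
    {"id": "B2", "county": "Cork", "vacancy_score": 0.12},
]

def query(county: Optional[str] = None,
          min_score: float = 0.0) -> list[dict]:
    """Apply the user's filters to the building layer."""
    return [
        b for b in BUILDINGS
        if (county is None or b["county"] == county)
        and b["vacancy_score"] >= min_score
    ]
```

For example, `query(min_score=0.5)` would surface only buildings the model considers likely vacant, which is the typical starting point for a vacancy survey.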

Technologies:

  • ArcGIS Experience Builder Developer Edition
  • ReactJS
  • Role-based Access
  • Mapillary Integration

Data Quality & Metadata

Comprehensive data governance and quality assurance throughout the pipeline

Data Profiling & Quality Assurance

Logging and data profiling take place at the start and end of each pipeline stage. Quality checks, reports, and records are stored in RDF (Resource Description Framework), a standard machine-readable format, and uploaded to a metadata repository.
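One way such a quality record could be serialized, hand-rolling Turtle with the W3C Data Quality Vocabulary (dqv). A real pipeline would likely use an RDF library such as rdflib, and the `ex:` namespace and choice of predicates here are assumptions:

```python
def quality_report_ttl(stage: str, completeness: float) -> str:
    """Serialize one quality measurement as a Turtle snippet, using the
    W3C Data Quality Vocabulary; ex: is an illustrative namespace."""
    return "\n".join([
        "@prefix dqv: <http://www.w3.org/ns/dqv#> .",
        "@prefix ex: <http://example.org/pipeline/> .",
        "",
        f"ex:{stage}_check a dqv:QualityMeasurement ;",
        f"    dqv:computedOn ex:{stage} ;",
        f"    dqv:value {completeness:.2f} .",
    ])
```

Because the output is plain RDF, any triple store or metadata repository can ingest these records and query them uniformly across pipeline stages.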

Automated Quality Checks

Continuous validation of data integrity, completeness, and consistency throughout the pipeline

RDF-based Metadata Storage

Quality checks, reports, and records stored in RDF format for standardized, machine-readable documentation

Version Control

Complete lineage tracking of data versions and processing stages

Data Governance

CKAN and Knowledge Graphs for transparent data management and accessibility

Figure: CKAN data catalog platform