Architecture

This document describes the internal architecture of DataSluice.

Overview

DataSluice is built around a layered architecture that separates domain models, transport concerns, and portal-specific adapters. The goal is to provide a single, consistent interface across heterogeneous open-data portals.

┌───────────────────────────────────────────────────────┐
│                      Public API                       │
│                 datasluice.DataSluice                 │
├──────────┬──────────┬──────────┬──────────────────────┤
│   CLI    │ pandas / │  dlt /   │   Airflow / DuckDB   │
│ (Typer)  │  Polars  │  DuckDB  │     integrations     │
├──────────┴──────────┴──────────┴──────────────────────┤
│                    Adapters Layer                     │
│               base · registry · factory               │
├─────────┬───────────┬──────────┬──────────────────────┤
│  CKAN   │ data.gouv │ Socrata  │   Custom adapters    │
├─────────┴───────────┴──────────┴──────────────────────┤
│                     Domain Models                     │
│  Dataset · Resource · Organization · License · Query  │
├───────────────────────────────────────────────────────┤
│                Cross-cutting Concerns                 │
│  transport · auth · discovery · io · formats · config │
└───────────────────────────────────────────────────────┘

Layers

1. Domain Models (datasluice.domain)

Plain dataclasses and types that represent open-data concepts in a portal-agnostic way. These are the lingua franca of the library: adapters map portal-native responses into these models, and consumers work with them directly.
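As a rough illustration of this mapping, the sketch below defines two frozen dataclasses and a function that converts a CKAN-style package dictionary into them. The field names and the dataset_from_ckan helper are assumptions made for this example, not the library's actual schema.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class Resource:
    # Hypothetical fields; the real Resource model may differ.
    id: str
    url: str
    format: Optional[str] = None


@dataclass(frozen=True)
class Dataset:
    # Hypothetical fields; the real Dataset model may differ.
    id: str
    title: str
    resources: Tuple[Resource, ...] = ()


def dataset_from_ckan(payload: Dict[str, Any]) -> Dataset:
    """Map a CKAN-style package dict into the portal-agnostic model."""
    resources = tuple(
        # Empty format strings are normalized to None.
        Resource(id=r["id"], url=r["url"], format=r.get("format") or None)
        for r in payload.get("resources", [])
    )
    return Dataset(id=payload["id"], title=payload["title"], resources=resources)
```

Freezing the dataclasses keeps the models hashable and prevents consumers from mutating state shared with an adapter.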

2. Adapters (datasluice.adapters)

Each open-data portal has a dedicated adapter that implements the BaseAdapter protocol. Adapters are responsible for:

  • Translating portal-native API responses into domain models.
  • Handling portal-specific pagination strategies.
  • Raising normalized errors.

The registry keeps track of known adapters; the factory resolves which adapter to use for a given portal URL or type.
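The relationship between protocol, registry, and factory can be sketched as follows. Every name here apart from BaseAdapter (register, create_adapter, AdapterNotFoundError, CkanAdapter, and the get_dataset method) is hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, Dict, Protocol


class BaseAdapter(Protocol):
    # Hypothetical method; the real protocol likely defines more.
    def get_dataset(self, dataset_id: str) -> Dict[str, Any]: ...


# Registry: portal type name -> adapter constructor.
_REGISTRY: Dict[str, Callable[[str], BaseAdapter]] = {}


class AdapterNotFoundError(LookupError):
    """Normalized error raised when no adapter matches a portal type."""


def register(portal_type: str):
    """Class decorator that records an adapter under a portal type name."""
    def decorator(cls):
        _REGISTRY[portal_type] = cls
        return cls
    return decorator


def create_adapter(portal_type: str, base_url: str) -> BaseAdapter:
    """Factory: look up the registered adapter and instantiate it."""
    try:
        adapter_cls = _REGISTRY[portal_type]
    except KeyError:
        raise AdapterNotFoundError(
            f"no adapter registered for {portal_type!r}"
        ) from None
    return adapter_cls(base_url)


@register("ckan")
class CkanAdapter:
    def __init__(self, base_url: str) -> None:
        self.base_url = base_url.rstrip("/")

    def get_dataset(self, dataset_id: str) -> Dict[str, Any]:
        # A real adapter would delegate to the transport layer here.
        return {"id": dataset_id, "source": self.base_url}
```

Because BaseAdapter is a typing.Protocol, CkanAdapter satisfies it structurally and does not need to inherit from it.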

3. Transport (datasluice.transport)

A shared HTTP client layer with built-in retry, rate-limiting, pagination, and user-agent management. All adapters delegate network I/O to this layer.
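To make the retry behaviour concrete, here is a minimal sketch of a retry decorator with exponential backoff. The names retry and TransientError, the default values, and the injectable sleep parameter are assumptions for this example; the actual transport layer may be structured differently.

```python
import functools
import time
from typing import Callable


class TransientError(Exception):
    """Stand-in for a retryable failure such as a timeout or an HTTP 503."""


def retry(
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
):
    """Retry the wrapped call on TransientError, doubling the delay each time."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return func(*args, **kwargs)
                except TransientError:
                    if attempt == attempts - 1:
                        raise  # out of attempts: surface the last error
                    sleep(base_delay * 2 ** attempt)
        return wrapper

    return decorator
```

Passing sleep as a parameter lets tests record the delays instead of actually waiting.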

4. Auth (datasluice.auth)

Pluggable authentication strategies (API key, bearer token, basic, custom headers). Auth strategies are injected into the transport layer.
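One way to picture pluggable strategies is as small objects that each contribute HTTP headers, which the transport merges into every request. The class names, the default X-API-Key header, and the build_headers helper below are all assumptions for illustration.

```python
import base64
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


class AuthStrategy(Protocol):
    """Anything that can contribute HTTP headers to a request."""

    def headers(self) -> Dict[str, str]: ...


@dataclass(frozen=True)
class ApiKey:
    key: str
    header_name: str = "X-API-Key"  # assumption: header names vary by portal

    def headers(self) -> Dict[str, str]:
        return {self.header_name: self.key}


@dataclass(frozen=True)
class BearerToken:
    token: str

    def headers(self) -> Dict[str, str]:
        return {"Authorization": f"Bearer {self.token}"}


@dataclass(frozen=True)
class BasicAuth:
    username: str
    password: str

    def headers(self) -> Dict[str, str]:
        raw = f"{self.username}:{self.password}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


def build_headers(
    user_agent: str, auth: Optional[AuthStrategy] = None
) -> Dict[str, str]:
    """What a transport might do: merge its own headers with the strategy's."""
    merged = {"User-Agent": user_agent}
    if auth is not None:
        merged.update(auth.headers())
    return merged
```

With this shape, adding a custom-header strategy means writing one more class with a headers() method; the transport itself does not change.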

5. Discovery (datasluice.discovery)

Auto-detection of portal types from URLs using fingerprints and heuristics, so users can pass a portal URL without knowing which software powers it.
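A purely URL-based heuristic might look like the sketch below. The fingerprints shown (a data.gouv.fr hostname, the CKAN /api/3/ path prefix, Socrata's /resource/ path) and the detect_portal_type function are illustrative assumptions; real detection would probably also probe portal endpoints over HTTP.

```python
from typing import Optional
from urllib.parse import urlparse

# Illustrative path fingerprints only; not the library's actual rules.
_PATH_HINTS = (
    ("/api/3/", "ckan"),
    ("/resource/", "socrata"),
)


def detect_portal_type(url: str) -> Optional[str]:
    """Guess the portal software from the URL alone, or return None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    # Hostname fingerprint: data.gouv.fr and its subdomains.
    if host == "data.gouv.fr" or host.endswith(".data.gouv.fr"):
        return "datagouv"

    # Path fingerprints: first matching fragment wins.
    for fragment, portal_type in _PATH_HINTS:
        if fragment in parsed.path:
            return portal_type
    return None
```

Returning None rather than raising leaves the decision about unknown portals to the caller, for example falling back to an explicit portal type.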

6. IO (datasluice.io)

File downloading, caching, checksum verification, and storage abstraction for materializing resources locally.
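Checksum verification, for instance, can be sketched with the standard library alone. The function names and the ChecksumMismatchError type are hypothetical, and SHA-256 is assumed here even though portals may publish other digest algorithms.

```python
import hashlib
from pathlib import Path


class ChecksumMismatchError(ValueError):
    """Raised when a downloaded file does not match its expected digest."""


def sha256_of(path: Path, chunk_size: int = 1 << 16) -> str:
    """Hash a file in chunks so large downloads never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: Path, expected_hex: str) -> None:
    """Compare the file's SHA-256 with the expected hex digest (case-insensitive)."""
    actual = sha256_of(path)
    if actual != expected_hex.strip().lower():
        raise ChecksumMismatchError(
            f"{path}: expected {expected_hex}, got {actual}"
        )
```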

7. Formats (datasluice.formats)

Format-specific readers that normalize CSV, JSON, XLSX, Parquet, and GeoJSON into common in-memory representations.
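The idea of normalizing formats into one representation can be illustrated with two standard-library readers that both return a list of row dictionaries. The READERS table, the read_records dispatcher, and the choice of "list of dicts" as the common representation are assumptions for this sketch.

```python
import csv
import io
import json
from typing import Any, Callable, Dict, List

Records = List[Dict[str, Any]]


def read_csv(data: bytes) -> Records:
    # utf-8-sig strips a leading byte-order mark if one is present.
    text = data.decode("utf-8-sig")
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def read_json(data: bytes) -> Records:
    parsed = json.loads(data)
    # A single JSON object is treated as a one-row dataset.
    return parsed if isinstance(parsed, list) else [parsed]


# Hypothetical dispatch table: format name -> reader.
READERS: Dict[str, Callable[[bytes], Records]] = {
    "csv": read_csv,
    "json": read_json,
}


def read_records(data: bytes, fmt: str) -> Records:
    """Pick a reader by format name and return rows as dictionaries."""
    try:
        reader = READERS[fmt.lower()]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt!r}") from None
    return reader(data)
```

Note that the CSV reader yields string values while the JSON reader preserves JSON types, so a real implementation would need a type-inference or schema step on top.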

8. Integrations (datasluice.integrations)

Optional integrations with the broader data ecosystem: pandas, Polars, dlt, Apache Airflow, and DuckDB.

Design Principles

  • Portal-agnostic domain models — consumers never touch portal-native JSON.
  • Adapter isolation — each portal's quirks live in its adapter subpackage.
  • Composable transport — retry, rate-limiting, and pagination are decorators, not baked into adapters.
  • Lazy imports — heavy optional dependencies (pandas, dlt, etc.) are imported on demand.
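The lazy-import principle can be sketched with importlib: the optional dependency is imported only when a feature that needs it is called. The require helper, the to_dataframe function, and the wording of the error message are hypothetical.

```python
import importlib
from types import ModuleType


def require(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency on first use, with a helpful error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature; "
            f"install the optional {extra!r} extra to enable it"
        ) from exc


def to_dataframe(records):
    # pandas is imported here, not at module import time, so users who
    # never call this function do not need pandas installed.
    pd = require("pandas", extra="pandas")
    return pd.DataFrame.from_records(records)
```

Importing the module that defines to_dataframe stays cheap; the cost of loading pandas is paid only on the first call.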