delta-doctor

Production-ready Delta table maintenance for Microsoft Fabric.

Fabric automates Delta maintenance for Warehouses. For Lakehouses, you're on your own. Delta Doctor fixes that.

v0.1 — ready to deploy

Silent cost inflation

Fabric capacity SKUs double in cost at every tier. Poor Delta table maintenance silently inflates that cost — not because you have more data or more users, but because the tables have never been properly maintained. This is not a data volume problem. It is a maintenance problem.

Small files accumulate

Every Dataflow, Copy activity, and Python kernel notebook (DuckDB, Polars) lands small files with no auto-compaction. Even Spark notebooks — which have auto-compaction available — still need explicit OPTIMIZE and VACUUM to stay healthy.

Deletion vectors build up

MERGE and DELETE operations write deletion vector files rather than rewriting data. They accumulate silently and slow every subsequent read.

Queries scan more than needed

Stale liquid clustering and oversized files mean the SQL Analytics Endpoint and Direct Lake scan far more data than necessary on every query.

Capacity SKUs double at every tier

An F64 costs twice an F32. A poorly maintained Lakehouse can force you to the next SKU when the real fix is a VACUUM and an OPTIMIZE.

Three pillars of maintenance

delta-doctor is organised around Diagnosis, Treatment, and Preventative Care — seven notebooks you import directly into your Fabric workspace. No package manager, no infrastructure, no configuration files.

01

Diagnosis

Understand what your tables look like before you change anything.

  • doctor_diagnosis_table_health Scans all tables in a Lakehouse and produces a health report — file counts, average file sizes, fragmentation status, deletion vector state, and clustering state. Classifies each table as Healthy, Review, Needs OPTIMIZE, Oversized, or a skip status.

02

Treatment

Fix what is broken and restore tables to a healthy baseline.

  • doctor_treatment_table_maintenance Runs OPTIMIZE and VACUUM on a single table. Called as the final step of each pipeline load.
  • doctor_treatment_maintenance_orchestrator Iterates all tables in a Lakehouse, running OPTIMIZE and VACUUM on each. Scheduled pipeline for Lakehouse-wide maintenance.
  • doctor_treatment_rebaseline_orchestrator Runs REORG TABLE APPLY (PURGE) followed by OPTIMIZE on every table, rewriting files to the layer target and purging deletion vectors. One-off rebaseline for a neglected Lakehouse.

03

Preventative Care

Configure tables and sessions correctly so problems do not reoccur.

  • doctor_prevention_session_config Sets up a Spark session with the correct baseline configurations for a given medallion layer. Called at the top of every pipeline notebook.
  • doctor_prevention_set_table_properties Sets Delta table properties by layer on a single table — deletion vectors, auto-compaction, optimise write, V-Order, and target file size. Optionally enables liquid clustering.
  • doctor_prevention_set_properties_orchestrator Iterates all tables in a Lakehouse and applies table properties for each. Run once per Lakehouse at onboarding time.

Up and running in three steps

No package manager. No configuration files. Import and run.

  1. Download or clone the repository

    Clone bradcoles-dev/delta-doctor from GitHub, or download the ZIP and extract it locally.

  2. Import the notebooks into your Fabric workspace

    Open the Data Engineering experience in Microsoft Fabric and use Import notebook to upload the notebooks. All notebooks must reside in the same workspace as the target Lakehouse.

  3. Run doctor_diagnosis_table_health first

    Pass lakehouse_guid as a parameter and run it to see the current state of your tables before changing anything. Your Lakehouse GUID is visible in the browser URL when the Lakehouse is open in Fabric.

Prerequisites: Microsoft Fabric workspace, Lakehouse, Spark Runtime 1.3 or later.