The Zen of Declarative Data: Moving from “Steps” to “Specs”
If you look at a typical Spark notebook, you usually see a story of verbs: Read this, filter that, join these, write there. In academic terms, this is Imperative Programming—a manual recipe that focuses on the control flow of the machine.
But there is a more powerful paradigm: Data as Code. Borrowing from the “Code is Data” philosophy, we can move toward a Declarative Programming[^1] approach. Instead of writing a sequence of commands, we define our data structures in a way that the system can “read” and execute for us. We stop writing recipes and start writing specifications.
The Problem: The “Opaque” Pipeline
In an imperative pipeline, logic is “opaque.” If you want to know what the data quality rules are, you have to execute the code or parse complex strings. In software engineering terms, such a pipeline lacks Introspection[^2], the ability of a program to examine its own structure at runtime.
By treating our pipeline as data, we make the logic transparent, reusable, and structured.
Exploring the Raw Reality
Before we can define what the data should be, we have to see what it is: the raw GTFS transit feed for NYC subway stops, unfiltered and messy.
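As a minimal sketch of that first look, assuming the stops feed has already been landed as a Bronze Delta table (the path and app name below are illustrative), we can peek at the rows and count how many stops arrive without a parent_station:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("gtfs-bronze-exploration").getOrCreate()

# Illustrative location of the landed GTFS stops feed.
BRONZE_PATH = "/lake/bronze/gtfs/stops"

raw = spark.read.format("delta").load(BRONZE_PATH)

# Peek at the raw rows and see how many stops are missing a parent_station.
raw.select("stop_id", "stop_name", "location_type", "parent_station").show(5, truncate=False)
raw.groupBy(F.col("parent_station").isNull().alias("missing_parent")).count().show()
```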
The Principle: Homoiconicity in Data
In a declarative system, we treat our transformations as metadata. In Python, we can achieve this through an Internal DSL[^3] using decorators. We wrap our logic in a layer of “Data as Code.”
```python
from pyspark.sql import functions as F

# temporary_view and expect_or_drop are the custom decorators of our internal DSL;
# spark and BRONZE_PATH are assumed to be defined elsewhere in the session.
@temporary_view(name="silver_table")
@expect_or_drop(lambda df: df["parent_station"].isNotNull())
def silver_table():
    df = spark.read.format("delta").load(BRONZE_PATH)
    return df.withColumn("silver_ingest_ts", F.current_timestamp())
```
The Lisp community introduced Homoiconicity[^4]: the radical idea that a program’s internal representation should be identical to its data structure. In Lisp, code is just a nested list, and lists can be manipulated by other code.
While Python is not a homoiconic language by nature, our declarative approach mimics this principle to solve the “Opaque Pipeline” problem. We stop treating our Python functions as black boxes of execution and start treating them as Intermediate Representations (IR) of our data platform’s state.
- The Function as an Object (IR): The `@table` decorator transforms a standard Python function into a rich data object. Before the Spark engine even initializes a task, the decorator has already “read” the function to extract destination paths, partitioning keys, and schema metadata. The function has effectively become a manifest of the table it represents.
- Metadata as Executable Documentation: Traditional documentation rots because it lives in a wiki or a comment. In our model, the `@expect_or_drop` decorator acts as a Semantic Constraint. By turning a business rule (like “parent_station cannot be null”) into an attribute of the function object, the rule is no longer just described; it is encoded into the pipeline’s DNA. The code doesn’t just perform the check; it is the check (a sketch of one possible implementation follows this list).
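Here is that sketch: a minimal implementation assuming a simple module-level registry and attribute-based metadata. The registry and attribute names are illustrative, not a fixed API.

```python
import functools

_PIPELINE_REGISTRY = {}  # the "pipeline data": every declared node, queryable without running Spark

def expect_or_drop(predicate):
    """Attach a row-level constraint to the function as metadata, and enforce it at run time."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            df = func(*args, **kwargs)
            return df.filter(predicate(df))  # drop rows that violate the encoded rule
        # The rule itself lives on the function object, where tooling can read it.
        wrapper._expectations = getattr(func, "_expectations", []) + [predicate]
        return wrapper
    return decorator

def temporary_view(name):
    """Record the destination view name and register the node in the pipeline registry."""
    def decorator(func):
        func._view_name = name
        _PIPELINE_REGISTRY[name] = func
        return func
    return decorator
```

Because `@expect_or_drop` is the inner decorator, the quality check wraps the transformation; `@temporary_view` then registers the fully wrapped node under its destination name.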
This creates a “System of Record” for your logic. The result is a refined dataset where the rules are enforced by a runtime that understands the “Pipeline Data” we encoded in our decorators.
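Under the same illustrative assumptions, a hypothetical runner can walk that registry and materialize each declared view, with the encoded constraints enforced inside every node:

```python
def run_pipeline(spark):
    """Hypothetical runner: materialize every registered node as a temporary view."""
    for view_name, node in _PIPELINE_REGISTRY.items():
        df = node()                          # expectations are enforced inside the wrapper
        df.createOrReplaceTempView(view_name)

run_pipeline(spark)

# Every surviving row satisfies the encoded rule, so this count comes back as zero.
spark.sql("SELECT COUNT(*) AS bad_rows FROM silver_table WHERE parent_station IS NULL").show()
```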
Deep Thought: The Power of Introspection
In a standard imperative pipeline, the code is a “black box” to the orchestration layer. The system knows when to run a script, but it has no idea what the script intends to do until the CPU starts executing lines.
By treating the pipeline as a structured registry of metadata, we enable Reflection. Because our decorators have captured the “intent” of the code (the destination table, the quality constraints, the source dependencies), we can query the codebase as if it were a database itself.
This shift transforms a repository from a collection of opaque scripts into a Knowledge Graph of Data Logic. We can programmatically enforce global policies—such as “No Silver-tier table may be written without a non-null constraint”—without ever running the Spark job. The system finally “knows” itself, allowing for automated documentation and lineage that is guaranteed to be accurate because it is derived directly from the code’s own structure.
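Building on the registry sketch above, such a policy check could run purely against metadata, for example in CI, and never touch a cluster. The names below are illustrative:

```python
def audit_silver_constraints():
    """Illustrative policy: every silver-tier node must declare at least one expectation."""
    violations = [
        view_name
        for view_name, node in _PIPELINE_REGISTRY.items()
        if view_name.startswith("silver") and not getattr(node, "_expectations", [])
    ]
    if violations:
        raise RuntimeError(f"Silver tables missing quality constraints: {violations}")

audit_silver_constraints()  # pure introspection: no cluster, no data movement
```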
The Open-Closed Pipeline: Scaling Without Friction
Traditional pipelines often suffer from “logic bloat.” As business requirements grow, a simple 10-line transformation evolves into a 200-line monolith of if-else blocks and validation checks. This violates the Open-Closed Principle: the idea that a system should be open for extension but closed for modification.
In a declarative, homoiconic-inspired setup, we achieve this by separating the Core Transformation from the Behavioral Traits.
- The Core is Closed: The actual Spark logic remains focused solely on the business transformation.
- The Traits are Open: New requirements—like PII masking, data quality checks, or specialized partitioning—are added as new decorators.
You are extending the pipeline’s capabilities by adding more “data” (metadata) to the function’s definition, rather than hacking away at the “guts” of the execution logic. This modularity is what prevents the “spaghetti” effect as data platforms scale.
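As a sketch of that extension-by-addition, a hypothetical `@mask_pii` trait (reusing the illustrative decorators and registry from earlier) stacks onto the node while the body of `silver_table` stays untouched:

```python
import functools
from pyspark.sql import functions as F

def mask_pii(*columns):
    """Hypothetical trait: hash the named columns on the way out and record them as metadata."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            df = func(*args, **kwargs)
            for column in columns:
                df = df.withColumn(column, F.sha2(F.col(column).cast("string"), 256))
            return df
        wrapper._masked_columns = list(columns)   # discoverable by audits, like expectations
        return wrapper
    return decorator

# Extension without modification: the core transformation below is identical to before.
@temporary_view(name="silver_table")
@mask_pii("stop_name")                            # new requirement arrives as a new trait
@expect_or_drop(lambda df: df["parent_station"].isNotNull())
def silver_table():
    df = spark.read.format("delta").load(BRONZE_PATH)
    return df.withColumn("silver_ingest_ts", F.current_timestamp())
```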
Conclusion: From Coder to Architect
When we embrace Data as Code, the role of the Data Engineer undergoes a fundamental shift. We move beyond the “Data Janitor” phase—spending our days manually cleaning and moving bytes from point A to point B.
Instead, we become Architects of Intent. We build systems that don’t just execute chores, but understand the contracts and constraints of the data they process. By utilizing Delta Tables and declarative decorators, we create pipelines that are robust, searchable, and self-aware. We aren’t just writing scripts anymore; we are encoding the very DNA of our data’s journey.
Footnotes
[^1]: Declarative Programming: A paradigm that expresses the logic of a computation without describing its control flow. It focuses on what the program should accomplish rather than how to achieve it.

[^2]: Introspection / Reflection: The ability of a system to observe and modify its own structure and behavior. In this context, using decorators to inspect function metadata at runtime.

[^3]: Internal DSL: A Domain-Specific Language implemented within a general-purpose host language (like Python), utilizing the host’s syntax to create a specialized toolset for a specific field.

[^4]: Homoiconicity: From the Greek homo (same) and icon (representation). A property of some programming languages in which the primary representation of programs is also a data structure in a primitive type of the language.