Beyond the Cron Job: Building Resilient Automated Reporting Pipelines in 2026
Stop manually exporting CSVs. Learn how to build a declarative, version-controlled reporting pipeline using Dagster and DuckDB that stakeholders can actually trust.

The Sunday Morning CSV Nightmare
I once spent my Sunday mornings manually exporting CSVs from a production Postgres instance because our "automated" reporting script leaked memory and crashed every time the dataset crossed 1 GB. If your reporting strategy relies on a fragile cron job and a prayer, you're not building a feature; you're building a PagerDuty alert that only fires when your CEO is looking at a blank dashboard. In 2026, there is no excuse for silent failures or manual data munging.
Why Declarative Pipelines Matter Now
In the past, we treated reporting as a side effect of our applications. We'd write a script, schedule it with crontab, and hope the database was up. Today, the landscape has shifted. We are no longer just sending static PDFs; we are feeding downstream LLMs for automated executive summaries, updating live data-driven documents like Evidence.dev, and maintaining data contracts across microservices. The complexity has reached a point where imperative scripts (do this, then that) fail because they lack state awareness. We need declarative pipelines where we define the result we want, and the orchestrator handles the how.
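To make "declarative" concrete: in an asset-based orchestrator like Dagster, you declare what each dataset is and what it depends on, and the scheduler works out execution order and staleness. A minimal sketch (the asset names here are illustrative, not part of the pipeline below):

```python
import dagster as dg

@dg.asset
def raw_transactions() -> list[dict]:
    # Declares *what* this dataset is; Dagster decides when to (re)build it
    return [{"id": 1, "amount_usd": 42.0}]

@dg.asset
def daily_report(raw_transactions: list[dict]) -> float:
    # The dependency is declared via the parameter name -- no manual ordering,
    # no "step 1, then step 2" script. If raw_transactions is stale, so is this.
    return sum(row["amount_usd"] for row in raw_transactions)
```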
Section 1: The 2026 Tech Stack for Reporting
Forget the heavy JVM-based monsters of the 2010s. For modern reporting, I've standardized on a three-tier stack that scales from a startup to a mid-sized enterprise without breaking the bank:
- Dagster (Orchestration): Unlike Airflow, Dagster treats data as a first-class citizen (Assets). It knows if a report is missing because the underlying data wasn't updated.
- DuckDB (Compute): The 'SQLite for Analytics.' It lets us run large aggregations over Parquet files in-process, right inside the pipeline, without needing a $5,000/month Snowflake cluster for simple reports (see the snippet just after this list).
- Evidence.dev (Delivery): It treats reports like code. You write Markdown and SQL, and it builds a high-performance web app. No more dragging boxes in Tableau.
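As a taste of the DuckDB piece, here it is querying a Parquet file directly, no warehouse involved (the local file path is a hypothetical stand-in; the real pipeline below reads from S3):

```python
import duckdb

# Query a Parquet file in-process; DuckDB only reads the columns it needs
con = duckdb.connect()
result = con.execute(
    "SELECT category, sum(amount_usd) AS revenue "
    "FROM read_parquet('transactions.parquet') "
    "GROUP BY category"
).df()
print(result)
```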
Section 2: Building the Declarative Pipeline
Here is a concrete example of a Dagster asset that generates a daily sales report. This isn't pseudocode; this is how I structure production pipelines today. We use Python 3.12 and the latest Dagster API.
```python
import dagster as dg
import duckdb
from datetime import datetime, timezone


@dg.asset(
    group_name="finance_reports",
    compute_kind="duckdb",
    metadata={"owner": "ugur-kaval", "priority": "high"},
)
def monthly_revenue_summary(context: dg.AssetExecutionContext):
    """
    Aggregates raw transaction data into a monthly summary for the CFO.
    Uses DuckDB for high-speed local processing of S3 Parquet files.
    """
    # In a real scenario, these paths would come from environment variables
    raw_data_path = "s3://company-datalake/raw/transactions/*.parquet"
    output_path = "s3://company-datalake/gold/monthly_revenue.parquet"

    conn = duckdb.connect(database=":memory:")
    # DuckDB reads directly from S3 via the httpfs extension, which must be
    # installed and loaded first (credentials come from the environment)
    conn.execute("INSTALL httpfs; LOAD httpfs;")

    query = f"""
        SELECT
            date_trunc('month', transaction_date) AS report_month,
            category,
            sum(amount_usd) AS total_revenue,
            count(DISTINCT customer_id) AS customer_count
        FROM read_parquet('{raw_data_path}')
        WHERE status = 'completed'
        GROUP BY 1, 2
        ORDER BY 1 DESC
    """
    df = conn.execute(query).df()

    # Data quality checks: an empty result or negative revenue means
    # something upstream is broken -- fail loudly instead of shipping it
    if df.empty:
        raise dg.Failure(description="Revenue summary came back empty!")
    if (df["total_revenue"] < 0).any():
        raise dg.Failure(description="Negative revenue detected in summary!")

    # Save to the gold layer for Evidence.dev to pick up
    # (pandas writes to s3:// paths via s3fs)
    df.to_parquet(output_path)

    context.add_output_metadata({
        "row_count": len(df),
        "total_monthly_revenue": float(df["total_revenue"].sum()),
        "last_updated": datetime.now(timezone.utc).isoformat(),
    })
    return df
```
The schedule runs at 8 AM UTC every day:

```python
sales_report_job = dg.define_asset_job(
    name="sales_report_job",
    selection="monthly_revenue_summary",
)

sales_report_schedule = dg.ScheduleDefinition(
    job=sales_report_job,
    cron_schedule="0 8 * * *",
    execution_timezone="UTC",
)
```
Section 3: The Delivery Layer
Once the data is processed, we don't just dump it. We use Evidence.dev to create a version-controlled report. The beauty here is that your report lives in the same Git repo as your pipeline. When the Parquet file updates, the report re-renders.
```markdown
# Finance Overview: {new Date().toLocaleDateString()}

<DataTable data={monthly_revenue_summary} />

<LineChart
    data={monthly_revenue_summary}
    x=report_month
    y=total_revenue
    series=category
    title="Revenue Growth by Category"
/>
```
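One detail the snippet above glosses over: Evidence resolves `monthly_revenue_summary` from a named SQL block on the same page. A minimal sketch, assuming a configured source called `gold` that exposes the Parquet output as a `monthly_revenue` table (both names are illustrative):

````markdown
```sql monthly_revenue_summary
select * from gold.monthly_revenue
```
````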
Senior Engineer Note: This report is generated automatically by the Dagster pipeline. If data looks stale, check the Dagster Cloud Logs.
What Went Wrong: Lessons from the Trenches
Early in my career, I ignored Backfills. I built a pipeline that worked perfectly for 'today,' but when a stakeholder asked for the last 3 years of data, the script timed out because it tried to load everything into memory.
The Lesson: Always partition your data. In Dagster, use DailyPartitionsDefinition. It allows you to run your pipeline for a specific day without re-processing the entire history. This saved us when a bug in our payment gateway corrupted two days of data a couple of weeks back; we simply 'replayed' those two days in the pipeline.
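A minimal sketch of what that looks like, reusing the style of the asset from Section 2 (the 2023 start date is illustrative):

```python
import dagster as dg

daily_partitions = dg.DailyPartitionsDefinition(start_date="2023-01-01")

@dg.asset(partitions_def=daily_partitions, group_name="finance_reports")
def daily_revenue_summary(context: dg.AssetExecutionContext):
    # Each run only touches one day's slice of the data, so a backfill
    # of three years is many small runs instead of one giant one
    day = context.partition_key  # e.g. "2023-01-01"
    context.log.info(f"Processing transactions for {day}")
    # ...filter read_parquet() by transaction_date = day, as in Section 2
```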
Another failure: The Empty Report Trap. One morning, the pipeline finished 'successfully' because the SQL query returned zero rows. The report was sent out blank. To fix this, I now always include hard data-quality checks (as shown in the code above, where an empty or negative summary raises dg.Failure) so that if the row count is zero or the data looks suspicious, the pipeline fails and alerts me before a stakeholder sees it.
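Dagster also has a dedicated construct for exactly this: asset checks, which run against a materialized asset and surface as pass/fail in the UI. A minimal sketch against the monthly_revenue_summary asset from Section 2:

```python
import dagster as dg
import pandas as pd

@dg.asset_check(asset="monthly_revenue_summary")
def revenue_is_nonempty(monthly_revenue_summary: pd.DataFrame) -> dg.AssetCheckResult:
    # Blocks the "successful but blank" report: the check fails visibly
    # in the Dagster UI when the summary comes back with zero rows
    return dg.AssetCheckResult(
        passed=not monthly_revenue_summary.empty,
        metadata={"row_count": len(monthly_revenue_summary)},
    )
```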
Gotchas and 2026 Realities
- Timezone Hell: Never use local time in your database or your scheduler. Always store in UTC and convert only at the final visualization layer. I've seen 'duplicate' data in reports caused by Daylight Saving Time transitions more times than I care to admit.
- Schema Drift: Your upstream database team will change a column name without telling you. Use a tool like Pydantic or Dagster's built-in type checks to validate the dataframe schema before it hits your report (see the sketch just after this list).
- Compute Locality: Don't pull 50GB of data to your local machine to process it. DuckDB pushes filters and column selections down into the Parquet scan (predicate and projection pushdown), so it only downloads the row groups and columns you actually need from S3.
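For the schema-drift check referenced above, here is a minimal sketch using Pydantic v2, mirroring the columns produced by the query in Section 2 (`validate_schema` is a hypothetical helper you'd call before writing the gold-layer file):

```python
import json
from datetime import datetime

import pandas as pd
from pydantic import BaseModel

# Expected shape of the gold-layer summary (mirrors the Section 2 query)
class RevenueRow(BaseModel):
    report_month: datetime
    category: str
    total_revenue: float
    customer_count: int

def validate_schema(df: pd.DataFrame) -> None:
    """Fail fast if an upstream rename or retype slipped in silently."""
    # Round-trip through JSON to get plain Python types for validation
    records = json.loads(df.head(100).to_json(orient="records", date_format="iso"))
    for record in records:
        RevenueRow.model_validate(record)  # raises ValidationError on drift
```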
Takeaway
Stop writing standalone Python scripts for reports. Today, pick one manual report you currently generate, define it as a Dagster Asset, and move the compute to DuckDB. By treating your report as a versioned data asset rather than a task, you gain observability, reusability, and most importantly, your Sunday mornings back.