Sunday at 3 AM Is When the Pipeline Decides to Teach You a Lesson
I have been paged on more Sundays than I care to count, and the root cause is almost never a hardware failure or a code bug. It is the data pipeline. Specifically, it is an ETL job that worked perfectly Monday through Friday and then quietly broke because the data it expected on Sunday looks nothing like the data it gets the rest of the week.
Production ETL pipelines fail on Sundays (and holidays, and month-end boundaries, and daylight saving transitions) because they are built on implicit assumptions about data that are only true most of the time. The most dangerous assumption is that the future will look like the recent past. Data pipelines are where that assumption goes to die.
The Seven Ways Pipelines Break on Weekends
1. Volume Drops Trigger False Alarms (or Mask Real Ones)
Most monitoring for data pipelines includes row count checks: if today’s batch has significantly fewer rows than yesterday’s, fire an alert. On Sundays, transaction volumes legitimately drop 40-70% for many B2B applications. The alert fires every Sunday. After a few weeks, the on-call engineer silences it. Then one Sunday, the pipeline actually breaks and nobody notices because the alert was already muted.
```python
# The naive approach: a static threshold
def check_row_count(table, expected_min=10000):
    count = db.query(f"SELECT COUNT(*) FROM {table} WHERE date = CURRENT_DATE")
    if count < expected_min:
        alert(f"Low row count in {table}: {count} < {expected_min}")

# The better approach: day-of-week-aware thresholds
from datetime import datetime

def check_row_count_smart(table):
    current_day = datetime.now().strftime("%A")
    # Calculate the expected range from the same day of week over the past 4 weeks
    stats = db.query("""
        SELECT
            AVG(row_count) AS avg_count,
            STDDEV(row_count) AS std_count
        FROM pipeline_metrics
        WHERE table_name = %s
          AND day_of_week = %s
          AND run_date > CURRENT_DATE - INTERVAL '28 days'
    """, [table, current_day])
    lower_bound = max(0, stats.avg_count - 2 * stats.std_count)
    upper_bound = stats.avg_count + 2 * stats.std_count
    count = get_today_count(table)
    if count < lower_bound or count > upper_bound:
        alert(
            f"Anomalous row count in {table}: {count}. "
            f"Expected {lower_bound:.0f}-{upper_bound:.0f} for {current_day}."
        )
```
2. Source System Maintenance Windows
Many upstream data sources schedule maintenance on weekends. Your pipeline tries to connect at its usual time, gets a connection refused or a timeout, and either fails loudly or — worse — succeeds with an empty or partial dataset.
```python
# A robust extraction step should handle maintenance windows
import httpx
import tenacity

@tenacity.retry(
    stop=tenacity.stop_after_attempt(5),
    wait=tenacity.wait_exponential(multiplier=1, min=60, max=3600),
    retry=tenacity.retry_if_exception_type((httpx.ConnectError, httpx.ReadTimeout)),
    before_sleep=lambda retry_state: logger.warning(
        "source_unavailable",
        attempt=retry_state.attempt_number,
        wait=retry_state.next_action.sleep,
        source="billing-api",
    ),
)
async def extract_billing_data(date: str):
    # Use the client as a context manager so the connection is always closed
    async with httpx.AsyncClient() as client:
        response = await client.get(
            f"https://billing-api.internal/export?date={date}",
            timeout=300,
        )
    if response.status_code == 503:
        raise httpx.ConnectError("Source in maintenance mode")
    data = response.json()
    # Critical: validate that we got a complete dataset
    if data.get("is_partial"):
        raise PartialDataError(
            f"Source returned partial data for {date}. "
            f"Expected ~{data.get('expected_count', 'unknown')} records, "
            f"got {len(data['records'])}. Likely a maintenance window."
        )
    return data
```
3. Timezone and Daylight Saving Transitions
Twice a year, the clocks change. And twice a year, ETL pipelines that assume 24 hours in a day break. A pipeline that processes data in daily batches from midnight to midnight will either process 23 or 25 hours of data during a DST transition, leading to undercounting or duplicate processing.
```python
# Bad: naive date arithmetic
from datetime import datetime, timedelta

def get_date_range(date):
    start = datetime(date.year, date.month, date.day)
    end = start + timedelta(days=1)  # NOT always 24 hours of wall-clock time!
    return start, end

# Good: timezone-aware date boundaries
from zoneinfo import ZoneInfo

def get_date_range_safe(date, tz_name="America/New_York"):
    tz = ZoneInfo(tz_name)
    start = datetime(date.year, date.month, date.day, tzinfo=tz)
    end = start + timedelta(days=1)
    # During a DST transition, the local "day" legitimately covers
    #   23 hours of real time (spring forward) or
    #   25 hours of real time (fall back),
    # and these boundaries capture that correctly.
    # Convert to UTC for database queries
    return start.astimezone(ZoneInfo("UTC")), end.astimezone(ZoneInfo("UTC"))

# Even better: store and process everything in UTC
# and only convert to local time at the presentation layer
```
4. Late-Arriving Data
Weekend data often arrives late. Batch systems that process “yesterday’s data” at 6 AM assume the data is complete by then. On weekdays, this is usually true. On Sundays, data from Saturday night might not land until noon because the source system’s batch export runs on a weekend schedule.
```python
# Watermark-based completeness check
import asyncio
from datetime import datetime, timedelta

async def wait_for_data_completeness(
    source: str,
    date: str,
    expected_watermark: str = "23:59",
    timeout_hours: int = 6,
    check_interval_minutes: int = 15,
):
    """Wait until the source data includes events up to the expected watermark."""
    start_time = datetime.now()
    timeout = timedelta(hours=timeout_hours)
    latest_event = None
    while datetime.now() - start_time < timeout:
        latest_event = await db.fetchval(
            "SELECT MAX(event_timestamp) FROM raw_events "
            "WHERE source = $1 AND event_date = $2",
            source, date,
        )
        if latest_event and latest_event.strftime("%H:%M") >= expected_watermark:
            logger.info("data_complete", source=source, date=date,
                        latest_event=str(latest_event))
            return True
        logger.info("waiting_for_data", source=source, date=date,
                    latest_event=str(latest_event), expected=expected_watermark)
        await asyncio.sleep(check_interval_minutes * 60)
    raise DataIncompleteError(
        f"Data for {source}/{date} not complete after {timeout_hours} hours. "
        f"Latest event: {latest_event}. Expected watermark: {expected_watermark}."
    )
```
5. Schema Changes That Land on Fridays
Someone deploys a schema change on Friday afternoon (against all common sense). The application is fine because it handles the new schema. But the ETL pipeline, which reads from a replica or an event stream, does not know about the new column or the renamed field. It runs fine Friday night and Saturday because the new field is nullable and the pipeline ignores extra fields. On Sunday, the first record with a required new field arrives, and the pipeline chokes.
```python
# Defensive schema handling in the transformation step
import pandas as pd

def transform_orders(raw_df: pd.DataFrame) -> pd.DataFrame:
    # Define the expected schema with defaults
    expected_columns = {
        "order_id": {"type": "str", "required": True},
        "user_id": {"type": "str", "required": True},
        "amount": {"type": "float", "required": True},
        "currency": {"type": "str", "default": "USD"},
        "discount_code": {"type": "str", "default": None},
        "fulfillment_type": {"type": "str", "default": "standard"},
    }

    # Check for missing required columns
    missing_required = [
        col for col, spec in expected_columns.items()
        if spec.get("required") and col not in raw_df.columns
    ]
    if missing_required:
        raise SchemaError(
            f"Missing required columns: {missing_required}. "
            f"Available columns: {list(raw_df.columns)}. "
            f"This may indicate an upstream schema change."
        )

    # Add missing optional columns with defaults
    for col, spec in expected_columns.items():
        if col not in raw_df.columns and "default" in spec:
            raw_df[col] = spec["default"]
            logger.warning("schema_drift", column=col, action="added_default",
                           default=spec["default"])

    # Warn about unexpected new columns
    known_columns = set(expected_columns.keys())
    unknown_columns = set(raw_df.columns) - known_columns
    if unknown_columns:
        logger.warning("schema_drift", new_columns=list(unknown_columns),
                       action="ignored")

    return raw_df[list(expected_columns.keys())]
```
6. Resource Contention from Backup Jobs
Many organizations schedule database backups, index rebuilds, and VACUUM operations on weekends. These maintenance tasks compete for I/O and CPU with ETL pipelines that run on the same schedule. The pipeline that takes 20 minutes on a Tuesday takes 3 hours on a Sunday because the database is also running a full backup.
```python
# Check for competing operations before running heavy queries
async def check_database_load(max_active_queries: int = 20):
    active = await db.fetchval("""
        SELECT COUNT(*) FROM pg_stat_activity
        WHERE state = 'active'
          AND query NOT LIKE '%pg_stat_activity%'
          AND backend_type = 'client backend'
    """)
    if active > max_active_queries:
        logger.warning("high_database_load", active_queries=active,
                       threshold=max_active_queries)
        return False
    return True

# Also check for long-running maintenance operations
async def check_for_maintenance():
    maintenance = await db.fetch("""
        SELECT pid, query, state, now() - query_start AS duration
        FROM pg_stat_activity
        WHERE pid <> pg_backend_pid()  -- exclude this query itself
          AND (query ILIKE '%vacuum%'
               OR query ILIKE '%reindex%'
               OR query ILIKE '%pg_dump%')
        ORDER BY duration DESC
    """)
    if maintenance:
        logger.info("maintenance_detected",
                    operations=[{
                        "pid": m["pid"],
                        "type": m["query"][:50],
                        "duration": str(m["duration"]),
                    } for m in maintenance])
    return maintenance
```
7. The “No Data Is Valid Data” Problem
On some Sundays, there legitimately is no data. A B2B SaaS product might have zero transactions on a Sunday. The pipeline needs to distinguish between “no data because it is Sunday” and “no data because the extraction broke.” This is harder than it sounds.
```python
# Track extraction metadata to distinguish empty from broken
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ExtractionResult:
    source: str
    date: str
    record_count: int
    extraction_started: datetime
    extraction_completed: datetime
    source_responded: bool
    source_reported_complete: bool

    @property
    def is_empty_but_valid(self) -> bool:
        return (
            self.record_count == 0
            and self.source_responded
            and self.source_reported_complete
        )

    @property
    def is_suspicious_empty(self) -> bool:
        return (
            self.record_count == 0
            and (not self.source_responded or not self.source_reported_complete)
        )

async def extract_and_validate(source: str, date: str):
    result = await extract_data(source, date)
    if result.is_suspicious_empty:
        alert(
            f"Extraction from {source} for {date} returned 0 records "
            f"but the source did not confirm completeness. "
            f"Source responded: {result.source_responded}. "
            f"Investigate before marking this run as successful."
        )
    elif result.is_empty_but_valid:
        logger.info("empty_extraction_valid", source=source, date=date,
                    note="Source confirmed no data for this period")
```
Building Resilient Pipelines: The Checklist
- Use day-of-week-aware thresholds for volume checks and anomaly detection.
- Handle maintenance windows with retries and exponential backoff.
- Store and process timestamps in UTC. Convert to local time only at the presentation layer.
- Implement watermark-based completeness checks instead of assuming data is ready at a fixed time.
- Validate schemas defensively with known-column checking and default handling.
- Monitor resource contention and avoid running heavy ETL during backup windows.
- Distinguish between valid empty data and extraction failures using source metadata.
- Make pipelines idempotent. You should be able to rerun any pipeline for any date without duplicating data.
- Test with weekend data. Your CI pipeline should include test fixtures that simulate Sunday volumes, DST transitions, and late-arriving data.
- Keep a pipeline calendar that maps known irregular events (holidays, DST, month-end, quarter-end) to expected pipeline behavior changes.
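The idempotency point from the checklist deserves a sketch. One minimal pattern, shown here with a hypothetical `daily_orders` table and SQLite standing in for the warehouse, is delete-then-insert keyed on the run date inside a single transaction, so a rerun replaces a partial load instead of duplicating it:

```python
import sqlite3

def load_daily_batch(conn: sqlite3.Connection, run_date: str, rows: list[tuple]):
    """Idempotent load: rerunning for the same date never duplicates data."""
    with conn:  # one transaction: the delete and insert commit together or not at all
        conn.execute("DELETE FROM daily_orders WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO daily_orders (run_date, order_id, amount) VALUES (?, ?, ?)",
            [(run_date, oid, amt) for oid, amt in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_orders (run_date TEXT, order_id TEXT, amount REAL)")
batch = [("o1", 10.0), ("o2", 25.5)]
load_daily_batch(conn, "2024-06-02", batch)
load_daily_batch(conn, "2024-06-02", batch)  # rerun: still exactly 2 rows
count = conn.execute("SELECT COUNT(*) FROM daily_orders").fetchone()[0]
```

The same shape works with partition overwrites or MERGE statements in a real warehouse; the invariant is that the unit of replacement matches the unit of rerun.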
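Likewise, the weekend-testing point can be made concrete. A minimal sketch of a DST fixture helper (the function name is mine; the dates are the real 2024 US transitions) that a test suite could assert against:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

def hours_in_local_day(date_str: str, tz_name: str = "America/New_York") -> float:
    """Real elapsed hours between local midnight and the next local midnight."""
    tz = ZoneInfo(tz_name)
    d = datetime.strptime(date_str, "%Y-%m-%d")
    start = datetime(d.year, d.month, d.day, tzinfo=tz)
    end = start + timedelta(days=1)
    utc = ZoneInfo("UTC")
    # Convert both boundaries to UTC so the subtraction measures real elapsed time
    return (end.astimezone(utc) - start.astimezone(utc)).total_seconds() / 3600

assert hours_in_local_day("2024-03-10") == 23.0  # spring forward (US, 2024)
assert hours_in_local_day("2024-11-03") == 25.0  # fall back (US, 2024)
assert hours_in_local_day("2024-06-02") == 24.0  # an ordinary Sunday
```

Fixtures like this, plus synthetic Sunday-volume datasets, catch the 23- and 25-hour days in CI instead of in production.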
The Operational Maturity Model
| Level | Characteristics | Sunday Behavior |
|---|---|---|
| 1: Ad hoc | No monitoring, manual fixes | Nobody knows it broke until Monday |
| 2: Reactive | Basic alerts, manual remediation | On-call gets paged, reruns manually |
| 3: Defensive | Retries, validation, day-aware thresholds | Most issues self-heal, real failures page |
| 4: Proactive | Anomaly detection, completeness checks, auto-remediation | Pipeline adjusts to weekend patterns automatically |
Most teams are at level 2. Getting to level 3 requires about a week of focused engineering work. Getting to level 4 requires ongoing investment, but it pays for itself in reduced on-call burden and improved data quality.
Data pipelines break on Sundays because engineers build them on Tuesdays. The data on Tuesday is clean, complete, and arrives on time. Sunday data is messy, partial, late, and sometimes entirely absent. If your pipeline cannot handle Sunday gracefully, it is not a production pipeline — it is a Tuesday pipeline that happens to run every day. Build for Sunday, and every other day takes care of itself.
