Datadog starts at $23 per host per month. New Relic’s full-stack observability runs $46 per host per month. If you are running three servers for an indie SaaS project generating $2,000 a month in revenue, you are looking at spending 3–7% of gross on monitoring before you have paid for a single line of application infrastructure. For teams operating at enterprise scale with dozens of engineers and hundreds of hosts, these prices are defensible. For everyone else — solo operators, small product teams, bootstrapped founders — they represent an architectural tax on projects that haven’t earned it yet. The open-source monitoring stack built around Prometheus and Grafana gives small teams observability that is genuinely comparable to those commercial offerings, at the cost of a few hours of setup and roughly 350MB of additional RAM per server.

What the Pricing Reality Actually Looks Like

The Datadog and New Relic pricing pages are written to make comparisons difficult. The headline numbers — $23 and $46 per host per month respectively — represent base tiers that exclude features you will actually need: custom metrics, extended retention, APM traces, log indexing. A realistic Datadog bill for a small team with three hosts, custom application metrics, log ingestion, and APM comes out closer to $180–250 per month. New Relic moved to a consumption-based pricing model, which sounds more flexible but tends to cost more once real usage kicks in.

None of this means these tools are bad. Datadog in particular has genuinely excellent UX, strong alerting infrastructure, and integrations that would take weeks to replicate. The question is whether those advantages are worth $200+ per month to a project that might be generating modest revenue or still in pre-revenue. For most small teams, the honest answer is no — not because the tools aren’t good, but because the open-source alternatives are good enough that the gap doesn’t justify the cost.

The monitoring stack described here — Prometheus, Grafana, Alertmanager, and Uptime Kuma — costs nothing in software licensing and requires a server you almost certainly already own.

The Stack and What Each Component Does

These four tools have distinct, non-overlapping responsibilities. Understanding what each one actually does prevents the confusion that comes from treating the stack as a monolith.

Prometheus is a time-series database and metrics collection engine. It works by scraping HTTP endpoints that expose metrics in a text-based format, storing those metrics locally, and providing a query language (PromQL) for analysis. Prometheus is pull-based — it calls out to your services on a schedule, rather than services pushing data into it. This architecture means you always know exactly what Prometheus is collecting and how frequently. Resource footprint: approximately 200MB RAM for a typical small deployment, with disk usage proportional to retention period and metric cardinality.
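The text-based exposition format Prometheus scrapes is simple enough to read by eye. A fragment of what Node Exporter serves looks like this (the values are illustrative):

```
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 187042.31
node_cpu_seconds_total{cpu="0",mode="user"} 12345.67
```

Each line is a metric name, a set of labels, and a value; every unique name-plus-labels combination becomes one time series in Prometheus.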

Grafana is a visualization and dashboarding layer. It connects to Prometheus (and many other data sources) and renders the data as panels, graphs, tables, and stat cards. Grafana is where you spend most of your time — building dashboards, investigating anomalies, setting up alert rules. Resource footprint: approximately 150MB RAM. Grafana does not store metrics; it queries Prometheus and renders the results in real time.

Alertmanager handles the notification routing for alerts defined in Prometheus. When Prometheus detects a condition that crosses a threshold — high CPU sustained for five minutes, disk at 90%, HTTP error rate above 1% — it sends the alert to Alertmanager, which then routes it to the appropriate destination: Slack, PagerDuty, email, or a webhook. Alertmanager also handles deduplication, grouping, and silencing, which are the features that make the difference between a useful alerting system and a noise machine.
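A minimal alertmanager.yml sketch illustrates the routing, grouping, and severity-based escalation described above. The webhook URL and PagerDuty key are placeholders, and the receiver names are arbitrary:

```yaml
route:
  receiver: 'slack-default'
  group_by: ['alertname', 'instance']   # one notification per alert/host pair
  group_wait: 30s                       # wait briefly to batch related alerts
  repeat_interval: 4h                   # re-notify unresolved alerts every 4h
  routes:
    - matchers:
        - severity="critical"           # only critical alerts page someone
      receiver: 'pagerduty-oncall'

receivers:
  - name: 'slack-default'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'   # placeholder webhook
        channel: '#alerts'
  - name: 'pagerduty-oncall'
    pagerduty_configs:
      - routing_key: '<your-pagerduty-key>'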

Uptime Kuma is a self-hosted status page and uptime monitor. It checks your endpoints on a schedule (every 60 seconds by default), tracks response times and status codes, and generates a public status page you can share with users. It handles the “is the site up from the outside?” question that internal Prometheus metrics don’t answer, and it provides the kind of incident status page that communicates reliably with your users during outages.
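Uptime Kuma is typically run from its official Docker image; a minimal Compose sketch, using the project’s documented image tag and default port:

```yaml
services:
  uptime-kuma:
    image: louislam/uptime-kuma:1
    restart: always
    ports:
      - "3001:3001"          # web UI and status pages
    volumes:
      - uptime-kuma-data:/app/data   # monitor config and history

volumes:
  uptime-kuma-data:
```

Run it on a host in a different network location than the servers it watches, or the external-reachability check loses its value.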

Node Exporter: The Foundation of Host Metrics

Before Prometheus can collect host-level metrics — CPU, memory, disk, network — those metrics need to be exposed in a format Prometheus understands. Node Exporter does this. It runs as a lightweight daemon on each host you want to monitor, exposing a /metrics endpoint on port 9100 that Prometheus scrapes on whatever interval you configure.

Installation is straightforward. Download the latest release from the Prometheus GitHub repository, extract it, and run it as a systemd service:

# Download Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.0/node_exporter-1.8.0.linux-amd64.tar.gz
tar xvf node_exporter-1.8.0.linux-amd64.tar.gz
sudo cp node_exporter-1.8.0.linux-amd64/node_exporter /usr/local/bin/

# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo useradd -rs /bin/false node_exporter
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

With Node Exporter running, configure Prometheus to scrape it by adding a job to your prometheus.yml:

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100', '10.0.0.2:9100', '10.0.0.3:9100']
    scrape_interval: 30s

Node Exporter exposes over 800 metrics by default. Most of them you will never look at. The ones that matter for day-to-day operation are a much smaller set, which is why having opinionated dashboards matters more than having exhaustive metric collection.

The Metrics That Actually Matter

Operating a useful monitoring setup requires choosing what to care about. Monitoring everything creates the same problem as monitoring nothing: when everything is on a dashboard, nothing is important. The metrics worth tracking for a typical web application server fall into six categories.

CPU utilization matters less than most people think at low to moderate levels, and matters enormously as it approaches saturation. The metric to watch is not instantaneous CPU percentage but sustained utilization — specifically, CPU usage above 80% for more than five consecutive minutes is worth an alert. Instantaneous spikes are normal and expected; sustained high CPU indicates a structural problem.
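As a sketch, sustained CPU utilization can be derived in PromQL from Node Exporter’s per-mode counters by inverting the idle fraction:

```promql
# Average CPU utilization per host over the last 5 minutes, as a percentage.
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

Because it averages over a five-minute window, this expression already smooths out the instantaneous spikes that shouldn’t alert.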

Memory pressure is more nuanced than the percentage-used figure suggests. Available memory includes cached memory that the kernel will reclaim under pressure, so a server showing 85% memory used is often perfectly healthy. The metrics that indicate genuine memory pressure are available memory trending toward zero, active swap usage increasing, and OOM kills in the kernel log. Track node_memory_MemAvailable_bytes rather than used percentage for a more accurate picture.
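In PromQL, that means watching the available-bytes gauge directly rather than a derived used-percentage:

```promql
# Available memory in MB per host — the number that actually predicts trouble
node_memory_MemAvailable_bytes / 1024 / 1024

# Alert-worthy condition: less than 500MB genuinely available
node_memory_MemAvailable_bytes < 500 * 1024 * 1024
```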

Disk usage and I/O split into two distinct concerns. Disk space exhaustion is a hard outage — when the root partition fills, most services fail in interesting and hard-to-debug ways. Set an alert at 85% full with enough runway to act before reaching 100%. Disk I/O saturation shows up as elevated node_disk_io_time_seconds_total and increased read/write latencies; on database servers especially, this is often the first signal that something has changed in query patterns or data volume.
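The I/O side of this can be watched with a single rate expression over the counter mentioned above:

```promql
# Fraction of wall-clock time each disk spent busy over the last 5 minutes (0–1).
# Values approaching 1 indicate I/O saturation.
rate(node_disk_io_time_seconds_total[5m])
```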

Network traffic baseline tracking is primarily useful for anomaly detection. Knowing your server normally transmits 500GB per month means a spike to 2TB in a week is immediately visible as unusual. Track both ingress and egress; unexpected egress is often the first observable sign of data exfiltration or a misconfigured service sending traffic somewhere it shouldn’t.

HTTP response times and error rates require application-level metrics beyond what Node Exporter provides. If you are running Nginx, the nginx-prometheus-exporter exposes request rates, error rates, and active connections. For application servers, the Prometheus client libraries for Go, Python, Node.js, and most other languages let you instrument your application directly and expose custom metrics on a /metrics endpoint. The two metrics that define service health are p95 and p99 response latency, and HTTP 5xx rate as a percentage of total requests.
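Wiring an exporter like nginx-prometheus-exporter into the setup is just another scrape job; a sketch assuming the exporter’s default port of 9113:

```yaml
scrape_configs:
  - job_name: 'nginx'
    static_configs:
      - targets: ['localhost:9113']   # nginx-prometheus-exporter default port
    scrape_interval: 30s
```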

Uptime and external reachability is where Uptime Kuma fills the gap that internal metrics leave. A process can be running, consuming CPU normally, and returning 200 status codes to internal health checks while being unreachable from the public internet due to a firewall rule change, DNS propagation issue, or upstream network problem. External monitoring from a different network location is not optional if your product relies on uptime.

Alert Fatigue: Setting Thresholds That Don’t Cry Wolf

The failure mode that destroys monitoring programs is not missed alerts — it is alert exhaustion from too many alerts that don’t require immediate action. When engineers start ignoring pages because most of them turn out to be non-critical, the genuinely important alerts get lost in the noise. This happens quickly and is harder to reverse than people expect.

The operating principle for small team alerting is that every alert that fires should require someone to look at it within a reasonable time window, and should have a known remediation path. If an alert fires regularly and the response is always “looks fine, nothing to do,” it is a bad alert that should be removed or have its threshold adjusted.

Metric               Warning threshold      Critical threshold        Evaluation window
CPU utilization      > 80% for 5m           > 95% for 5m              5 minutes
Memory available     < 500MB                < 200MB                   2 minutes
Disk usage           > 80%                  > 90%                     10 minutes
HTTP 5xx rate        > 1% of requests       > 5% of requests          3 minutes
HTTP p99 latency     > 2s                   > 5s                      5 minutes
External uptime      n/a                    2 consecutive failures    2 minutes

The evaluation window is as important as the threshold value. An alert that fires on a single data point will fire constantly on transient spikes that resolve themselves. Using for: 5m in Prometheus alerting rules means the condition must be consistently true for five minutes before the alert fires, which eliminates most false positives at the cost of slightly delayed notification on genuine sustained problems.

A minimal but effective Prometheus alerting rule for disk space:

groups:
  - name: host_alerts
    rules:
      - alert: DiskSpaceWarning
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Root filesystem has only {{ $value | printf \"%.1f\" }}% free"

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100 > 1
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "High 5xx error rate on {{ $labels.instance }}"
          description: "Error rate is {{ $value | printf \"%.2f\" }}%"

Grafana Dashboards: Building Useful, Not Pretty

The temptation with Grafana is to build dashboards that look comprehensive. Seventeen panels covering every available metric, multiple rows, custom color schemes. These dashboards are exhausting to use during an incident, which is exactly when you need your dashboard to be fast and unambiguous.

The production dashboards that get used consistently are small and opinionated. A host overview dashboard should answer one question per panel: Is CPU okay? Is memory okay? Is disk okay? Are errors elevated? Is response time normal? Five panels covering those five questions is more useful than twenty panels covering every metric Node Exporter exposes.

The Grafana dashboard repository at grafana.com/grafana/dashboards has pre-built Node Exporter dashboards (ID 1860 is the most widely used) that are good starting points. The value of using a community dashboard is that the PromQL queries have been tested and refined by thousands of users. The cost is that they tend to be more comprehensive than necessary — plan to delete panels that don’t correspond to things you actually care about.

For application-specific dashboards, build around your SLOs (Service Level Objectives) if you have defined them, or around the user-facing behaviors you care about most. Request volume, error rate, and p95 response time — the “RED method” — provide a starting point for any service dashboard. Adding panels for your specific application concerns (queue depth, cache hit rate, active sessions) depends entirely on what your application does.

Grafana Cloud Free Tier: A Legitimate Middle Ground

Running Prometheus on the same server you are monitoring creates a monitoring availability problem: if the server goes down, your monitoring goes down with it. Grafana Cloud’s free tier offers a reasonable solution to this without requiring a dedicated monitoring server.

The free tier provides 10,000 active metric series, 50GB of log storage, and 14-day retention — enough for monitoring two or three servers with reasonable metric cardinality. The architecture uses a local Prometheus instance for scraping and rule evaluation, with remote write configured to ship metrics to Grafana Cloud for storage and visualization:

remote_write:
  - url: https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push
    basic_auth:
      username: <your-instance-id>
      password: <your-api-key>

This hybrid approach keeps the local Prometheus instance handling scraping and alerting, while dashboards and long-term storage live in Grafana Cloud. When your server is down, you can still view recent metrics and investigate the timeline of events that led to the outage — which is precisely when that visibility matters most.

The 10,000 series limit sounds generous until you understand metric cardinality. Every unique combination of metric name and label values is a separate series. If you expose per-URL-path metrics on a service with hundreds of endpoints, you will exhaust 10,000 series quickly. Keep high-cardinality labels (user IDs, request IDs, specific URLs) out of your Prometheus metrics and use Loki for log-level data instead.
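When you suspect you are burning series, Prometheus can report its own worst offenders; this query ranks metric names by how many series each one contributes:

```promql
# Top 10 metric names by series count — the usual cardinality offenders
topk(10, count by (__name__)({__name__=~".+"}))
```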

Loki for Logs: The Missing Piece

Prometheus handles numerical metrics. Logs are a different data type — unstructured text that needs to be searched and correlated with metric anomalies. Loki is Grafana’s log aggregation system, designed to complement Prometheus with a similar label-based data model.

The key architectural decision with Loki is that it does not index log content — only log labels (source, service name, host, level). This makes Loki much cheaper to run than Elasticsearch-based stacks, at the cost of full-text search performance on high-volume log streams. For small teams, the tradeoff is almost always worth it: Loki on a modest server handles the log volumes of a typical small SaaS without consuming significant resources.

Promtail is the agent that ships logs from your servers to Loki. It watches log files (or Docker container logs via the Docker socket) and forwards entries with labels attached. The typical configuration watches system journal and application log files:

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          host: my-server
          __path__: /var/log/*.log

  - job_name: nginx
    static_configs:
      - targets:
          - localhost
        labels:
          job: nginx
          host: my-server
          __path__: /var/log/nginx/access.log

With both Prometheus and Loki feeding into Grafana, you can correlate a CPU spike visible in metrics with the specific requests visible in logs at the same timestamp — which is what transforms monitoring from a notification system into an actual debugging tool.
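That correlation step is done in LogQL, Loki’s query language. A sketch, assuming the nginx job label from the Promtail config above and a combined-format access log where the status code is space-delimited:

```logql
# All nginx access log lines containing a 5xx status code
{job="nginx"} |~ " 5[0-9]{2} "

# Rate of those lines over 5-minute windows, for graphing next to the metrics
sum(rate({job="nginx"} |~ " 5[0-9]{2} " [5m]))
```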

When to Upgrade to Paid Monitoring

The open-source stack described here has real limitations, and knowing where those limits are prevents you from spending time fighting your tools when the correct decision is to pay for a better one.

The signals that indicate you should seriously evaluate paid monitoring are specific and observable:

  • Your team has grown to the point where multiple people need reliable alerting, and your self-hosted Alertmanager is causing on-call coordination problems rather than solving them.
  • You are running more than 10–15 hosts with complex service interdependencies, and the operational overhead of maintaining the monitoring stack has become a measurable time cost.
  • You need distributed tracing across microservices. Prometheus and Loki do not solve the distributed tracing problem well; Jaeger or commercial APM tools do.
  • Compliance requirements mandate log retention policies, audit trails, or monitoring infrastructure that can demonstrate high availability to auditors.
  • Your monitoring infrastructure has gone down at the same time as your application infrastructure, and the lack of visibility made incident resolution meaningfully harder. This is the most compelling operational signal.

At the point where monitoring tool maintenance is consuming more than two to three hours per month of engineering time — upgrades, storage management, configuration drift — the economics of self-hosted monitoring start shifting. Two hours per month at a senior engineer’s time cost approaches the price of a basic Datadog subscription. That calculation changes the answer.
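The back-of-envelope math is worth making explicit. The hourly rate below is a hypothetical assumption, not a quoted figure:

```shell
# Illustrative comparison: self-hosted maintenance time vs. a small Datadog bill.
HOURS_PER_MONTH=2
ENGINEER_RATE=75          # USD/hour — assumed, adjust for your team
HOSTS=3
DATADOG_PER_HOST=23       # USD/host/month, Datadog's base infrastructure tier

MAINTENANCE_COST=$((HOURS_PER_MONTH * ENGINEER_RATE))
DATADOG_BASE=$((HOSTS * DATADOG_PER_HOST))

echo "Self-hosted maintenance: \$${MAINTENANCE_COST}/month"
echo "Datadog base (${HOSTS} hosts): \$${DATADOG_BASE}/month"
```

The base tier looks cheaper in this toy calculation; the crossover the text describes happens once you add the custom metrics, logs, and APM a real deployment needs, which push the commercial bill well past the base figure.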

The Resource Overhead Reality Check

Running the full stack — Prometheus, Grafana, Alertmanager, Loki, Promtail, and Uptime Kuma — on a dedicated monitoring server or a lightly used host requires approximately 600–800MB of RAM in typical operation. On a $6/month VPS with 1GB RAM, this is the entire machine. On a $12/month server with 2GB RAM, it leaves adequate headroom.

The more common pattern for small teams is running the monitoring stack on a server that is doing other things. A 4GB RAM application server can comfortably host the monitoring stack alongside its primary workload, provided the application services have memory limits configured and the monitoring components are treated as first-class processes rather than afterthoughts. On servers where RAM is the constraint, Prometheus’s storage configuration should explicitly set a retention period — --storage.tsdb.retention.time=15d — to prevent unbounded disk growth.
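In practice those bounds go on the Prometheus ExecStart line in its systemd unit; a sketch with example values, using the documented retention flags:

```
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=10GB
```

Whichever limit is hit first wins, so the size flag acts as a hard cap even if cardinality grows unexpectedly.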

Prometheus’s CPU footprint is low during normal operation (scraping and storing) and higher during complex PromQL queries. The practical limit for this stack on a shared server is monitoring about 20–30 hosts per Prometheus instance, at which point dedicated monitoring infrastructure becomes the better operational choice.

The Practical Path Forward

The monitoring stack described here — Prometheus for metrics, Grafana for dashboards, Alertmanager for notifications, Loki for logs, Uptime Kuma for external checks — represents a complete observability setup that would cost $500+ per month on commercial platforms. The open-source version costs your time to set up, a small amount of server RAM, and ongoing maintenance that amounts to a few hours per quarter for updates and configuration changes.

Start with Node Exporter on every host and a single Prometheus instance. Add Grafana and connect it to your Prometheus instance. Build a simple host overview dashboard covering CPU, memory, disk, and network. Configure two or three alerts that are actually actionable. Add Uptime Kuma and point it at your public endpoints. That core setup is functional in a few hours and covers 80% of the monitoring value. Layer in Loki, Alertmanager configuration, and application-specific metrics as your needs become clearer.

The teams that abandon self-hosted monitoring usually do so because they built too much at once, couldn’t maintain the complexity, and concluded that the tools were the problem. The teams that run it successfully start small, add only what they need, and treat the monitoring stack as production infrastructure deserving the same discipline as the applications it monitors — pinned versions, restart policies, and actual backup strategies for the Prometheus data directory.

Monitoring is not a category where spending more money automatically produces better outcomes. The outcome is visibility, and visibility depends on whether the alerts you configured are actually meaningful and whether the dashboards you built reflect the questions you need answered. Those are engineering decisions, not licensing decisions.

By Michael Sun

Founder and Editor-in-Chief of NovVista. Software engineer with hands-on experience in cloud infrastructure, full-stack development, and DevOps. Writes about AI tools, developer workflows, server architecture, and the practical side of technology. Based in China.
