
Lightweight is a metric, not an adjective

Ali Ben
Member of Technical Staff

To collect metrics and logs from our users' servers, we built simob, an open-source monitoring agent.

We like to say it's lightweight. Everyone calls their software lightweight. But here's the thing: the more powerful our systems get, the more software finds ways to use those resources.

No one seems to care about resource usage anymore. We do. And we measured it.

What does “lightweight” even mean?

A program has two kinds of footprint. The static footprint is what the binary costs the system before it even runs: think disk size, bundled libraries, and runtime dependencies.

Dynamic footprint is what it consumes while running: CPU, memory, I/O, and so on.

simob's static footprint is minimal by design. It ships as a single standalone binary (under 10 MB) with no external dependencies. There’s not much to analyze there.

So this post focuses on the dynamic footprint. We’ll look at three resources: CPU, memory and disk. We’ll ignore network usage. For a monitoring agent, most network traffic is just the data it’s sending upstream, and there’s little room to optimize that without changing what gets collected.

CPU

For CPU, the question is simple: how much time does the agent actually spend running?

CPU time is the total duration during which the processor is executing a program’s instructions. From that, you can derive an average CPU utilization over a given period. This includes both the time spent running the agent’s own code and the time spent executing kernel code on its behalf.

Because simob spends most of its life idle (sleeping while it waits for the next polling interval), spot measurements don’t make much sense. You can easily catch the process at 0% CPU and learn nothing useful.

Instead, we measure total CPU time over the entire lifetime of the process and relate it back to wall-clock time. That approach reflects what we actually care about: the real, amortized cost of running the agent on a server.
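
For example, an agent that has accumulated three minutes of CPU time after a week of uptime has used 180 of 604 800 wall-clock seconds, or about 0.03% of a single core on average.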

Memory

Measuring memory footprint is trickier. Unlike CPU, we don't have a way to express memory usage over the process lifetime with a single number. Instead, we look at metrics like the mean, max, or 95th percentile across the process lifetime.

A key challenge is attributing memory accurately. Some memory is shared across multiple processes, like libraries loaded in common. There are a few ways to account for this:

  • Resident Set Size (RSS) measures all the physical RAM a process touches, including shared pages.
  • Proportional Set Size (PSS) splits the cost of shared pages among all processes using them.
  • Unique Set Size (USS) counts only memory unique to the process (the exact amount that would be freed if the process exited).
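
As a concrete example: if a process maps a 10 MB shared library that four other processes also use, and allocates 5 MB of private heap, its RSS is 15 MB, its PSS is 5 + 10/5 = 7 MB, and its USS is just the 5 MB of heap.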

Disk

simob doesn’t write much to disk. The main thing is a spool directory, where metrics and logs sit before being forwarded to the remote store. If the network is down, the directory can grow, but it generally clears quickly once connectivity is restored. Even if it lingers for hours, the footprint is small enough to not be a concern.

The agent also performs other I/O, mostly by tailing log files it’s monitoring. So we still need to measure it.

There are two ways to track disk usage: count bytes read/written or count I/O requests. Bytes are the more meaningful metric, especially on SSDs.

As with CPU, we take the totals at the start and end of the process and use the difference as the lifetime footprint. This gives a realistic picture of the agent’s disk impact over time.

The truth is in /proc

CPU

On Linux, you can get CPU usage from /proc/[pid]/stat. Specifically, look at utime (time spent in userspace) and stime (time spent in the kernel on behalf of the process). These are monotonically increasing counters that track how long the process has actually run.

These values are stored in clock ticks (or jiffies). On most systems, one tick is 1/100 of a second, but it’s safer to query _SC_CLK_TCK at runtime.

def get_cpu_ticks(pid):
  # utime and stime are fields 14 and 15 of /proc/[pid]/stat. The command
  # name (field 2) is wrapped in parentheses and may contain spaces, so we
  # split after the closing parenthesis before indexing.
  with open(f"/proc/{pid}/stat", "r") as stat:
    fields = stat.read().rsplit(")", 1)[1].split()
  utime = int(fields[11])  # time spent in userspace, in clock ticks
  stime = int(fields[12])  # time spent in the kernel, in clock ticks
  return utime + stime
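
Relating those ticks back to wall-clock time gives the average utilization. A minimal sketch, reusing get_cpu_ticks from above and assuming you record the process start timestamp yourself:

import os
import time

def get_cpu_utilization(pid, start_timestamp):
  # Average share of a single core used since the process started, in percent.
  cpu_seconds = get_cpu_ticks(pid) / os.sysconf("SC_CLK_TCK")  # ticks -> seconds
  elapsed_seconds = time.time() - start_timestamp              # wall-clock lifetime
  return 100 * cpu_seconds / elapsed_seconds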

Fun Fact: /proc/[pid]/stat is a "ghost" file. It doesn't exist on your hard drive; instead, the kernel generates the data instantly the moment you try to read it. It pulls the timing stats directly from memory, formats them into text, and then discards them.

Memory

For memory, we look at /proc/[pid]/smaps, or rather its pre-summed variant /proc/[pid]/smaps_rollup, specifically the fields Private_Clean and Private_Dirty. These represent the memory unique to the process. Adding them together gives the Unique Set Size (USS), the portion of RAM that would actually be freed if the process exited. This is the most meaningful number for understanding a process’s real memory footprint, since it ignores shared pages like common libraries.

import re

def get_mem_uss(pid):
  with open(f"/proc/{pid}/smaps_rollup", "r") as smaps:
    # Sum the Private_Clean and Private_Dirty lines (values are in kB).
    matches = re.findall(
      r"^(?:Private_Clean|Private_Dirty):\s+(\d+)\s+kB",
      smaps.read(), re.MULTILINE
    )
  return sum(int(m) for m in matches)

Disk

For disk I/O, Linux exposes /proc/[pid]/io. The fields we care about are read_bytes and write_bytes, which track the total number of bytes the process has read from and written to disk.

def get_disk_bytes(pid):
  # Each line of /proc/[pid]/io looks like "read_bytes: 12345".
  with open(f"/proc/{pid}/io", "r") as io:
    counters = dict(
      line.split(": ") for line in io.read().splitlines() if line
    )
  return int(counters["read_bytes"]), int(counters["write_bytes"])

Just like CPU, we can read these counters at the start and end of the process to calculate the total disk footprint over its lifetime. This gives a realistic picture of how much I/O the agent actually generates.
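
In code, that bookkeeping is just a subtraction around the agent’s run. A sketch reusing get_disk_bytes from above (the pid and the timing of the two samples are illustrative):

# First sample, shortly after the agent starts:
start_read, start_write = get_disk_bytes(pid)
# ... the agent runs for hours, days, or weeks ...
# Second sample, just before it exits:
end_read, end_write = get_disk_bytes(pid)
lifetime_read = end_read - start_read     # total bytes read from disk
lifetime_write = end_write - start_write  # total bytes written to disk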

Two ways to measure

There are two ways to use these measurements.

First, since /proc/[pid] files exist for as long as the process runs, and our agent is long-lived, we can collect real historical usage data from production deployments. CPU and disk stats are counters, so we can compute averages over days or even weeks. Memory isn’t a counter, so we only get snapshots, but even those give a useful sense of how the agent behaves across our own fleet and different configurations.

The second approach is integrating these measurements into CI. Our dry run in CI runs the agent for about 20 seconds with a 3-second collection interval. It’s obviously not representative of real-world usage, but it works as a relative checkpoint. We can compare versions and flag any significant regressions before release.

Real numbers from production

We’ll look at lifetime stats from four different Linux servers running the agent.

Server    Uptime (hours)    CPU util. (%)    Memory (current, kB)    Disk read (avg, B/s)    Disk write (avg, B/s)
1         166               0.3              15 884                  41                      57
2         919               0.18             11 904                  13                      302
3         318               0.01             9 508                   20                      68
4         318               0.02             10 476                  74                      137

The results are pretty good. The CPU footprint is tiny: between 0.01% and 0.3% of a single core. Variance mostly comes from configuration differences. The highest usage appears on systems running busy web servers, where a flood of logs increases CPU time for processing.

Memory is just a snapshot, and unsurprisingly, it stays low. simob doesn’t hold much in memory.

Disk read and write rates are also extremely low when averaged per second. In short, the agent is light on resources, even under real-world conditions.

What about CI?

Live measurements in CI are a bit trickier: we have to sample while the program is running, because /proc/[pid] disappears as soon as it exits.

For CPU and disk, we take a first sample near program start and a second sample near the end, then compute the delta. That gives us the total usage over the short run.

Memory is sampled at a high rate, letting us report statistics like the mean, max, and 95th percentile.
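
A minimal sketch of that sampling loop, reusing get_mem_uss from above (the duration, interval, and percentile math are illustrative, and handling for a process that exits mid-loop is omitted):

import time

def sample_memory(pid, duration=20, interval=0.1):
  # Poll USS while the agent runs, then summarize the samples.
  samples = []
  deadline = time.time() + duration
  while time.time() < deadline:
    samples.append(get_mem_uss(pid))
    time.sleep(interval)
  samples.sort()
  p95 = samples[int(0.95 * (len(samples) - 1))]  # simple nearest-rank percentile
  return min(samples), p95, max(samples)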

Of course this is not very precise. But the goal isn’t microbenchmarks and we’re not trying to break records. We just want a quantitative way to track how the agent behaves and ensure progress toward an even lighter footprint.

To validate the measurements, we built some simple stress tests:

  • A bash script to saturate one CPU core
  • A Python script that loads 100 MB into memory (sketched below)
  • A bash script that downloads a 100 MB file and writes it to disk
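
For instance, the memory stress test needs little more than an allocation it actually touches. A minimal sketch (the hold time is arbitrary):

import time

# Allocate ~100 MB and write every byte so the pages are actually resident.
payload = b"x" * (100 * 1024 * 1024)

# Hold the allocation long enough for the sampler to observe it.
time.sleep(30)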

For all three tests, measured values were close to expectations, with minor differences due to overhead. The results give confidence our measurement approach works.

Adding these measurements to our existing GitHub Actions workflow is simple, as the script is pure Python and only uses the standard library.

We run it three times to catch any variance between runs. Disk I/O is ignored in CI, since the short test inflates reads and writes at startup (config files, setup, etc.). CPU and memory metrics, however, remain reliable for spotting regressions.

Run    CPU utilization (%)    Memory (p95, kB)
1      0.25                   12 000
2      0.25                   13 336
3      0.24                   11 212

Our three CI runs show very little variance, which is great. The measurements are repeatable.

CPU and memory numbers also line up closely with our long-term fleet stats. CPU appears slightly higher in CI because we use a 3-second collection interval instead of 60 seconds in production. Adjust for that, and it matches the fleet numbers.

Keeping ourselves honest

Is this methodology perfect? No. There's plenty of room to debate our approach: how we sample memory, whether CI runs are representative, what we're not measuring. But it works. It's repeatable, it runs in CI, and it gives us real production data across our fleet.

More importantly, it keeps us honest. When we say simob is lightweight, we can point to actual numbers: 0.01–0.3% CPU, minimal memory, negligible disk I/O. And when we ship new features, we'll know immediately if they bloat the footprint.

The measurement script is open source, just like simob. If you're running your own monitoring agent and wondering what it actually costs, now you know how to find out.


Simple Observability is a platform that provides full visibility into your servers. It collects logs and metrics through a lightweight agent, supports job and cron monitoring, and exposes everything through a single web interface with centrally managed configuration.

To get started, visit simpleobservability.com.

The agent is open source and available on GitHub.
