How we built speed tests that persist through outages
Offline jobs, firmware invariants, and the engineering behind fleet-wide ISP monitoring.
Most Meter customers have centralized IT teams managing networks across dozens of remote sites. When a WAN link degrades at a branch office, the first diagnostic an IT team runs is a speed test to measure ISP throughput, latency, and jitter from the device at the edge. It’s the most basic diagnostic in the toolkit, but existing options like fast.com are designed for one person testing one connection. In an enterprise environment, they break down in three ways.
- They’re fragile. The WAN link under test is often the same one carrying the browser session, so a mid-test disconnect loses the measurement. There’s no durable record of a partially completed test and no way to resume it.
- They’re uncoordinated. When a link is slow, multiple people run tests simultaneously, effectively DDoS-ing the network they’re trying to evaluate.
- They’re manual. There’s no way to run them on a schedule, from a central location, across a fleet of devices. ISP performance that degrades gradually is impossible to catch without continuous, automated measurement.
ISP performance is one variable in the network stack that Meter doesn't directly control, but it's one of the first things customers blame when service slows or stops. To deliver on Meter’s promise of managed visibility, we built a speed test in Dashboard that reliably supports on-demand and scheduled tests across any device in the network, even if internet service fails. We focused on three technical goals: triggering tests remotely and reliably, ensuring only one active speed test per device, and configuring schedules for recurring tests. Below, I’ll explain the architecture and the decision-making behind each.
Triggering tests remotely and reliably
Speed tests need to be simple to trigger and guaranteed to complete. The standard implementation, a synchronous request that holds a connection open for the duration of the test, satisfies the first requirement but not the second. Any interruption to the admin's Dashboard session kills the test and loses the results. For remote IT teams, this is the norm, not the edge case.
Our solution was to treat the speed test as an offline job. When a customer starts the test from Dashboard, the API creates a durable job on the backend. From that point, the test’s lifecycle is independent of the client session. It runs to completion, stores its results, and can be viewed later, whether the browser stays connected or not.
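The offline-job idea can be sketched with an in-memory database standing in for River Queue on Postgres. Everything here (table shape, function names) is hypothetical, but the property it demonstrates is the one that matters: once the job row is committed, the client can disconnect and the test still runs.

```python
import sqlite3

# Illustrative sketch only: SQLite stands in for the Postgres-backed
# River Queue, and all names here are hypothetical.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, kind TEXT, state TEXT)")

def start_speed_test(device_id: str) -> int:
    """API handler: commit a durable job, then return immediately.
    The client session can drop; the job row survives."""
    cur = db.execute(
        "INSERT INTO jobs (kind, state) VALUES (?, ?)",
        (f"speed_test:{device_id}", "available"),
    )
    db.commit()
    return cur.lastrowid

def run_next_job():
    """Worker: claim and complete the oldest available job."""
    row = db.execute(
        "SELECT id FROM jobs WHERE state = 'available' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    db.execute("UPDATE jobs SET state = 'completed' WHERE id = ?", (row[0],))
    db.commit()
    return row[0]

job_id = start_speed_test("device-123")  # the client may disconnect now
assert run_next_job() == job_id          # the worker still completes the test
```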
The architecture that supports this system has six key components.
- Dashboard: the user interface.
- API: the backend that orchestrates jobs, polls for new data, and encapsulates the database backing it.
- Device: the Meter F-series hardware, which runs firmware logic to execute speed tests against Cloudflare.
- Job system: River Queue, which schedules and manages job execution both on-demand and on a schedule.
- ClickHouse: the system of record for completed test metrics. ClickHouse handles the large volume of time-series data sent by devices in the field, which requires analytical, time-series queries.
- Speed test job: the driver that coordinates communication between the device and storage.
Here’s the flow:
- An admin starts a test in Dashboard. The request goes to the API (1).
- The API queries the device to check whether a test is already running (2).
- If the device reports no active test, the API uses InsertTx to create a new speed test job in River Queue (3). This is the durability boundary: once the job is committed to the queue, the test will run to completion independent of the client. If the device reports an active test, the API redirects to the existing job instead.
- Either way, Dashboard is pointed to a live view (4).
- To synchronize between the device and the backend, the speed test job first instructs the device to begin (5).
- It then enters a polling loop (6).
- There, it repeatedly queries the device for the intermediate statistics it has received from Cloudflare (7) and writes each snapshot—including current download speed, upload speed, and jitter—to the API database. The API database serves as the intermediate state store; it holds the latest-known statistics for in-progress tests.
Completion is detected on a separate path.
- The device writes its final, complete set of statistics to ClickHouse (9).
- The speed test job polls ClickHouse (8) to detect when that write has landed. This two-database design is deliberate: the API database provides fast reads for live-updating clients during the test, while ClickHouse serves as the durable system of record optimized for the analytical queries that historical speed test data requires.
- Finally, Dashboard polls the API (10) for the current state of the job and renders a live speed chart from the intermediate snapshots. Because every client reads from the same API endpoint, any number of admins can watch the same test without adding load to the device or the WAN link.
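The driver loop above can be sketched as follows. The device, API database, and ClickHouse are in-memory stand-ins and every name is illustrative; the step numbers in the comments refer to the flow above.

```python
# Illustrative sketch of the speed test job's driver loop. All classes
# and names here are hypothetical stand-ins for the real components.
class FakeDevice:
    def __init__(self, samples, clickhouse):
        self._samples = list(samples)
        self._clickhouse = clickhouse

    def start_test(self):
        pass  # (5): firmware kicks off the test against Cloudflare

    def poll_stats(self):
        # (7): return the next intermediate snapshot, if any
        if not self._samples:
            return None
        snapshot = self._samples.pop(0)
        if not self._samples:
            # Last snapshot: the device publishes its final statistics,
            # which land in ClickHouse (9) (via Kafka in the real system).
            self._clickhouse.append(snapshot)
        return snapshot

def run_speed_test_job(device, api_db, clickhouse):
    device.start_test()                  # (5)
    while True:                          # polling loop (6)
        snapshot = device.poll_stats()   # (7)
        if snapshot is not None:
            api_db["latest"] = snapshot  # intermediate state for live clients
        if clickhouse:                   # (8): the final write has landed
            return clickhouse[-1]

clickhouse, api_db = [], {}
device = FakeDevice([{"down_mbps": 480.0}, {"down_mbps": 512.0}], clickhouse)
final = run_speed_test_job(device, api_db, clickhouse)
assert final == {"down_mbps": 512.0} and api_db["latest"] == final
```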
Note: Devices don’t write directly to ClickHouse. They publish messages to our Kafka instance, and stream processors handle writing the statistics to ClickHouse. This has been simplified in the diagram since it’s out of scope for this post. We’ll cover the Kafka pipeline in a future post.
Ensuring only one active speed test per device
Speed tests work by saturating the link with payloads from 100 KB to 25 MB. Running a second concurrent test on the same device produces two results that reflect the contention between tests, not actual ISP performance. During an outage, when multiple admins independently hit run on the same device within seconds, this is the default outcome unless it's prevented at the system level.
To solve this, we redirect subsequent requests to the in-progress test rather than starting new ones. To stream live results to these concurrent viewers, we have clients poll our API instead of using WebSockets, chiefly because polling is a simpler way for them to get the same answer. The speed test job already polls the device on a regular cadence to collect intermediate results, which is how data enters the system in the first place. WebSockets would have introduced a second real-time channel, with its own connection management and failure modes, for a marginal improvement in update latency on a test that runs for at most five minutes.
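The redirect behavior reduces to a small check-then-join step. This is a hypothetical sketch (the real check queries the device itself for an active test), but it shows the invariant: two admins clicking run on the same device get the same job.

```python
# Hypothetical sketch of redirect-on-duplicate: a second "run" request
# joins the in-progress test instead of starting a new one.
active_tests: dict = {}  # device_id -> job_id (stand-in for device state)
next_job_id = 0

def start_or_join(device_id: str):
    """Return (job_id, started); started is False when we redirected."""
    global next_job_id
    if device_id in active_tests:              # device reports an active test (2)
        return active_tests[device_id], False  # redirect to the existing job
    next_job_id += 1
    active_tests[device_id] = next_job_id      # create a new job (3)
    return next_job_id, True

job_a, started_a = start_or_join("device-1")  # first admin starts the test
job_b, started_b = start_or_join("device-1")  # second admin joins it
assert started_a and not started_b and job_a == job_b
```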
Configuring schedules for recurring tests
On-demand tests solve active incidents, but ISP performance also degrades gradually, which requires continuous measurement. To monitor this, customers need scheduled speed tests that run unattended, build a historical baseline, and surface degradation before it becomes an outage.
Addressing this seems straightforward until fleet scale introduces a coordination constraint. Many customers schedule tests at midnight to minimize impact on daytime traffic. At hundreds of devices, this produces a burst of simultaneous large-payload requests to Cloudflare from a distributed set of IPs, which makes it difficult for Cloudflare to distinguish these tests from a distributed attack. If devices get flagged by bot detection, scheduled tests fail silently and customers lose the historical data they depend on.
To solve this, we gave customers the ability to configure automation rules in Dashboard: what to test (a network or specific port), a time window, and a frequency (daily or weekly). Once the customer defines a set of rules in Dashboard, a new offline cron job then spawns all the jobs in the scheduled range.
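A rule and the scheduler's hourly match might look like the following sketch; the field names are assumptions for illustration, not our actual schema.

```python
from dataclasses import dataclass

# Hypothetical shape of an automation rule and the scheduler's hourly match.
@dataclass
class AutomationRule:
    target: str        # a network or a specific port
    window_start: int  # hour of day, 0-23
    window_end: int    # exclusive end hour
    frequency: str     # "daily" or "weekly"
    weekday: int = 0   # consulted only for weekly rules

def due_this_hour(rule: AutomationRule, hour: int, weekday: int) -> bool:
    """Should this rule's test be spawned in the current hour?"""
    if rule.frequency == "weekly" and weekday != rule.weekday:
        return False
    return rule.window_start <= hour < rule.window_end

rule = AutomationRule(target="port-4", window_start=0, window_end=4,
                      frequency="daily")
assert due_this_hour(rule, hour=1, weekday=3)       # inside the window
assert not due_this_hour(rule, hour=9, weekday=3)   # outside the window
```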
The scheduling architecture extends the on-demand flow with two new components: a cron-triggered scheduler and a Redis-backed rate limiter.
The flow looks like this:
- A cron job fires every 15 minutes. (1)
- It then spawns a recurring speed test job in River Queue (2). This job, the scheduler, queries the automation rules from the API database, determines which tests are scheduled to run in the current hour, and runs internal validations to ensure each target device is in a healthy state.
- Before enqueuing each test, the scheduler checks a token bucket rate limiter (3) to determine whether there is capacity to start a new speed test.
- We implemented the rate limiter in Redis (4), leveraging its fast, atomic operations and using key TTLs to efficiently clean up expired tokens.
- If capacity is available, the scheduler enqueues a speed test job (5) (6).
- The execution path from there is shared with the on-demand architecture: the job communicates with the device (7) and polls for intermediate results.
- Completion data is then written to ClickHouse (8) (9), and the job finishes. If no capacity is available, the job waits until a token opens up or the job times out, whichever comes first.
The result is that fleet-wide test traffic stays below a rate that would trigger Cloudflare’s bot detection, while individual customers still get every configured test completed within their expected window.
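The token bucket the scheduler consults at step (3) can be sketched in-process. The production limiter lives in Redis (atomic operations, key TTLs for cleanup) so it is shared across scheduler instances, but the acquire logic has the same shape.

```python
import time

# In-process sketch of the token bucket; the real limiter is in Redis.
# Capacity and refill rate here are made-up illustrative values.
class TokenBucket:
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        """Take one token if available; otherwise report no capacity."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=2, refill_per_sec=0.1)
results = [bucket.try_acquire() for _ in range(3)]
assert results == [True, True, False]  # the third test waits for a token
```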
A device can only run one test at a time, but it may have multiple WAN uplinks that each need testing within the same automation window. This means the scheduler must queue speed test jobs for devices that are already running a test. We implemented per-device contention logic: any job started on a device blocks other jobs for that device from executing until the first completes. Tests run serially—port 4 finishes, port 5 begins—and nothing is dropped. Every configured port gets tested within the automation window.
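Per-device serialization reduces to one queue per device; this sketch uses illustrative names and omits the real locking and job-state machinery.

```python
from collections import defaultdict, deque

# Sketch of per-device contention: tests for the same device run serially,
# while different devices proceed independently. Names are illustrative.
queues = defaultdict(deque)

def enqueue(device_id: str, port: str):
    queues[device_id].append(port)

def run_next(device_id: str):
    """Run (and pop) the next queued test for this device, if any."""
    if queues[device_id]:
        return queues[device_id].popleft()
    return None  # device idle: nothing queued

enqueue("device-1", "port-4")
enqueue("device-1", "port-5")            # blocked until port-4's test completes
assert run_next("device-1") == "port-4"  # port 4 finishes first
assert run_next("device-1") == "port-5"  # then port 5; nothing is dropped
```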
What’s next
The infrastructure behind speed tests generalizes well beyond this feature. Offline jobs in River Queue, firmware-enforced invariants, a split between the API database for intermediate state and ClickHouse as the system of record, and a stateless polling-based view layer in Dashboard give us a reusable pattern for any diagnostic that executes on the device and reports results asynchronously.
We're already building real-time packet route visualization and proactive anomaly detection across customer networks. If this kind of work sounds interesting to you, join us.
