
Setting Up Network Performance Testing Infrastructure

By Victor Da Luz
prometheus grafana networking monitoring homelab performance speedtest iperf3

After deploying Prometheus and Grafana for infrastructure monitoring, I wanted to add network performance testing to track both WAN and LAN performance over time. The goal was simple: detect performance degradation, identify bottlenecks, and have historical data to troubleshoot network issues before they became problems.

What I wanted to measure

Network performance matters for different reasons depending on where you’re looking. WAN performance tells you if your internet connection is living up to what your ISP promises, and it helps identify when connectivity issues are on your side versus theirs. LAN performance reveals bottlenecks in your internal network, whether it’s routing between VLANs, switch limitations, or problems with specific network paths.

I wanted testing that was both automated and historical. Manual testing is fine for troubleshooting, but it doesn’t show you trends or catch problems that develop slowly. Historical data lets you see if performance is degrading over time, which helps with capacity planning and proactive maintenance.

Choosing the tools

For WAN testing, I evaluated a few options. OpenSpeedTest and librespeed are great for manual web-based testing, but they require custom Prometheus integration. I wanted something that integrated cleanly with my existing monitoring stack. Speedtest-exporter fit perfectly: it’s a Prometheus exporter that wraps the speedtest command-line tool, runs tests automatically, and exposes the results as metrics.

One important note about speedtest tools: the Python-based speedtest-cli library is computationally expensive and frequently bottlenecks at 300-500 Mbps on low-power devices, failing to measure true Gigabit WAN speeds accurately. For accurate high-bandwidth monitoring, you should use the official Ookla C++ binary instead of the Python script.
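If you’re not sure which one a given system is actually running, a quick check from the shell settles it. The JSON flags below are from recent versions of the official Ookla CLI; treat them as a sketch and verify against `speedtest --help` on your install.

```bash
# Check which "speedtest" you actually have installed - the Python
# speedtest-cli identifies itself differently from the official Ookla CLI.
speedtest --version

# Official Ookla CLI: run a single test and emit JSON
# (flag names current as of recent Ookla releases; verify with --help).
speedtest --accept-license --accept-gdpr --format=json
```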

For LAN testing, iperf3 is the industry standard; it’s designed exactly for this use case. I couldn’t find a reliable iperf3-exporter, so I used the node_exporter textfile collector pattern instead. This approach lets any script write Prometheus-formatted metrics to a file, and node_exporter picks them up automatically. It’s flexible and doesn’t require a dedicated exporter.

The approach

I deployed the network performance testing infrastructure as an LXC container on Proxmox, following the same Docker-in-LXC pattern I use for other services. Initially, I ran the speedtest-exporter as a Docker container that automatically ran speedtest-cli tests on a schedule, but I quickly ran into rate limits from testing too frequently. Instead of maintaining a separate speedtest instance, I changed the approach to hook into the speedtest that was already running in Home Assistant. If you go this route, be aware that the native Home Assistant Speedtest integration uses the Python speedtest-cli library, which has the same performance bottlenecks described above. To get accurate Gigabit results, you’d need a custom solution that wraps the official Ookla C++ binary, such as a shell command integration or a custom add-on, rather than the native integration.
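For reference, the initial container setup looked roughly like the sketch below. It isn’t my exact compose file; it assumes the commonly used MiguelNdeCarvalho speedtest-exporter community image, which serves metrics on port 9798 by default, so adjust the image and port for whichever exporter you run.

```yaml
# docker-compose.yml - minimal sketch of the original speedtest-exporter setup.
# Image name and port are from the MiguelNdeCarvalho community exporter;
# verify them against whichever exporter you actually deploy.
services:
  speedtest-exporter:
    image: ghcr.io/miguelndecarvalho/speedtest-exporter:latest
    container_name: speedtest-exporter
    restart: unless-stopped
    ports:
      - "9798:9798"
```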

For LAN testing, I created a script that runs iperf3 tests between configured node pairs. The script parses throughput and latency from the iperf3 output and writes metrics in Prometheus format to the node_exporter textfile collector directory. Node exporter picks up these metrics and exposes them to Prometheus, so everything flows through the same monitoring pipeline.
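A trimmed-down sketch of that script is below. The target address, file paths, and metric names are placeholders rather than my exact configuration; it assumes iperf3’s JSON output mode and jq for parsing.

```bash
#!/usr/bin/env bash
# Sketch of an iperf3 textfile-collector script. Target, paths, and metric
# names are illustrative placeholders, not the exact ones from my setup.
set -euo pipefail

TARGET="192.168.10.20"                      # hypothetical iperf3 server on another node
TEXTFILE_DIR="/var/lib/node_exporter/textfile_collector"
OUT="${TEXTFILE_DIR}/iperf3.prom"

# Run a short TCP test and capture the JSON report.
RESULT="$(iperf3 -c "${TARGET}" -t 10 -J)"

# Receiver-side throughput in bits/sec; mean TCP RTT in microseconds
# (the RTT field is present in Linux builds of iperf3).
BPS="$(echo "${RESULT}" | jq '.end.sum_received.bits_per_second')"
RTT_US="$(echo "${RESULT}" | jq '.end.streams[0].sender.mean_rtt')"

# Write to a temp file in the same directory, then rename, so node_exporter
# never reads a half-written file.
TMP="$(mktemp "${TEXTFILE_DIR}/.iperf3.tmp.XXXXXX")"
cat > "${TMP}" <<EOF
# HELP iperf3_throughput_bits_per_second TCP throughput between node pair
# TYPE iperf3_throughput_bits_per_second gauge
iperf3_throughput_bits_per_second{target="${TARGET}"} ${BPS}
# HELP iperf3_mean_rtt_microseconds Mean TCP RTT reported by iperf3
# TYPE iperf3_mean_rtt_microseconds gauge
iperf3_mean_rtt_microseconds{target="${TARGET}"} ${RTT_US}
EOF
mv "${TMP}" "${OUT}"
```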

Balancing test frequency is important. Too frequent and you’re generating unnecessary network load. Too infrequent and you miss short-term issues. One thing to consider with WAN testing is bandwidth consumption. Running speed tests every 30 minutes on a Gigabit connection can consume 500MB to 1GB per test, which adds up to 700GB to 1.5TB per month just on testing. Make sure to verify your ISP’s Fair Use Policy or data caps before maintaining frequent test schedules, or reduce the frequency to every few hours.
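For a sense of scale, a crontab along these lines keeps LAN tests frequent and WAN tests sparse. The script names are hypothetical, and the WAN entry only applies if you schedule those tests yourself rather than delegating them to Home Assistant as I did.

```
# Illustrative crontab - script names are placeholders.
# LAN iperf3 tests hourly: cheap, stays inside the local network.
0 * * * *    /usr/local/bin/iperf3-lan-test.sh
# WAN speed test every 4 hours: each run consumes real ISP bandwidth.
15 */4 * * * /usr/local/bin/wan-speedtest.sh
```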

Integration with existing monitoring

Adding the new metrics to Prometheus was straightforward. I added scrape targets for the speedtest-exporter endpoint and configured Prometheus to collect the iperf3 metrics from node_exporter. The metric relabeling feature in Prometheus lets you filter which metrics to keep, which helps focus on the performance data without cluttering your metrics database.
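The relevant prometheus.yml fragment looks roughly like this. Job names, targets, and the metric-name prefixes are assumptions carried over from the sketches above, not my exact configuration.

```yaml
# prometheus.yml fragment - sketch only; hostnames and metric prefixes are assumed.
scrape_configs:
  - job_name: speedtest
    # Many speedtest exporters run a full test per scrape, so keep this long.
    scrape_interval: 30m
    scrape_timeout: 2m
    static_configs:
      - targets: ["speedtest-exporter:9798"]

  - job_name: node-nettest
    static_configs:
      - targets: ["lxc-nettest:9100"]
    metric_relabel_configs:
      # Keep only the iperf3 metrics from this scrape and drop everything
      # else node_exporter exposes here, to avoid cluttering the database.
      - source_labels: [__name__]
        regex: "iperf3_.*"
        action: keep
```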

I created a consolidated Grafana dashboard that shows both WAN and LAN performance metrics in one place. The WAN section shows download and upload speed trends, latency and jitter over time, and test status. The LAN section shows inter-node throughput, latency between nodes, and test success rates. Having everything in one dashboard provides better context than separate views.
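The panel queries themselves are simple gauges. A few examples along these lines cover the main panels; the metric names match the placeholder sketches above, so treat them as stand-ins for whatever your exporters actually expose.

```
# WAN download speed in Mbps
speedtest_download_bits_per_second / 1e6

# LAN inter-node throughput in Mbps, one series per target
iperf3_throughput_bits_per_second / 1e6

# LAN round-trip latency in milliseconds
iperf3_mean_rtt_microseconds / 1000
```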

I also set up basic alerting rules for performance degradation. WAN alerts trigger when speeds drop significantly, latency is high, or tests fail. LAN alerts trigger when latency is unexpectedly high or throughput degrades. These alerts help catch issues early, before they impact users.
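The rules themselves are ordinary Prometheus alerting rules. The thresholds and metric names below are illustrative rather than the ones I actually run; tune them to your own baseline.

```yaml
# alert-rules.yml fragment - thresholds and metric names are illustrative only.
groups:
  - name: network-performance
    rules:
      - alert: WanDownloadDegraded
        # 500e6 = 500 Mbps; pick a threshold relative to your plan's rated speed.
        expr: speedtest_download_bits_per_second < 500e6
        for: 2h
        labels:
          severity: warning
        annotations:
          summary: "WAN download below 500 Mbps for 2 hours"
      - alert: LanThroughputDegraded
        expr: iperf3_throughput_bits_per_second < 700e6
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Inter-node throughput below 700 Mbps for {{ $labels.target }}"
```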

What the data revealed

The network performance testing infrastructure started collecting metrics immediately, and the historical data proved valuable quickly. But more interestingly, the testing revealed issues I didn’t know existed.

The performance testing showed that inter-VLAN routing was capped at around 500 Mbps, which seemed wrong for the hardware involved. Investigation revealed a QoS queue that had been applied to the wrong interface, limiting all inter-VLAN traffic. After fixing the configuration, inter-VLAN speeds improved from around 500 Mbps to over 800 Mbps.

That 800 Mbps result is still suboptimal. On standard Gigabit equipment, healthy TCP throughput should be around 940 Mbps: Ethernet framing plus IP and TCP headers consume roughly 5% of the 1 Gbps line rate, and about 940 Mbps is what iperf3 typically reports on a clean link. The 800 Mbps result suggests there’s still overhead somewhere, likely from the Linux bridge and veth pair traversal in the Proxmox LXC container, combined with the CPU context switching required to drive the network stack at that speed, rather than a true network limitation. But it’s a significant improvement over the 500 Mbps cap, and the testing gave me the data to identify and fix the configuration issue.

This is exactly why automated performance testing matters. Without it, I would have assumed the network was performing correctly. The metrics revealed the problem, provided data to troubleshoot with, and gave me a way to verify that the fix actually worked.

The textfile collector pattern

The node_exporter textfile collector pattern is worth highlighting because it’s powerful and underappreciated. Instead of writing a full Prometheus exporter for every custom metric, you can write scripts that generate Prometheus-formatted metrics and save them to a file. Node exporter reads the file and exposes the metrics, and Prometheus scrapes them like any other metric source.

This approach is simple, flexible, and perfect for metrics that come from scripts or one-off tools. You don’t need to maintain a full exporter or integrate with Prometheus client libraries. Just write your metrics in the right format and node_exporter handles the rest.
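Concretely, the whole pattern comes down to one node_exporter flag plus a file in the exposition format. The paths and metric below are examples, not prescribed names.

```bash
# Start node_exporter pointed at a directory for *.prom files.
node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile_collector

# A .prom file is just plain Prometheus exposition format, for example:
# iperf3_throughput_bits_per_second{target="192.168.10.20"} 8.1e+08
```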

Lessons learned

Consolidated dashboards are better than separate ones. Initially I considered separate dashboards for WAN and LAN metrics, but having everything in one place provides better context and makes it easier to correlate issues. When WAN performance degrades, you can immediately see if LAN performance is affected, which helps with troubleshooting.

Alert thresholds need tuning. My initial alert thresholds were too sensitive and generated false positives. I adjusted them based on actual network performance patterns, which reduced noise and made the alerts more useful. This is a normal part of setting up monitoring - you start with conservative thresholds and refine them as you understand your infrastructure better.

Scheduled testing requires careful timing. Running tests too frequently creates unnecessary network load, especially for WAN tests that use actual bandwidth. Finding the right balance between detection speed and resource usage takes some experimentation.

The peace of mind that comes from having visibility into network performance is worth the setup effort. Instead of guessing whether the network is performing correctly, I can see exactly what’s happening and identify issues before they become problems. The historical data helps with capacity planning, and the automated testing catches issues that manual testing would miss.

Network performance testing might seem like overkill for a homelab, but it’s one of those things that pays off when you need it. When something feels slow, having data beats guessing. When you’re planning upgrades, historical trends beat assumptions. And when you’re troubleshooting, automated testing beats manual checks that you might forget to run.
