In a Cloud-Native World, Resiliency Equals Confidence

May 1, 2023

If there’s a common theme for Communication Service Providers (CSPs) migrating to cloud-native networks, it’s “more.” More disparate, distributed components of traditionally monolithic network elements. More vendors providing cloud-native network functions (CNFs), and updating them more often. More hyperscalers jockeying to convince CSPs to deploy on their clouds. And ultimately in this complex and dynamic cloud world, more things that can go wrong. Which makes lab and pre-production testing even more important. 

It’s not just that CNFs themselves are new. It’s that the way they interact with each other and the cloud they’re deployed on is so radically different that CSPs and vendors are still figuring out what to test for. In many cases, what a given CNF actually needs from the underlying cloud infrastructure is still an open question. Yet, if you’re part of a CSP organization aiming to deliver high-quality service experiences, especially under service-level agreements (SLAs), it’s a question you need to answer.

How can you make sure the 5G CNF you’re deploying will work as expected when pushed into production—in a cloud environment you don’t own, that behaves in ways you can’t predict? How can you ensure that cloud will provide the right performance attributes—networking, storage, memory, latency—to meet your SLAs? Just as important, how can you predict what will happen if it doesn’t provide the needed performance, and what you’ll need to do to respond? 

The answer to all these questions is the same: thorough resiliency testing of every CNF. You may not know exactly how the cloud you’re deploying on will perform on a given day. But by understanding exactly what each CNF needs from the cloud and how each CNF is vulnerable, you can avoid many problems before they arise. 
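
One way to make "what each CNF needs from the cloud" concrete is to write the needs down as a profile and compare it against what the lab actually measures. The sketch below is only an illustration: the attribute names, the bounds in `REQUIRED`, and the helper `unmet_requirements` are all hypothetical and do not come from any CNF vendor or cloud provider.

```python
from typing import Dict, Optional, Tuple

# Hypothetical requirement profile for one CNF (illustrative numbers only).
# "max" means the observed value must not exceed the bound;
# "min" means the observed value must not fall below it.
REQUIRED: Dict[str, Tuple[str, float]] = {
    "pod_to_pod_latency_ms": ("max", 5.0),
    "storage_write_latency_ms": ("max", 2.0),
    "memory_gib": ("min", 16.0),
    "network_gbps": ("min", 10.0),
}


def unmet_requirements(
    required: Dict[str, Tuple[str, float]],
    observed: Dict[str, Optional[float]],
) -> Dict[str, str]:
    """Return a reason string for every attribute the cloud fails to meet."""
    gaps: Dict[str, str] = {}
    for attr, (kind, bound) in required.items():
        value = observed.get(attr)
        if value is None:
            # An unmeasured attribute is treated as a gap, not as a pass.
            gaps[attr] = "not measured"
        elif kind == "max" and value > bound:
            gaps[attr] = f"{value} exceeds max {bound}"
        elif kind == "min" and value < bound:
            gaps[attr] = f"{value} below min {bound}"
    return gaps


if __name__ == "__main__":
    # Hypothetical lab measurements; network throughput was never measured.
    observed = {
        "pod_to_pod_latency_ms": 3.2,
        "storage_write_latency_ms": 4.5,
        "memory_gib": 32.0,
    }
    for attr, reason in unmet_requirements(REQUIRED, observed).items():
        print(f"{attr}: {reason}")
```

Treating an unmeasured attribute as a gap is a deliberate choice in this sketch: an open question about what the cloud provides is reported rather than silently passed.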

A Brave New World of Testing

The basic goals of pre-production testing haven’t changed. You need to understand how each CNF behaves, in isolation and in tandem with other CNFs and the cloud, to help you avoid unplanned issues and assure reliable, high-performing services. The way you pursue those goals, however, is vastly different from how it was done in yesterday’s networks.

In the past, testing core network functions involved deploying a monolithic physical appliance, or networking software bundled with vendor-supplied virtual machines, into a lab environment. Everything the network function needed was supplied integrated and turnkey. And while the production network operated at larger scales, you could assume that, like the lab, it would be a relatively stable, predictable environment.

In the cloud-native world, that assumption no longer applies. Start with the 5G core itself, which is now composed of dozens of different CNFs, potentially from different vendors, each disaggregated into scores of pods and deployed across numerous individual nodes—each with its own security, management, and performance considerations. Additionally, those pods now run on a dynamic cloud infrastructure layer, usually in a hyperscale cloud environment that’s difficult to predict or emulate.

These changes affect testing in multiple ways:

  • It’s harder to simulate reality. Some pre-production teams struggle to adapt to the realities of cloud-native operations, even as they test against them. They’ll set up a small cloud in the lab, load the CNF, and test in the same standalone way they always have. Such testing does provide insights but misses a key reality: the production cloud that the CNF runs on will not be under their control.
  • Performance is more vulnerable. A production CNF might be deployed in 100 or more different pods, each handling a different function such as application logic, database, or management. To deliver a 5G service (and maintain an SLA), all those pieces must communicate within certain latencies, often milliseconds, or they’ll time out. If any link in the chain gets delayed or disrupted, failures can quickly cascade, ultimately resulting in 5G service failures.
  • You have to expect the unexpected. When SLAs depend on disaggregated, distributed microservices interacting in just the right way, at just the right time, many things can go wrong, and something inevitably will. Whether due to a noisy neighbor, slow cloud network fabric, a software bug, or other issues, one or more links between CNF pods will break or time out. If you haven’t tested ahead of time to understand how the CNF behaves in this circumstance and how you should respond, you’ll be scrambling to do it in production.
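
To see why a delay on one link can cascade, it can help to model a request as a chain of pods in which each caller waits only so long for the next pod. The following is a minimal sketch under stated assumptions: the pod names, the per-pod work times, and the uniform 10 ms timeout are hypothetical, and real CNFs add retries, parallel calls, and queuing that this model ignores.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Hop:
    name: str          # hypothetical pod name
    work_ms: float     # time the pod spends on its own processing
    timeout_ms: float  # how long its caller waits for the full response


# Illustrative chain: ingress calls session, which calls policy, which calls database.
CHAIN: List[Hop] = [
    Hop("ingress", work_ms=2.0, timeout_ms=10.0),
    Hop("session", work_ms=3.0, timeout_ms=10.0),
    Hop("policy", work_ms=2.0, timeout_ms=10.0),
    Hop("database", work_ms=1.0, timeout_ms=10.0),
]


def timed_out_hops(
    chain: List[Hop],
    injected_delay_ms: Optional[Dict[str, float]] = None,
) -> List[str]:
    """Return the hops whose callers give up, listed in call order."""
    delays = injected_delay_ms or {}
    timed_out: List[str] = []
    downstream_wait = 0.0
    # Walk from the last pod back to the first, accumulating response time.
    for hop in reversed(chain):
        elapsed = hop.work_ms + delays.get(hop.name, 0.0) + downstream_wait
        if elapsed > hop.timeout_ms:
            timed_out.append(hop.name)
        # The caller never waits longer than this hop's timeout.
        downstream_wait = min(elapsed, hop.timeout_ms)
    return list(reversed(timed_out))


if __name__ == "__main__":
    for delay in (0.0, 3.0, 12.0):
        result = timed_out_hops(CHAIN, {"database": delay})
        print(f"database +{delay} ms -> timed out: {result}")
```

In this toy model, a 3 ms slowdown in the database goes unnoticed by every inner pod yet pushes the outermost call past its timeout, while a 12 ms slowdown makes every hop in the chain time out. That is the kind of behavior worth characterizing in the lab before it shows up in production.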

[Figure: 5G CNF Microservice Architecture]

And if you don’t proactively characterize the performance that each CNF needs from the cloud? You’ll push CNFs into production, something will go wrong, and you won’t be prepared. You’ll find yourself on an all-hands call with all your CNF vendors and your cloud team, all trying to troubleshoot the problem, only now while incurring the potentially multi-million-dollar costs of unplanned downtime.

Getting Resiliency Testing Right

Ultimately, you need to understand what every CNF needs to succeed, and identify every potential point of failure. That requires comprehensive resiliency testing—including under real-world traffic and challenges, not just ideal conditions. You need to:

  • Test with impairments. Make the test environment more realistic by injecting impairments within and between CNFs. Test for pod failures, resource contention, latency within the architecture, and other issues that commonly occur in production but typically don’t in simpler test systems.
  • Perform key failure indicator (KFI) assessments. Establish resiliency baselines for 5G CNFs in the lab, so you can identify KFIs ahead of time. Make sure to add KFI summaries to service assurance platforms, so they can be referenced in production.
  • Test for degradations, not just failures. Testing has historically focused on idealized performance and outright failure cases, such as a server or a link between data centers going down. But when your business is tied to 5G SLAs, it’s equally important to understand what happens if storage is slower or the system has more latency than expected. Identifying and mitigating such issues is much easier in the lab than in production.
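
The practices above can be combined into a simple sweep: inject an impairment at increasing levels, measure the result, and record the last level the CNF tolerates before it violates the SLA. The sketch below assumes a stand-in measurement function, `simulated_transaction_ms`, in place of a real lab run; its 8 ms base time, three storage calls, and the 20 ms SLA are hypothetical values chosen only to make the example deterministic.

```python
from typing import Callable, Dict, Iterable, Optional


def simulated_transaction_ms(storage_delay_ms: float) -> float:
    """Stand-in for a lab measurement: 8 ms of base work plus three storage calls.

    In a real test this would drive traffic through the CNF while an
    impairment tool adds the given delay to storage access.
    """
    return 8.0 + 3 * storage_delay_ms


def sweep_impairment(
    measure: Callable[[float], float],
    levels: Iterable[float],
    sla_ms: float,
) -> Dict[str, Optional[float]]:
    """Raise the impairment level until the measured latency violates the SLA."""
    tolerated: Optional[float] = None
    for level in sorted(levels):
        if measure(level) > sla_ms:
            # First level at which the SLA is broken; stop the sweep here.
            return {"tolerated": tolerated, "first_violation": level}
        tolerated = level
    # The SLA held at every level that was tested.
    return {"tolerated": tolerated, "first_violation": None}


if __name__ == "__main__":
    baseline = sweep_impairment(simulated_transaction_ms, range(0, 11), sla_ms=20.0)
    print(baseline)
```

A record like this is one possible form for a resiliency baseline: it captures how much degradation the CNF absorbs before service is affected, not just whether an outright failure breaks it.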

Unleash Cloud Value

Resiliency testing can be more complicated in cloud-native environments, but it offers richer rewards. Get it right, and you can push new CNFs into production without worrying about being surprised. You’ll have fewer issues in the live network, and you can respond more quickly when issues do arise.

Just as important, you enable a more agile and effective business. Now, you can take full advantage of cloud-native efficiencies to reduce power consumption and operating costs. You can dynamically scale network resources instead of overprovisioning, and unlock huge capital savings. And you can offer more stringent and lucrative SLAs, knowing that you’re ready to deliver—even in a dynamic, unpredictable world.

Contributed by

Spirent Communications

Country: United Kingdom