How the smartNIC architecture supports scalable infrastructure
In recent years, networking software has overshadowed its hardware counterparts. Although routers, switches and chipsets have not seen the most exciting innovations lately, progress is being made.
One such change is happening with the network interface card (NIC). A traditional network adapter sends and receives packets between a network and a server. It might not be fancy, but it gets the job done. Recent innovations, however, add programmability, memory, compute and other capabilities to network cards to make the hardware more capable. The result is a smartNIC.
Cloud providers have implemented the smartNIC architecture to support their increasingly distributed cloud environments and reduce stress on their servers. As they expand their infrastructures, many are using smartNICs to provide connectivity, add storage capacity, and perform other functions, such as processing and telemetry.
While enterprises don’t need the same scale as cloud providers, they can still benefit from the smartNIC architecture. One of the clearest advantages for enterprises is telemetry, according to Silvano Gai, author of Building a Future-Proof Cloud Infrastructure, from Pearson.
“Telemetry is very important because, in enterprises, there is a lot of finger-pointing,” Gai said. “Without hard data, it’s really hard to diagnose.” SmartNICs can integrate telemetry and extract valuable network data that helps enterprises gain granular visibility to diagnose and resolve network problems.
In his book, Gai explains how distributed service platforms have evolved and explores the different components required to support this infrastructure.
Editor’s Note: The following interview has been edited for length and clarity.
How are distributed service platforms evolving in the enterprise?
Silvano Gai: If you look at the classic architecture of an enterprise data center, it is clearly a distributed architecture. But, above all, it is still a siloed architecture. You have server and network silos for HR, engineering or whatever. There is no real concept of multi-tenancy. Multi-tenancy is essentially achieved by creating several silos in the network.
The public cloud is completely different. It’s multi-tenant by definition, with multi-tenancy built in from day one. When you build multi-tenancy into your network from day one, you don’t need silos. You can put different users on the same server but in different [virtual] machines or containers, and they can share its resources. Then you can have a policy to secure those resources.
Normally, people talk about scaling the compute. One processor is not enough, so you put in 10, 100 or thousands to scale out the compute. But the cloud is also scaling the services. They said, “The only way for us to survive in this multi-tenant environment is that, every time we install a new server, we also scale the services related to that new server. We are not only scaling the compute, but also the services.” And, when I say services, I mean classic firewall, load balancer, encryption, things like that.
When you look at enterprises, they don’t do that. Enterprises have all of these silos, and they put appliances in place – like Palo Alto firewalls, an F5 load balancer, Cisco, Juniper, Arista, whatever – to basically keep the silos separate. It is a much less scalable architecture. It also means the network behaves strangely because of an effect called traffic tromboning, in which traffic crosses the network several times to reach the appliance and bounce back, so the path is not really optimized.
Now, how did the cloud scale the services? Well, they tried to do it in software, and it didn’t really work. They said, “We need a footprint where we can run the services.” This footprint was essentially identified at the border between the server and the network. With this device – people call it smartNIC, DPU [data processing unit], EPU [energy processing unit], and there are more names than products – you not only provide connectivity for the network and possibly storage, but you also provide the implementation of services. And, if you already support [the services], then you also get the performance.
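As a rough illustration of the detour Gai describes, in which traffic crosses the network to reach a central appliance and then bounces back (often called traffic tromboning), the following Python sketch compares link counts for the two designs. The topology and hop counts are hypothetical and chosen only for illustration; they are not measurements from the interview.

```python
# Hypothetical topology: the hop counts below are illustrative, not measured.

def hops_via_appliance(src_to_appliance: int, appliance_to_dst: int) -> int:
    """Traffic detours to a central appliance, then continues to its destination."""
    return src_to_appliance + appliance_to_dst


def hops_with_smartnic(src_to_dst: int) -> int:
    """Services run on the smartNIC at the server edge, so traffic takes the direct path."""
    return src_to_dst


# Assumption: two servers sit under the same top-of-rack switch (2 links apart),
# and the firewall appliance is 3 links away from each of them.
direct = hops_with_smartnic(2)
tromboned = hops_via_appliance(3, 3)
extra_hops = tromboned - direct

print(f"direct={direct} tromboned={tromboned} extra={extra_hops}")
```

Under these assumed numbers, every flow that must visit the appliance crosses four more links than one serviced at the server edge, and the gap grows with the appliance's distance from the servers.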
Can you tell us more about these smartNIC developments?
Gai: I associate the transition from NIC to smartNIC with the fact that companies that build smartNICs have started to integrate processing inside the smartNIC. And, 99% of [the time], it is in the form of an Arm core [processor]. Of course, to run a processor, you also need dedicated memory.
So, the cost of a smartNIC, due to the processor and memory, is very different from the cost of a NIC, usually two to three times higher. But, by inserting the processor, now you can write software and implement services, and then you have the performance. Basically, you do everything in the Arm processors – everything is software. This is part of the pros and cons; it’s easy to program. But the performance you get is not that good, and neither are the latency and jitter.
There are other approaches. Other companies have tried, for example, to use an FPGA – a field-programmable gate array – and to program it. That also has advantages and disadvantages. FPGAs are power hungry, and the density is very low because you have to have all this programmable logic and so on. The results have been mixed.
Other companies, like Pensando, have tried to adopt a P4 architecture. P4 is a programmable way to write the data path. So, you use P4 for the smartNIC data path and use an Arm core for the control path and the management path. There are combinations of these techniques. Intel, with the acquisition of Barefoot, is probably also working on, or has announced, a P4 smartNIC. But basically the transition from a NIC to a smartNIC happens when you add the programmability dimension.
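The split between a programmable data path and a separate control path can be illustrated with a toy match-action table. This is a Python sketch, not P4 code; the class name, field names and action strings are hypothetical and do not come from any vendor's product.

```python
from dataclasses import dataclass, field


@dataclass
class MatchActionTable:
    """Toy exact-match table standing in for one P4-style data path stage."""

    entries: dict = field(default_factory=dict)
    default_action: str = "drop"

    def install(self, dst_ip: str, action: str) -> None:
        # Control path: runs rarely, for example on the Arm cores.
        self.entries[dst_ip] = action

    def process(self, packet: dict) -> str:
        # Data path: one table lookup per packet and nothing else.
        return self.entries.get(packet["dst_ip"], self.default_action)


table = MatchActionTable()
table.install("10.0.0.5", "forward:port1")

print(table.process({"dst_ip": "10.0.0.5"}))
print(table.process({"dst_ip": "10.0.0.9"}))
```

The point of the design is that the per-packet work stays small and predictable, while the slower, more flexible logic that decides what goes into the table runs elsewhere.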
Is there an enterprise use case for smartNICs?
Gai: The market is clearly dominated by cloud providers. But, in the enterprise, there is a strong desire to build private clouds that emulate what cloud providers have done in the public cloud. So, [smartNICs] are entering the enterprise, with plenty of opportunities to pick the low-hanging fruit.
Believe it or not, the biggest low-hanging fruit is telemetry, which measures what is happening on the network. After that comes what is called a network tap, where you implement the ability to observe what is happening everywhere in a distributed fashion. And, of course, the enterprise is more price-sensitive than the cloud.
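To make the telemetry point concrete, here is a minimal Python sketch of how per-flow counters exported by each smartNIC could be combined to locate a problem with hard data. The record format, server names and numbers are invented for illustration; real smartNICs export telemetry in their own formats.

```python
from collections import defaultdict

# Hypothetical per-flow records, as each smartNIC might export them.
records = [
    {"nic": "server-a", "flow": "web->db", "packets": 1000, "drops": 0},
    {"nic": "server-b", "flow": "web->db", "packets": 1000, "drops": 40},
    {"nic": "server-b", "flow": "app->cache", "packets": 500, "drops": 0},
]


def drop_rate_by_nic(records):
    """Sum packets and drops per NIC and return each NIC's drop rate."""
    totals = defaultdict(lambda: [0, 0])  # nic -> [packets, drops]
    for r in records:
        totals[r["nic"]][0] += r["packets"]
        totals[r["nic"]][1] += r["drops"]
    # Assumes every NIC reported at least one packet.
    return {nic: drops / packets for nic, (packets, drops) in totals.items()}


rates = drop_rate_by_nic(records)
worst = max(rates, key=rates.get)
print(worst, round(rates[worst], 4))
```

Because every server edge reports its own counters, the comparison points at a specific NIC rather than leaving teams to argue about where the loss occurs.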
Who manages the smartNIC?
Gai: There are basically two distinct modes. The original mode, which I think won’t survive, is one in which the operating system of the server where the smartNIC is installed manages it. The reason it won’t survive, in my opinion, is that, if the operating system is compromised, the smartNIC is compromised, and all of your security depends on the operating system.
Most smartNICs now have an external interface, either a gRPC interface or a REST API, and they can be managed over the network. They essentially present a PCIe [Peripheral Component Interconnect Express] firewall to the operating system, so the operating system cannot compromise them. If you manage to implement this successfully, the firewall on the smartNIC will not be compromised even if the operating system is, so you have the potential to contain an attack.
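The second management mode can be sketched as a simple rule: policy changes are accepted only when they arrive over the network management channel, never from the host. The Python model below is a conceptual illustration under that assumption; the class, the channel labels and the rule strings are hypothetical and do not represent any real smartNIC API.

```python
class SmartNicPolicy:
    """Toy model: firewall rules change only through the network management channel."""

    def __init__(self):
        self._rules = []

    def update(self, rule: str, channel: str) -> bool:
        # "network" stands in for a gRPC or REST management interface;
        # "host" stands in for requests arriving from the server's OS over PCIe.
        if channel != "network":
            return False  # host-originated changes are refused
        self._rules.append(rule)
        return True

    def rules(self):
        # Return a copy so callers cannot modify the stored policy.
        return list(self._rules)


nic = SmartNicPolicy()
accepted = nic.update("deny tcp any any eq 23", channel="network")
rejected = nic.update("permit ip any any", channel="host")
print(accepted, rejected, nic.rules())
```

In this model, a compromised operating system can still send requests, but none of them alter the policy enforced on the card, which is what allows an attack to be contained.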
Do you think most network teams are open to this change?
Gai: Everyone is conceptually open. From a pragmatic point of view, it is a little more difficult. It gets difficult between the network team, the security team and the server team. In the enterprise, the security team has long relied on an appliance over which it has 100% control. However, this appliance is extremely expensive: for a modern enterprise, an appliance can cost $1 million.
The smartNIC solution costs much less than that. But, on the other hand, it implies that the security team now has to control these smaller form factors, which come in much larger quantities. And it has to do so in some sort of coordination with the network team. Before, when installing an appliance, the network team would give the security team a bunch of IP addresses on different subnets, VLANs [virtual LANs] or VXLANs [virtual extensible LANs], but that was the extent of the coordination.
Now you have to coordinate the management. So I think the resistance is mostly organizational. People are realizing it’s going to come. But the fact that this is going to happen does not immediately imply that people are eager to implement it.