Libre Domain-Specific Architectures

Introduction

Every couple of years I run across a presentation that impacts my way of thinking about (network) technology. This happened in 2011 when I saw Nick McKeown’s presentation at the Open Networking Summit in Stanford. His presentation was called “How SDN will shape networking” and it started a new way of designing and operating networks. Today the concepts of Software Defined Networking and network disaggregation are widely used.

A recent example is the 2017 ACM A.M. Turing lecture by John Hennessy and David Patterson. Their presentation was called “A New Golden Age for Computer Architecture”. They received the Turing award for their pioneering work on a systematic and quantitative approach to designing faster, lower power, reduced instruction set computer (RISC) microprocessors. What I liked in their lecture was how they turned a setback (the end of Moore’s law) into an opportunity (new research areas in computer science). They argue that the answer to the end of Moore’s law will be domain-specific architectures. These architectures consist of processors with a domain-specific Instruction Set Architecture (ISA), corresponding (domain-specific) programming languages and security by design. In their opinion this requires multi-disciplinary teams that understand areas such as applications, compiler technology, processor architecture and security. This opens up exciting new research areas, and that is why they predict a new golden age for computer architecture.

Domain-Specific Architectures (DSAs)

Moore’s Law and Dennard Scaling have flattened off in recent years. This means that we no longer have the luxury of computers that double in performance every two years or so, just by relying on these scaling factors. We need to look for other ways to improve the performance of our computers. Carefully analysing the complexity of our algorithms and considering the efficiency of the programming language we choose for the job at hand become important again. But many experts think that we also have to complement general purpose CPUs with silicon that is targeted at a specific application domain, and look at domain-specific architectures. We are already seeing this in the form of GPUs (when they are used to run programs instead of driving a display) and Google’s Tensor Processing Unit (TPU). These are examples of hardware and programming languages that are designed for a specific problem space.

DSAs in Networking

Domain-specific architectures have also entered the field of networking. Routers and switches use specialised network ASICs for forwarding network traffic at high speeds. These ASICs have always had functionality that was fixed during manufacturing and could not be changed in the field. Adding new protocols was only possible by designing a new ASIC, a process that could easily take five years. But in the last few years these ASICs have become field programmable. Most vendors only use this programmability internally, by updating the forwarding silicon in their switches and routers and offering a new version of their closed proprietary firmware. Their customers see the new functionality via the CLI and APIs. But other vendors give customers the ability to program the forwarding silicon themselves. Some vendors, like Arista and Cisco, do this in a limited way by letting customers add functionality to the firmware that runs on their routers and switches. This gives customers only a limited amount of programming flexibility.

Intel on the other hand gives customers full programming control over their Tofino Ethernet ASIC. This is similar to how you can run any program on an x86 CPU. The Tofino chip can be programmed in an open domain-specific language, which in this case is P4. P4 has a C-like syntax. A P4 program consists of packet header definitions that describe the fields in a header and how many bits each field has, a packet parser that dissects the (possibly user defined) protocols, and lookup tables that determine what needs to be done with the packet. The Tofino silicon supports 64x100G Ethernet ports, for a maximum bandwidth of 6.4 Tbps, and is used in programmable Ethernet switches. Intel offers a P4 SDK so that end-users can design and implement their own dataplanes on the switch.
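To make the three building blocks of a P4 program (header definitions, a parser and lookup tables) a bit more concrete, here is a minimal sketch in Python. It is not P4 and not how a Tofino works internally; it only mimics the structure in software. All names (extract, parse, ExactMatchTable, forward, drop, process) are made up for this illustration, and the sketch assumes untagged Ethernet frames and IPv4 headers without options.

```python
# Header definitions: field names and widths in bits, like a P4 header type.
ETHERNET = (("dst", 48), ("src", 48), ("ether_type", 16))
IPV4 = (
    ("version", 4), ("ihl", 4), ("dscp_ecn", 8), ("total_len", 16),
    ("identification", 16), ("flags_frag", 16),
    ("ttl", 8), ("protocol", 8), ("checksum", 16),
    ("src", 32), ("dst", 32),
)
ETHERTYPE_IPV4 = 0x0800


def extract(fields, data, offset=0):
    """Slice the bytes at `offset` into the named bit fields.

    Returns the header as a dict plus the offset of the next header.
    """
    total_bits = sum(width for _, width in fields)
    nbytes = total_bits // 8
    if len(data) < offset + nbytes:
        raise ValueError("packet too short for this header")
    value = int.from_bytes(data[offset:offset + nbytes], "big")
    header = {}
    shift = total_bits
    for name, width in fields:
        shift -= width
        header[name] = (value >> shift) & ((1 << width) - 1)
    return header, offset + nbytes


def parse(packet):
    """Parser: start at Ethernet and continue to IPv4 if the type matches."""
    headers = {}
    headers["ethernet"], offset = extract(ETHERNET, packet)
    if headers["ethernet"]["ether_type"] == ETHERTYPE_IPV4:
        headers["ipv4"], offset = extract(IPV4, packet, offset)
    return headers


# Actions that a table entry can select.
def forward(port):
    return ("forward", port)


def drop():
    return ("drop", None)


class ExactMatchTable:
    """Lookup table: an exact-match key selects an action and its parameters."""

    def __init__(self, default_action):
        self.entries = {}
        self.default_action = default_action

    def add_entry(self, key, action, *params):
        self.entries[key] = (action, params)

    def apply(self, key):
        action, params = self.entries.get(key, (self.default_action, ()))
        return action(*params)


def process(packet, table):
    """Pipeline: parse the packet, then look up the IPv4 destination."""
    headers = parse(packet)
    if "ipv4" not in headers:
        return drop()
    return table.apply(headers["ipv4"]["dst"])
```

A table created with ExactMatchTable(default_action=drop) and filled with add_entry(0x0A000002, forward, 7) would send packets for 10.0.0.2 to port 7 and drop everything else. On a programmable switch the table entries are installed at run time by the control plane, while the header definitions, parser and table layout are fixed when the P4 program is compiled.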

Another area where domain-specific architectures have entered the world of networking is using smart NICs or network accelerators in compute servers. Several names and acronyms are in use for the same concept:

Smart NIC
Domain-Specific Accelerator (DSA)
Network Accelerator
Data Centre Accelerator
Data Processing Unit (DPU)

These accelerators are mostly used for offloading tasks from the main CPU cores of the server. This has been done for a long time for things like checksum and segmentation offloading. But in recent years it has been extended to other functionality, like crypto functions, BPF offload, OVS offload, etc.
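As an illustration of the kind of work that checksum offloading takes away from the CPU, below is a small Python sketch of the 16-bit one's complement Internet checksum (as described in RFC 1071) that is used by IPv4, TCP and UDP. The function name internet_checksum is chosen for this example. A NIC computes this in hardware; the code only shows what would otherwise be done in software for every packet.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit one's complement checksum of `data` (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF
```

The sender computes the checksum with the checksum field set to zero and writes the result into that field. The receiver runs the same function over the complete header, checksum included, and accepts the packet if the result is zero.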

DSA Vendors

Below are some of the vendors in this space. Most accelerators are based either on general purpose ARM CPU cores (which are not domain-specific hardware) or on FPGAs. The exception is Netronome, which uses network processors.

Intel has several FPGA based NICs.

Netronome has the Agilio family of smart NICs that are based on Network Flow Processors (NFP) with specialised programmable cores. These NICs were P4 programmable, but recently Netronome is focussing more on BPF and OVS offload.

Pensando offers NICs that have P4 specific hardware, several ARM cores and High Bandwidth Memory (HBM). But it seems like these cards are not programmable by the end-user. Pensando sells a subscription to a cloud based manager that controls the cards, the PSM (Policy and Services Manager).

NVIDIA has the BlueField (via their Mellanox acquisition) family of Data Processing Units. These use ARM cores for programmability.

Xilinx has the Alveo Accelerator cards that are based on their high end FPGAs.

Libre DSAs

All big cloud companies are using accelerators. For example, Google is heavily using TPUs and Microsoft is using FPGA-based accelerators in Azure. However, their solutions are not publicly available. This is a pity, because in order to make rapid progress it is important that knowledge and experience are shared freely so that innovators can build on the results of others. This means using open access publications, freely (as in libre) shareable hardware designs and open source software. Wikipedia has this to say about free versus libre:

Libre: The English adjective free is commonly used in one of two meanings: “for free” (gratis) and “with little or no restriction” (libre).

Sharing code and experiences is also useful in networking. Experimenting with new protocols has always been difficult because you had to wait for the big network vendors like Cisco and Juniper to implement the protocol, and they would only do so when there was enough customer demand. That made it difficult to experiment with the latest protocols being discussed in the IETF. You were left with implementations on software switches, but these gave no assurance that the protocol would work on high speed hardware too. With the emergence of programmable switches and programming languages like P4 this has changed. P4 has become popular in academia and many new networking ideas are now implemented in P4, often in hardware.

The Golden Age

John Hennessy and David Patterson have convinced me of their vision about the new golden age in computer architecture and the importance of domain-specific architectures. For me, this includes network programmability at the edge (I agree that the core of the network should be a simple high-speed forwarding service). The edge consists of programmable switches, accelerators and network stacks (in user or kernel space) in servers. Some popular domain-specific architectures are P4, XDP/BPF, DPDK, FPGA based solutions, etc. Each has its own place in this ecosystem. I think there are still many interesting open questions in this area. To name just a few:

How can the Linux network stack integrate more closely with accelerators?
Does it make sense to offload packet parsing?
How do you make a solution work on different hardware/accelerators?
Which abstractions are needed/useful?
What does a network domain-specific Instruction Set Architecture (ISA) look like?
Do application-specific protocols make sense in a data centre?
How and where would you implement them?
What about more radical ideas, like “The Case for a Network Fast Path to the CPU”?