The Workshop on Hot Topics in System Infrastructure (HotInfra'23) provides a unique forum for cutting-edge research on system infrastructure and platforms. Researchers and engineers can share their recent research results and experiences, and discuss new challenges and opportunities in building next-generation system infrastructures, such as AI infrastructure, software-defined data centers, and edge/cloud computing infrastructure. The topics span the full system stack, with a focus on the design and implementation of system infrastructures. Relevant topics include hardware architecture, operating systems, runtime systems, and emerging applications.
Call for Papers
The HotInfra workshop is soliciting three types of paper submissions: regular research papers, industry papers, and work-in-progress papers:
- Regular research papers may include studies that have been published in top-tier systems and architecture conferences in the past year. We encourage submissions that showcase new concepts and approaches for developing new and emerging system infrastructures.
- Industry papers are encouraged to demonstrate recent trends and demands of real system infrastructures in industry, and to offer insightful discussion of the challenges and experiences of developing real system infrastructures from an industry perspective.
- Work-in-progress papers are encouraged to present new and crazy ideas for building future system infrastructure. We will favor submissions that have great potential to inspire interesting discussions, so it is fine if the work has only an early prototype of the system.
HotInfra'23 welcomes submissions on any topic related to system infrastructure and platforms. Specific areas include but are not limited to:
- Systems architecture and hardware devices
- Operating systems and runtime systems support
- Resource management and task scheduling
- Empirical evaluation of real infrastructures
- Security and reliability within new infrastructures
- Energy efficiency and renewable energy supply
- Emerging applications and cloud services enabled by new infrastructures
- Hardware virtualization
- System-building approaches
HotInfra'23 submissions must be no longer than two double-column pages, excluding references. By default, all accepted papers can be presented in the poster session. We will post presentation slides and accepted papers on the workshop website. There will be no formal proceedings, so authors can extend and publish their work in other conferences and journals. The HotInfra'23 workshop will also feature invited talks from industry and academia.
Please submit your work here.
- Submission Deadline: May 12, 2023 (extended from May 5, 2023)
- Author Notifications: May 29, 2023
- Final Papers due: June 9, 2023
- Workshop: June 18, 2023
Location: Magnolia 14, Marriott World Center Orlando
(All times are in EST)
A 6-word story on the future of Infrastructure: AI-driven, Software-defined, Uncomfortably Exciting
Parthasarathy Ranganathan (Google)
We are at an interesting inflection point in the design of computing systems. On one hand, demand for computing is accelerating at phenomenal rates, powered by the AI revolution and ever deeper processing on larger volumes of data, and amplified by smart edge devices and cloud computing. On the other hand, Moore’s law is slowing down. This is challenging traditional assumptions around cheaper and more energy-efficient systems every generation, and leading to a significant supply-demand gap for future computing systems. In this talk, we discuss how this current computing landscape motivates a significant rethinking of how we design future infrastructure. We present two broad themes around (1) efficient systems design through custom silicon accelerators and (2) efficient systems utilization through software-defined infrastructure. We will summarize our experience in these areas, and discuss key learnings and future opportunities for innovation. Looking ahead, we will highlight some additional grand challenges and opportunities for the community, specifically touching on key themes around agility, modularity, reliability, and sustainability, as well as the disruptive potential of cloud computing, and the opportunities beyond compute, around storage.
Parthasarathy (Partha) Ranganathan is currently a VP and Technical Fellow at Google, where he is the area technical lead for hardware and datacenters, designing systems at scale. Prior to this, he was an HP Fellow and Chief Technologist at Hewlett Packard Labs, where he led their research on systems and data centers. Partha has worked on several interdisciplinary systems projects with broad impact on both academia and industry, including widely used innovations in energy-aware user interfaces, heterogeneous multi-cores, power-efficient servers, accelerators, and disaggregated and data-centric data centers. He has published extensively (including co-authoring the popular "The Datacenter as a Computer" textbook), is a co-inventor on more than 100 patents, and has been recognized with numerous awards. He has been named a top-15 enterprise technology rock star by Business Insider and one of the top 35 young innovators in the world by MIT Tech Review, and he is a recipient of the ACM SIGARCH Maurice Wilkes Award, Rice University's Outstanding Young Engineering Alumni Award, and the IIT Madras Distinguished Alumni Award. He is also a Fellow of the IEEE and ACM, and is currently on the board of directors of the Open Compute Project.
Session 1: AI Infrastructure: What Does It Look Like in the Future?
Session Chair: Jian Huang (UIUC)
To virtualize or not to virtualize AI Infrastructure: A perspective
Abstract: Modern data-driven applications (such as AI training and inference) are powered by Artificial Intelligence (AI) infrastructure. AI infrastructure is often available as bare-metal machines (BMs) in on-premises clusters but as virtual machines (VMs) in most public clouds. Why does this dichotomy between BMs on-prem and VMs in public clouds exist? What would it take to deploy VMs on AI systems while delivering bare-metal-equivalent performance? We will answer these questions based on our experience building and operationalizing a large-scale AI system called Vela in IBM Cloud. Vela is built on open-source Linux KVM and QEMU technologies, with which we deliver near-bare-metal performance (within 5% of BM) inside VMs. VM-based AI infrastructure not only affords near-BM performance but also provides cloud characteristics such as elasticity and flexibility in infrastructure management.
Hardware-Assisted Virtualization for Neural Processing Units
Abstract: Modern cloud platforms have deployed neural processing units (NPUs) to meet the increasing demand for machine learning (ML) services. However, the current way of using NPUs in cloud platforms suffers from either low resource utilization or poor isolation between multi-tenant application services due to the lack of system virtualization support for fine-grained resource sharing.
In this paper, we investigate system virtualization techniques for NPUs across the entire hardware and software stack. In the hardware stack, we design a hardware-assisted multi-tenant NPU for fine-grained resource sharing and isolation. It employs an operator scheduler on the NPU core to enable concurrent operator execution and flexible priority-based resource scheduling. In the software stack, we propose a flexible vNPU abstraction, which we leverage to design vNPU allocation, mapping, and scheduling policies that maximize resource utilization while guaranteeing both performance and security isolation for vNPU instances at runtime.
General-purpose processing on AI/ML accelerators
Abstract: The emergence of novel hardware accelerators has powered the tremendous growth of machine learning in recent years. These accelerators deliver unmatched performance gains in processing high-volume matrix operations, particularly matrix multiplication, a core component of neural network training and inference.
In this work, we explore the direction of using AI/ML accelerators for applications beyond AI/ML. We first examine opportunities for accelerating database systems using NVIDIA's Tensor Core Units (TCUs). We present TCUDB, a TCU-accelerated query engine that processes a set of query operators, including natural joins and group-by aggregates, as matrix operations within TCUs. Expressing such operators as matrix multiplication was considered inefficient in the past, so this strategy has remained largely unexplored in conventional GPU-based databases, which primarily rely on vector or scalar processing. We demonstrate the significant performance gains of TCUDB in a range of real-world applications, including entity matching, graph query processing, and matrix-based data analytics. TCUDB achieves up to 288x speedup over a baseline GPU-based query engine.
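To illustrate how a join can become a matrix multiplication, here is a minimal sketch in plain Python (standard library only), assuming two-column relations R(a, b) and S(b, c) and a COUNT(*) aggregate grouped by (a, c). Each relation is encoded as a count matrix over its column domains, so a single matrix product yields the grouped join counts. This is an illustrative encoding, not TCUDB's actual implementation; the helper names `to_matrix`, `matmul`, and `join_count_by` are hypothetical, and `matmul` stands in for the step a TCU would execute in hardware.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

Matrix = List[List[int]]


def to_matrix(pairs: Sequence[Tuple[Hashable, Hashable]],
              rows: Sequence[Hashable], cols: Sequence[Hashable]) -> Matrix:
    """Encode a two-column relation as a count matrix: m[i][j] = #tuples (rows[i], cols[j])."""
    row_idx = {v: i for i, v in enumerate(rows)}
    col_idx = {v: j for j, v in enumerate(cols)}
    m = [[0] * len(cols) for _ in rows]
    for x, y in pairs:
        m[row_idx[x]][col_idx[y]] += 1
    return m


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Dense matrix product; on a TCU this is the step done by the matrix hardware."""
    inner = len(b)
    width = len(b[0]) if b else 0
    return [[sum(a[i][t] * b[t][j] for t in range(inner)) for j in range(width)]
            for i in range(len(a))]


def join_count_by(r, s) -> Dict[Tuple[Hashable, Hashable], int]:
    """SELECT a, c, COUNT(*) FROM R NATURAL JOIN S GROUP BY a, c, as one matrix product."""
    # Domains are sorted only to get a deterministic encoding; values must be comparable.
    a_vals = sorted({a for a, _ in r})
    b_vals = sorted({b for _, b in r} | {b for b, _ in s})
    c_vals = sorted({c for _, c in s})
    prod = matmul(to_matrix(r, a_vals, b_vals), to_matrix(s, b_vals, c_vals))
    # Non-zero cells are exactly the (a, c) groups that appear in the join result.
    return {(a_vals[i], c_vals[j]): prod[i][j]
            for i in range(len(a_vals)) for j in range(len(c_vals))
            if prod[i][j] > 0}
```

For R = [("x", 1), ("x", 2), ("y", 2)] and S = [(1, "p"), (2, "p"), (2, "q")], `join_count_by(R, S)` returns {("x", "p"): 2, ("x", "q"): 1, ("y", "p"): 1, ("y", "q"): 1}. The dense encoding pays off when the column domains are small relative to the number of tuples; high-cardinality keys make the matrices mostly zeros.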
We then extend AI/ML accelerators with additional operations to support more workloads. We propose SIMD2, a new programming paradigm that supports generalized matrix operations with a semiring-like structure. SIMD2 instructions accelerate eight more types of matrix operations in addition to matrix multiplication. Because SIMD2 instructions resemble a matrix-multiplication instruction, the SIMD2 architecture can be built on top of any MXU architecture with minimal modifications. SIMD2 provides up to 38.59x speedup, and more than 10.63x on average, over optimized CUDA programs, with only 5% full-chip area overhead.
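A semiring-like structure means the multiply-accumulate inside a matrix product is swapped for a different pair of operations. The plain-Python sketch below assumes a generic (add, mul, zero) triple; it is not the SIMD2 instruction set, only an illustration of the generalization it targets. With (+, ×) it is ordinary matrix multiplication; with (min, +), the tropical semiring, one product relaxes all-pairs shortest paths by one hop. The function names are hypothetical.

```python
import math
from typing import Callable, List

Matrix = List[List[float]]
BinOp = Callable[[float, float], float]


def semiring_matmul(a: Matrix, b: Matrix, add: BinOp, mul: BinOp, zero: float) -> Matrix:
    """C[i][j] = add-reduction over t of mul(a[i][t], b[t][j]), seeded with add's identity."""
    inner = len(b)
    width = len(b[0]) if b else 0
    c = [[zero] * width for _ in a]
    for i in range(len(a)):
        for j in range(width):
            acc = zero
            for t in range(inner):
                acc = add(acc, mul(a[i][t], b[t][j]))
            c[i][j] = acc
    return c


def plus_times(a: Matrix, b: Matrix) -> Matrix:
    """Ordinary matrix multiplication: the (+, *) semiring with additive identity 0."""
    return semiring_matmul(a, b, lambda x, y: x + y, lambda x, y: x * y, 0.0)


def min_plus(a: Matrix, b: Matrix) -> Matrix:
    """Tropical (min, +) semiring with identity +inf: one shortest-path relaxation step."""
    return semiring_matmul(a, b, min, lambda x, y: x + y, math.inf)
```

For the 3-node distance matrix D = [[0, 3, inf], [inf, 0, 1], [2, inf, 0]], `min_plus(D, D)` gives [[0, 3, 4], [3, 0, 1], [2, 5, 0]], the all-pairs shortest distances, since a shortest path in a 3-node graph uses at most two edges. Only the two scalar operations change between the variants, which is why such instructions can reuse a matrix-multiply datapath.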
Session 2: Memory, Memory, Memory!
Session Chair: Dimitrios Skarlatos (CMU)
Exploring Memory Expansion Designs for Training Mixture-of-Experts Models
Abstract: Machine learning (ML) has achieved impressive success in a variety of fields. It is widely recognized that increasing model parameters can enhance ML model capabilities. As a result, the sizes of ML models have grown exponentially in recent years. However, this expansion in model parameters introduces computational and memory-related challenges. Mixture-of-Experts (MoE) has emerged as a potential solution to mitigate computational costs. In traditional dense models, the computational cost increases linearly with model size, since all parameters are involved in the training process. MoE models differ from traditional dense models in that they partition the parameters and selectively activate only a subset of them, improving model quality without substantially increasing computational costs. However, MoE models require additional memory capacity, which has led to the adoption of high-capacity but expensive HBM for fast GPU memory access. Because GPU memory capacity has not scaled proportionately, this has given rise to the challenge known as the GPU memory wall. Memory expansion techniques can help overcome the GPU memory wall by enabling GPUs to access remote memory. Although memory expansion has been studied for decades, it has not been explored in the context of training MoE models. This study aims to investigate a range of memory expansion design options, with the goal of optimizing both performance and performance per cost for training 1-trillion-parameter MoE models.
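The selective activation described above is commonly implemented as top-k gating. This minimal Python sketch assumes a scalar token, experts given as plain callables, and router logits supplied directly (in a real model a learned gating layer produces them from the token); `moe_forward` and its parameters are hypothetical. It shows why per-token compute scales with k while memory must still hold every expert's parameters, which is the capacity pressure that motivates memory expansion.

```python
import math
from typing import Callable, List, Sequence, Tuple

Expert = Callable[[float], float]


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x: float, gate_logits: Sequence[float],
                experts: Sequence[Expert], k: int = 2) -> Tuple[float, List[int]]:
    """Route one token to its top-k experts and mix their outputs.

    Only k of the len(experts) experts run, so compute per token scales with k,
    but every expert's parameters must still be resident in memory.
    """
    # Rank experts by router score; ties keep the lower index first (stable sort).
    top = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Renormalize the gate over the selected experts only.
    weights = softmax([gate_logits[i] for i in top])
    out = sum(w * experts[i](x) for w, i in zip(weights, top))
    return out, top
```

With four experts that scale their input by 1, 2, 3, and 4 and logits [0.1, 2.0, 0.5, 2.0], `moe_forward(1.0, logits, experts, k=2)` evaluates only experts 1 and 3, each with weight 0.5, and returns (3.0, [1, 3]).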
Memory Disaggregation: Open Challenges in the Era of CXL
Abstract: Compute and memory are tightly coupled within traditional datacenter servers. Large-scale datacenter operators have identified this coupling as a root cause behind fleet-wide resource underutilization and increasing Total Cost of Ownership (TCO). With the advent of ultra-fast networks and cache-coherent interfaces, memory disaggregation has emerged as a potential solution, whereby applications can leverage available memory even outside server boundaries. In this paper, we discuss some open challenges from a software perspective toward building next-generation memory disaggregation systems leveraging emerging cache-coherent interconnects.
On the Discontinuation of Persistent Memory: Looking Back to Look Forward
Abstract: With Intel's announcement in July 2022 that it would discontinue its Optane DC Persistent Memory (DCPMM), it is time to learn from our existing experience and look ahead. In this paper, we have 1) surveyed public reports from organizations to understand how they utilize DCPMM; 2) measured the performance of the DCPMM 200 series to understand whether it could have saved the product; and 3) held discussions with relevant developers at major IT companies. Based on this information, we argue that the memory mode of DCPMM deserves more attention and that it is necessary to study the sweet spots for persistent memory before investing heavily.
What's Hot? Post-Moore Datacenter Architecture
Babak Falsafi (EPFL)
Datacenters are the pillars of the digital economy and modern-day global IT services. The building blocks of today's datacenters are cost-effective volume servers that find their roots in the basic hardware and OS organization of the desktops of the 90s and are fundamentally mismatched with datacenter workloads and services. Meanwhile, many technological trends (e.g., the slowdown of Moore's Law), application trends (e.g., the rapid adoption of AI), and societal challenges (e.g., the climate impact of computing) dictate innovation in datacenter design, from algorithms to housing infrastructure. Post-Moore datacenters are hot both because of their trajectory to consume (and dissipate) unprecedented levels of energy and because of the many hot research avenues to pursue for an infrastructure whose building blocks belong to the 90s. In this talk, I will motivate and go over these research avenues.
Babak is a Professor and the founding director of EcoCloud at EPFL. His contributions to computer systems include the first NUMA multiprocessors built by Sun Microsystems (WildFire/WildCat), memory streaming integrated in IBM BlueGene (temporal) and ARM cores (spatial), and performance evaluation methodologies in use by AMD, HP and Google PerfKit. He has shown that memory consistency models are neither necessary nor sufficient to achieve high performance in servers. These results led to fence speculation in modern CPUs. His work on workload-optimized server processors laid the foundation for the first generation of Cavium ARM server CPUs, ThunderX. He is a recipient of an Alfred P. Sloan Research Fellowship, and a fellow of ACM and IEEE.
Session 3: System Efficiency: The Classic Never Dies
Session Chair: Jian Huang (UIUC)
Giving Old Servers New Life at Hyperscale
Abstract: To address the threat of climate change, new methods are needed to reduce the significant and increasing carbon emissions of datacenter computing systems. Minimizing hardware waste is crucial to achieving carbon reductions, as previous research has shown that up to 50% of datacenter emissions are "embodied" emissions resulting from the manufacturing and transport of server hardware. This study takes a first step toward understanding how older hardware can be reused in a datacenter setting while preserving end-to-end service performance. We perform the first set of experiments to analyze the end-to-end and per-microservice performance of a datacenter application on two different server types and generations. The results of our experiments identify specific operating regions where older hardware does not degrade application performance. Our work motivates a scheduling system that exploits microservices' behavior on certain hardware generations to prevent environmentally costly server refreshes and hardware waste.
Sidecars on the Central Lane: Impact of Network Proxies on Microservices
Abstract: Cloud applications are increasingly moving away from their traditional monolithic structure toward complex, loosely coupled microservices. Service meshes are widely used to implement microservice applications, mainly because they provide a modular architecture that separates operational features from application business logic. Sidecar proxies in service meshes enable this modularity by applying security, networking, and monitoring policies to the traffic to and from services. To implement these policies, sidecars often execute complex chains of logic that vary across the associated applications and end up unevenly impacting the performance of the overall application. A lack of understanding of how sidecars impact the performance of microservice-based applications stands in the way of building performant and resource-efficient applications. To this end, we bring sidecar proxies into focus and argue that we need to study their impact on system performance and resource utilization in depth. We identify and describe challenges in characterizing sidecars, namely the need for microarchitectural metrics and comprehensive methodologies, and discuss research directions where such characterization will help in building efficient service mesh infrastructure for microservice applications.
128-bit Addresses for the Masses (of Memory and Devices)
Abstract: The ever-growing storage and memory needs of computer infrastructures make 128-bit addresses a possible long-term solution for accessing vast swaths of data uniformly. In this abstract, we share our thoughts on what this would entail from a hardware/software perspective.
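For a sense of scale, here is a short Python calculation of the byte-addressable space at different address widths, assuming a flat, byte-granular address space: 64 bits already reach 16 EiB, and 128 bits multiply that by another factor of 2^64. The names `EIB` and `addressable_bytes` are illustrative only.

```python
EIB = 2 ** 60  # bytes in one exbibyte (EiB)


def addressable_bytes(bits: int) -> int:
    """Bytes reachable with a flat, byte-granular address of the given width."""
    return 2 ** bits


for bits in (64, 128):
    size = addressable_bytes(bits)
    print(f"{bits}-bit addresses: 2**{bits} bytes = {size // EIB} EiB")
```

The 64-bit line reports 16 EiB; the 128-bit figure is 2**64 times larger, far beyond any single machine today, which is why the appeal lies in uniform naming of data across many memories and devices rather than raw capacity.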
Session 4: New Infrastructure, New Future
Session Chair: Muhammad Shahbaz (Purdue University)
Building Next-Generation Software-Defined Data Centers with Network-Storage Co-Design
Abstract: Software-defined networking (SDN) and software-defined flash (SDF) have become the backbone of modern data centers. They are managed separately to handle I/O requests. At first glance, this is a reasonable design, as it follows rack-scale hierarchical design principles. However, it suffers from suboptimal end-to-end performance due to the lack of coordination between SDN and SDF.
In this paper, we take an initial step toward building the next-generation software-defined data center by co-designing the SDN and SDF stacks. We redefine the functions of their control planes and data planes, and split them up with a new architecture named NetFlash. NetFlash has three major components: (1) coordinated I/O scheduling, which coordinates I/O scheduling across the network and storage stacks to achieve predictable end-to-end performance; (2) coordinated garbage collection (GC), which coordinates GC across the SSDs in a rack to minimize its impact on incoming I/O requests; and (3) rack-scale wear leveling, which enables global wear leveling among the SSDs in a rack by periodically swapping data, improving device lifetime across the rack.
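Rack-scale wear leveling by periodic data swapping might look roughly like the following. This Python sketch assumes each SSD reports a cumulative erase count and the write rate of the data it currently hosts, and that one swap exchanges the data of the most- and least-worn drives; the `Ssd` class, `wear_level_step`, and `threshold` are hypothetical and are not NetFlash's published policy.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Ssd:
    name: str
    erase_count: int          # cumulative program/erase cycles consumed so far
    hosted_write_rate: float  # write rate of the data currently placed on this drive


def wear_level_step(rack: List[Ssd], threshold: int) -> Optional[Tuple[str, str]]:
    """One periodic round of rack-scale wear leveling (hypothetical policy).

    If the erase-count gap between the most- and least-worn drives exceeds
    `threshold`, swap their data so the hotter data lands on the fresher drive.
    Returns the (most-worn, least-worn) drive names if a swap happened, else None.
    """
    worn = max(rack, key=lambda d: d.erase_count)
    fresh = min(rack, key=lambda d: d.erase_count)
    if worn.erase_count - fresh.erase_count <= threshold:
        return None  # wear is balanced enough; skip the migration cost
    if worn.hosted_write_rate <= fresh.hosted_write_rate:
        return None  # the worn drive already holds the colder data
    # Model the data swap by exchanging the write load each drive will absorb.
    worn.hosted_write_rate, fresh.hosted_write_rate = (
        fresh.hosted_write_rate, worn.hosted_write_rate)
    return worn.name, fresh.name
```

With erase counts 900, 100, and 400 and a threshold of 300, one step moves the hot data from the most-worn drive "a" to the least-worn drive "b"; an immediate second step is a no-op because the most-worn drive now holds the colder data. A real system would also need to throttle swaps, since each migration itself consumes write endurance.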
Don't Let Your LEO Edge Fade at Night
Abstract: The Low Earth Orbit (LEO) satellite edge has emerged as a promising solution to alleviate data congestion on satellite-ground links. However, existing approaches either offer inflexible fixed-function deployments or focus solely on addressing infrastructure mobility. In this paper, we shed light on the unique challenges posed by the varying energy harvested by satellites, which necessitate a fresh perspective on orchestration within satellites. Our work serves as a compelling call to integrate energy as a first-class metric for orchestrating applications within the LEO satellite infrastructure, which we pose as the new frontier of computing infrastructure.
Fine-Grain Slicing of Edge Cloud Servers for Radio Workloads
Abstract: Autoscaling edge servers introduces additional processing time due to the rigid configuration of containers' resource limits, rendering it unsuitable for delay-sensitive Radio Access Network (RAN) slicing workloads. We propose fine-grained autoscaling for RAN slicing workloads in the edge cloud. We introduce an analytical approach that uses stochastic decision processes to dynamically tune resource limits based on the inherent fluctuations in radio workloads, enabling the system to effectively meet dynamic resource demands. We show preliminary results using the Roofline model and an asymptotic analysis of the Low Density Parity Check (LDPC) decoding algorithm as a comprehensive case study for radio workload characterization. Leveraging this analysis, we establish processing time design constraints. Furthermore, we discuss the limitations of our proposed model and delineate avenues for future research.
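The Roofline model caps attainable throughput at the lesser of peak compute and memory bandwidth times arithmetic intensity (FLOPs per byte moved). The minimal Python sketch below assumes a hypothetical machine with 1 TFLOP/s peak and 100 GB/s memory bandwidth and a hypothetical kernel; these numbers are illustrative and are not measurements of LDPC decoding. The same two roofs also give a lower bound on processing time, which is the kind of constraint a delay-sensitive radio workload has to fit within.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    peak_flops: float  # peak compute throughput, FLOP/s
    mem_bw: float      # sustainable memory bandwidth, bytes/s

    @property
    def ridge_point(self) -> float:
        """Arithmetic intensity (FLOP/byte) where the memory and compute roofs meet."""
        return self.peak_flops / self.mem_bw


def attainable_flops(m: Machine, intensity: float) -> float:
    """Roofline: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(m.peak_flops, intensity * m.mem_bw)


def min_processing_time(m: Machine, flops: float, bytes_moved: float) -> float:
    """Lower bound on runtime: the kernel can finish no faster than either roof allows."""
    return max(flops / m.peak_flops, bytes_moved / m.mem_bw)


def classify(m: Machine, intensity: float) -> str:
    return "compute-bound" if intensity >= m.ridge_point else "memory-bound"


# Hypothetical machine and kernel, for illustration only.
machine = Machine(peak_flops=1e12, mem_bw=100e9)
kernel_flops, kernel_bytes = 2e9, 1e9
intensity = kernel_flops / kernel_bytes
print(classify(machine, intensity), attainable_flops(machine, intensity),
      min_processing_time(machine, kernel_flops, kernel_bytes))
```

Here the ridge point is 10 FLOP/byte, so a kernel at 2 FLOP/byte is memory-bound: it can reach at most 2e11 FLOP/s and needs at least 0.01 s. Raising arithmetic intensity, or allocating more memory bandwidth rather than more cores, is what moves such a kernel's floor.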