Application composition is a critical capability and will be foundational to the way extreme-scale systems are used in the future. The high-performance computing (HPC) community is already seeing the need for tighter integration of modeling and simulation with advanced analysis, and ad hoc solutions for coupling multiple simulations, as well as for integrating simulation and analysis, are being developed and deployed. These ad hoc approaches are often hindered by system interfaces that were not designed to provide the full semantic capability required, making it difficult to deliver scalable, high-performance implementations.
A recent workshop report published by the U.S. Department of Energy (DOE) describes the challenges facing OS/Rs for future extreme-scale scientific computing systems. Many of these challenges, such as the need for increased reliability and the desire to reduce power and energy use, are largely driven by limitations in hardware technology, and the computer architecture community is vigorously pursuing potential approaches and solutions. While the OS/R is an important component in exploiting hardware-based solutions, addressing near-term hardware limitations without considering application composition will lead to incomplete solutions. A more effective approach is to consider these challenges in the context of application composition and to allow for integration of the hardware or algorithmic solutions required to address them.
Our goal is a flexible OS/R, capable of hosting existing petascale applications, and supporting new operating modes better suited to new architectures and programming models. To motivate our approach, we present a set of use cases that represent existing and developing codes important to the DOE mission.
The exploratory analytics use case represents application workflows that include a large-scale simulation that feeds an analysis or visualization code. The traditional approach performs analysis as a post-processing step, storing intermediate results to the file system. On extreme-scale systems, the sheer volume of data could make this approach impractical for many types of analysis. Instead, we are interested in workflows that execute simulation, analysis, and visualization concurrently. Examples of existing exploratory analytics applications include coupling shock physics with ParaView for fragment detection, a Quantum Monte Carlo (QMC) materials code coupled with a service to generate observables in a different coordinate system, and the Pixie3D magnetohydrodynamics (MHD) code coupled to PixiePlot and ParaView.
In the streaming analytics use case, a simulation produces small data products to be consumed in an event-based control system. The events follow a distributed data-flow model. A streaming analytics application can be instantiated as a directed graph of computing components. It does not require the full features of traditional HPC runtimes, but does require mechanisms for event routing and transfer, proper placement in a network topology, and signals for event activation. Examples include topological analysis for the S3D combustion simulation code, and streaming network analysis.
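The directed-graph structure of a streaming analytics application can be illustrated with a small sketch. The names below (`Component`, `route`) and the depth-first routing strategy are illustrative assumptions, not part of any Hobbes interface: components consume events, emit zero or more output events, and are activated whenever an event arrives on an incoming edge.

```python
# Hypothetical sketch of a streaming-analytics application as a directed
# graph of computing components. Names and routing policy are illustrative.

class Component:
    """A node in the dataflow graph: consumes an event, emits zero or more."""
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn          # event -> iterable of output events
        self.downstream = []  # outgoing edges of the directed graph

    def connect(self, other):
        self.downstream.append(other)
        return other

def route(source, event):
    """Event routing: activate each component as events reach it."""
    results = []
    pending = [(source, event)]
    while pending:
        comp, ev = pending.pop()
        for out in comp.fn(ev):
            if comp.downstream:
                pending.extend((d, out) for d in comp.downstream)
            else:
                results.append(out)   # sink: collect final data products
    return results

# Example: a simulation emits samples; a filter drops small values;
# a sink scales what survives.
sim   = Component("sim",    lambda ev: [ev])
filt  = Component("filter", lambda ev: [ev] if ev > 0.5 else [])
scale = Component("scale",  lambda ev: [round(ev * 10)])
sim.connect(filt).connect(scale)

print(route(sim, 0.9))  # [9]
print(route(sim, 0.2))  # []
```

A real streaming runtime would add the placement and event-activation signals the text describes; the sketch only shows the graph abstraction itself.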
The graph analytics use case represents a host of problems including simple statistics gathering of marketing profiles, strong correlating methods, and knowledge creation with inferencing and deduction on large data structures such as semantic networks. Graph-analytics codes often require massive multithreading and global shared memory not supported by traditional HPC runtimes. Examples include social and economic modeling, as well as modeling of the power grid.
The code coupling use case demonstrates interactions between concurrent HPC simulations. Although the coupled applications execute concurrently, they may have dramatically different requirements for programming models and runtime support. Data sharing and synchronization needs may benefit from co-locating portions of the codes on the same physical nodes. Examples include coupling the XGC1 fusion code to the reduced-model code XGC-1A to allow for longer running times, and the LIME multiphysics coupling environment.
The application frameworks use case focuses on approaches to coordinating the execution and management of analysis results for large numbers of parallel jobs, for example, to conduct parameter studies or sensitivity analyses. This example represents the Many-Task Computing approach used by frameworks such as Falcon, Swift, and the Integrated Plasma Simulator (IPS). On current HPC platforms, frameworks make heavy use of scripting languages (e.g., Python) and communicate results through a global file system; however, we do not expect this use of the file system to be practical at extreme scales.
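The Many-Task Computing pattern can be sketched in a few lines: a coordinator launches one independent task per parameter point and gathers results in memory rather than through a global file system. The function names below are hypothetical stand-ins, not the APIs of Falcon, Swift, or IPS.

```python
# Illustrative sketch of Many-Task Computing for a parameter study:
# many independent tasks, results collected in memory. Names are hypothetical.

from concurrent.futures import ThreadPoolExecutor

def simulate(param):
    """Stand-in for one parallel job in the parameter study."""
    return param * param

def parameter_study(params, max_workers=4):
    """Launch one task per parameter point; gather results for analysis."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(params, pool.map(simulate, params)))

print(parameter_study([1, 2, 3, 4]))  # {1: 1, 2: 4, 3: 9, 4: 16}
```

At extreme scale each task would itself be a parallel job launched into its own enclave, but the coordination shape is the same.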
Our ability to develop technologies that meet the requirements while minimizing overhead and carefully managing application isolation is critical. Our approach leverages and extends technologies developed for the Kitten LWK and the Palacios lightweight virtualization system as the starting point for our work, with substantial restructuring to meet the needs of the Hobbes software stack.
Ad hoc composition of applications is already taking place in the HPC community, typically requiring that applications be adapted to a common runtime environment. In contrast, we propose to support all of the use cases described in the previous section with a unified application composition framework. This framework will support the composition of applications developed for different programming models (e.g., MPI and UPC) with potentially incompatible RTSs. Each application will have a single runtime environment and will run in an independent enclave. As such, application composition becomes enclave composition.
Although Hobbes focuses on coarse-grain composition of applications, where there is a clear separation in runtime requirements, we do not prevent finer-grain compositions, such as the desire to mix multiple threading packages within the same component (e.g., OpenMP, Threading Building Blocks, and pthreads). While this is fundamentally an application/runtime issue, our thread creation and scheduling mechanisms are flexible enough to support all common threading models. One key challenge with our approach is the possibility that the applications being composed may have markedly different runtime requirements. For example, one application may have minimal runtime requirements, while another might require a full-featured OS, like Linux.
Another challenge involves the mapping of enclaves onto the compute nodes of an extreme-scale system. The most straightforward mapping would allocate independent nodes for each enclave, using the interconnection network to join (compose) the enclaves (applications). While this mapping is easy to implement, performance considerations dictate that we also support mappings based on direct sharing of node-level resources. For example, in the Exploratory Analytics use case, a small filter needs to be interposed between the application and visualization enclaves. While we will describe the use case as a standard composition of applications, our intent is that the implementation will be lightweight (e.g., a function call) and will reduce the data communicated to the visualization enclave. Ultimately, mapping decisions must be based on a careful analysis of performance tradeoffs, and it is our intention to maintain a separation between composition and mapping: composition is used to describe computations, while mapping defines the physical resources used to realize these computations.
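The separation between composition and mapping can be made concrete with a small sketch. All data structures and names below are hypothetical: one composition (enclaves and their couplings) is realized by two alternative mappings, and the mapping alone determines whether a coupling crosses the network or shares node-local memory.

```python
# A minimal sketch of keeping composition (what is computed) separate
# from mapping (which physical resources realize it). Names are hypothetical.

composition = {
    "enclaves": ["simulation", "filter", "visualization"],
    "couplings": [("simulation", "filter"), ("filter", "visualization")],
}

# Two alternative mappings of the same composition: co-locate the filter
# with the simulation (a cheap, function-call-like coupling), or give it
# dedicated nodes and couple over the network.
colocated = {"simulation": range(0, 64), "filter": range(0, 64),
             "visualization": range(64, 72)}
dedicated = {"simulation": range(0, 64), "filter": range(64, 66),
             "visualization": range(66, 74)}

def coupling_media(mapping, couplings):
    """Classify each coupling as node-local (shared memory) or network."""
    return {(a, b): "memory" if set(mapping[a]) & set(mapping[b]) else "network"
            for a, b in couplings}

print(coupling_media(colocated, composition["couplings"]))
# {('simulation', 'filter'): 'memory', ('filter', 'visualization'): 'network'}
```

Changing the mapping changes only the realization, never the description of the computation, which is exactly the separation the text argues for.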
It is worth noting that our approach of mapping enclaves to shared physical resources provides the opportunity to, effectively, move code to data, thus reducing (network) data movement and thereby power consumption, as well as increasing performance. With the addition of appropriate trust, sandboxing, and resource management mechanisms, the same approach can be used to allow interactions between applications and shared services. For example, an application could push code to an enclave implementing a storage service (i.e., active storage).
To address the challenges of variable runtime requirements and complex application mappings, we will develop a wide range of virtualization technologies. These technologies will support the definition of virtual compute nodes, enabling the full system virtualization that can be used to support almost any RTS that could be required. By supporting the creation of multiple virtual nodes from a single physical node, these technologies will also enable sharing of the physical resources available on a compute node.
The System-Global OS (SGOS) is the portion of the system responsible for scheduling, monitoring, controlling, and coordinating the resources of a single system. We pay particular consideration to how applications use shared system services, such as shared storage or visualization systems; indeed, the need to effectively use shared services is a primary motivator for our approach to application composition. In Hobbes, however, a shared service is simply a specialized, possibly long-running application whose enclave may include specialized nodes, e.g., storage or visualization nodes.
The APIs for the SGOS include interfaces to a scheduling and resource management subsystem responsible for mapping enclaves and assemblies of enclaves onto physical resources; interfaces to hardware management and control systems that manage information about the health and productivity of associated hardware and software in the system, possibly gathered from sensors; interfaces for autonomic management that support a wide range of policies for adapting resource allocations, for example based on power or resilience constraints; and interfaces to basic global services like authentication and authorization.
In addition to the core APIs for the SGOS, application composition requires a much richer specification of jobs than in the past. Job descriptions may include OS/R selections for enclaves, information to guide the mapping of the logical computation structure onto the physical resources, and policies to be enforced by the autonomic management systems. We will draw on related work in the cloud and grid computing communities to inform our designs.
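One way to picture the richer job description is as a structured document combining the three ingredients the text names: per-enclave OS/R selections, mapping hints, and autonomic policies. The schema, field names, and values below are purely illustrative assumptions, not a proposed Hobbes format.

```python
# Hypothetical sketch of a richer job description for a composed
# application: OS/R selections, mapping hints, and autonomic policies.
# The schema and all field names are illustrative only.

job = {
    "enclaves": {
        "simulation": {"os_r": "kitten+mpi", "nodes": 1024},
        "analysis":   {"os_r": "linux+python", "nodes": 32},
    },
    "mapping_hints": [
        # Ask the scheduler to co-locate analysis with the simulation.
        {"colocate": ["simulation", "analysis"]},
    ],
    "policies": {
        "power_cap_watts": 250_000,
        "on_node_failure": "shrink_enclave",
    },
}

def validate(job):
    """Reject descriptions whose hints reference unknown enclaves."""
    names = set(job["enclaves"])
    return all(set(h.get("colocate", [])) <= names
               for h in job["mapping_hints"])

print(validate(job))  # True
```

Cloud and grid systems use comparable declarative job descriptions, which is why the text draws on that related work.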
An enclave is a partition of the system allocated to a single application or service. As such, an enclave is primarily a container and a unit of coordination. The Enclave OS/R (EOS), as a distinct component of the OS architecture, extends the enclave concept with functionality, in much the same way that an object extends a data structure with functionality. The primary reason for introducing the EOS as a distinct component of an extreme-scale OS is to allow each enclave to operate with a degree of autonomy, rather than the current practice of explicit top-down control of all aspects of the system from the equivalent of the SGOS.
The APIs of the EOS provide support for enclave membership; collective operations for launching, terminating, pausing, and resuming enclaves; and autonomic management similar to what is provided in the SGOS. Enclave membership is a key capability of the EOS. Since we intend to support dynamic enclaves, negotiations for the addition or removal of resources may be initiated by the RTS, the SGOS, or the EOS itself, for example in response to a node failure.
The composition of enclaves to assemble complex applications is the most important and novel feature of the EOS. Because each enclave, by definition, uniformly executes a single OS/R configuration, each can be tailored to the needs of its application. Composition of enclaves, then, is the selective breaking of the normal isolation between enclaves to allow direct interactions between Node OS/R instances. At the lowest level, composition of enclaves on distinct sets of nodes involves the network, while enclaves on virtual nodes co-located on the same physical node share hardware (e.g., memory). However, we propose to develop abstractions that provide a uniform mechanism, so that users can decide how to map composite applications onto the hardware based solely on the performance requirements of the coupling.
The Node OS/R (NOS) provides interfaces and abstractions to the underlying compute, memory, and network resources and also provides APIs for use by the Enclave OS/R (EOS) and System Global OS (SGOS) to support higher-level allocation and management of node-level resources. Important features of the NOS relate to managing compute, memory, and network resources. The NOS will need to expose compute capabilities from an increasingly complex set of compute resources, for example logic layers embedded in sophisticated devices like the Hybrid Memory Cube. Memory hierarchies will become much more complex and the NOS needs to provide interfaces allowing the runtime to manage this memory effectively. Finally, to maximize network efficacy and isolation, we will develop an infrastructure for differentiated network services that provides latency and bandwidth guarantees to traffic participants.
The Node Virtualization Layer (NVL) manages virtual node instances on physical nodes and provides hardware abstractions through the Hardware Abstraction Layer (HAL). The NVL, which is not called out in the workshop report, is central to our approach to composition and to providing a flexible environment capable of supporting the diverse Node OS/R requirements of applications.
The NVL must handle three distinct use cases: (1) the ability to support an application running with a minimal (native) node OS; (2) the ability to support full featured operating systems; and (3) the ability to create a multiplicity of virtual nodes to support the sharing of physical resources between applications in distinct enclaves. In previous work, we used the Kitten/Palacios code base to address the first two cases. The third use case provides motivation for much of the fundamental research that will be required for Hobbes. The mechanisms developed to create multiple virtual nodes must be efficient enough to realize the full benefit of shared resources, but flexible enough to provide the required degree of isolation between enclaves.
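The third use case, creating multiple virtual nodes from one physical node, can be sketched as a resource-partitioning step. The `carve` function, its fields, and the proportional-split policy are all hypothetical illustrations of the idea, not NVL mechanisms.

```python
# A hypothetical sketch of the NVL's third use case: carving one physical
# node into several virtual nodes so enclaves in distinct enclaves can
# share its resources. Field names and the split policy are illustrative.

def carve(physical, shares):
    """Split a physical node's cores and memory by per-enclave shares."""
    total = sum(shares.values())
    vnodes = {}
    for enclave, share in shares.items():
        vnodes[enclave] = {
            "cores":  physical["cores"] * share // total,
            "mem_gb": physical["mem_gb"] * share // total,
        }
    return vnodes

node = {"cores": 16, "mem_gb": 64}
print(carve(node, {"simulation": 3, "filter": 1}))
# {'simulation': {'cores': 12, 'mem_gb': 48}, 'filter': {'cores': 4, 'mem_gb': 16}}
```

The hard research questions the text raises (efficiency of the shared resources and the degree of isolation between the resulting virtual nodes) are precisely what this static split leaves out.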
The Global Information Bus (GIB) is the logical software layer that encapsulates mechanisms for sharing status information needed by other components in the OS/R. Core functionality in the GIB will support the gathering, publication and subscription of status information. This includes basic performance information (e.g., current energy consumption for each node) and information defined by a runtime (e.g., length of the unexpected message queue for MPI). Information communicated through the GIB has two distinguishing characteristics: first, the information is communicated in small parcels (perhaps a few words) and second, there is no expectation of coherence in this information (it is very unlikely that a scan through all of the nodes would result in a consistent view of the machine). The intention is to provide hints about the status of nodes that can be used in initial management decisions by the EOS, an adaptive runtime system, or tools.
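The publish/subscribe core of the GIB can be sketched directly from this description: small parcels, per-topic subscriptions, and best-effort "latest value" hints with no coherence guarantee across nodes. The class and topic names below are illustrative assumptions, not the actual Hobbes interfaces.

```python
# A minimal sketch of the GIB's gather/publish/subscribe core, assuming
# small, possibly stale status parcels. Names are illustrative.

from collections import defaultdict

class InformationBus:
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> callbacks
        self.latest = {}                      # (topic, node) -> last parcel

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, node, parcel):
        """Deliver a small parcel; no coherence across nodes is implied."""
        self.latest[(topic, node)] = parcel
        for cb in self.subscribers[topic]:
            cb(node, parcel)

    def hint(self, topic, node):
        """Best-effort, possibly stale hint for management decisions."""
        return self.latest.get((topic, node))

bus = InformationBus()
seen = []
bus.subscribe("energy_watts", lambda node, w: seen.append((node, w)))
bus.publish("energy_watts", node=7, parcel=182)
print(bus.hint("energy_watts", 7), seen)  # 182 [(7, 182)]
```

A scan of `latest` across all nodes would mix parcels published at different times, which is exactly the deliberate lack of a consistent machine-wide view described above.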
In many respects, the GIB represents an abstraction of the information collection and dissemination portion of the SGOS Hardware Monitoring and Control (HMC) subsystem, which on high-end HPC systems is implemented on distinct Reliability, Availability, and Serviceability (RAS) hardware. We expect to implement the functionality of the GIB on the RAS system where possible, but also using the main high-performance network on commodity clusters that may not have a separate RAS system.
Power and energy consumption will be significant constraints on next-generation systems. Accordingly, we need to ensure that Hobbes is power aware so that it can be a suitable tool for power management research, as well as production operation of power-managed systems.
Hobbes provides APIs for measurement and control of power throughout the system. We will leverage our recent work to develop a broad system power API specification. The power-measurement API will support measurement at the component, node, enclave, and system granularities (at a minimum). The API will also facilitate measuring existing and new processor, memory, or network technologies. The power control API provides a mechanism to manage the power consumption of devices that expose these features. Although this type of control is not common among computing devices, recent activity to expose more controls for managing power is encouraging. For example, power clamping has recently been introduced on Intel Sandy Bridge processors, and similar mechanisms exist on IBM POWER7 and AMD Bulldozer processors. Our API will provide an abstraction to explicitly control these devices as well as reason about their usage.
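The layered shape of such a power API can be sketched as follows. The classes, wattage figures, and clamp semantics are hypothetical illustrations; real controls (such as processor power-capping registers) differ in detail, and enclave- and system-level measurement would aggregate nodes the same way the node level aggregates components.

```python
# An illustrative sketch of a layered power API: measurement at several
# granularities plus a clamp (power-cap) control. All names and numbers
# are hypothetical.

class PowerDomain:
    """One measurable, optionally cappable device (CPU, DRAM, NIC, ...)."""
    def __init__(self, name, watts, cap=None):
        self.name, self.watts, self.cap = name, watts, cap

    def measure(self):
        # A clamp bounds what the device is allowed to draw.
        return min(self.watts, self.cap) if self.cap is not None else self.watts

    def clamp(self, cap_watts):
        self.cap = cap_watts

def node_power(domains):
    """Node-level measurement aggregates its component domains."""
    return sum(d.measure() for d in domains)

cpu, mem = PowerDomain("cpu", watts=95), PowerDomain("dram", watts=20)
cpu.clamp(80)  # explicit control over a device that exposes clamping
print(node_power([cpu, mem]))  # 100
```

The point of the abstraction is that a policy layer can clamp and measure uniformly without knowing which vendor mechanism sits underneath.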
Virtually every one of the APIs defined in the earlier sections will have a scheduling component. A key challenge is the design of bi-directional interfaces to support coordination of scheduling decisions between multiple layers in the OS/R stack. These interfaces must support a scheduling-related dialog between an upper-level component that establishes policy and a lower-level component that provides the mechanisms needed to implement the policy. As an example, the SGOS establishes policies for the full machine; however, the mechanisms needed to effect these policies will be provided by the EOS, the NOS, and possibly the NVL. The needed dialog is frequently complicated by the fact that the lower-level component is the first to be invoked when a scheduling decision is needed.
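The policy/mechanism dialog, including the complication that the lower level is invoked first and must call back up for a decision, can be sketched as below. Both classes and the low-power placement policy are hypothetical illustrations, not proposed interfaces.

```python
# A sketch of the bi-directional policy/mechanism dialog: the lower-level
# component (a node scheduler) is invoked first and calls upward to the
# policy layer for a decision. All interfaces are hypothetical.

class EnclavePolicy:
    """Upper layer: establishes policy (here, favor the lowest-power core)."""
    def decide(self, task, candidates):
        return min(candidates, key=lambda core: core["watts"])

class NodeScheduler:
    """Lower layer: owns the mechanism but defers the decision upward."""
    def __init__(self, policy, cores):
        self.policy, self.cores = policy, cores

    def schedule(self, task):
        core = self.policy.decide(task, self.cores)  # upward: ask for policy
        core.setdefault("tasks", []).append(task)    # downward: apply mechanism
        return core["id"]

cores = [{"id": 0, "watts": 15}, {"id": 1, "watts": 9}]
sched = NodeScheduler(EnclavePolicy(), cores)
print(sched.schedule("solver"))  # 1
```

Swapping in a different `EnclavePolicy` changes placement decisions without touching the node-level mechanism, which is the separation the bi-directional interface is meant to preserve.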
The OS/R must handle faults in all of its layers, masking faults when appropriate, and providing sufficient information and mechanisms to other OS layers and applications to handle unmasked faults. We expect Hobbes to be a vehicle for research in resilience mechanisms to handle faults in different layers of the software stack and to understand fault sensitivity and coverage.
One area of particular interest is the development of mechanisms that identify critical OS/R data structures (those most susceptible to failures and most critical for stability) and of a methodology for building resilient OS/R data structures. For example, previous analysis of the dynamic memory profiles of Kitten and Linux indicates ample opportunities for enhancements, such as using hashes, redundant data, and self-checks and repairs of OS/R data structures for specific allocation regions (e.g., memory/process management, kernel/shared memory), which we will incorporate into the Hobbes OS/R components as appropriate.
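One of the techniques named above, redundant data with self-checks and repairs, can be illustrated with a toy sketch. A kernel would apply this to raw memory for specific allocation regions; the class below is only a hypothetical, user-level illustration of the voting-and-repair idea.

```python
# A toy sketch of one resilient-data-structure technique from the text:
# redundant copies plus a self-check and repair on every read.
# Purely illustrative; not how a kernel would implement it.

class ResilientCell:
    """Stores three copies of a value; reads vote and repair outliers."""
    def __init__(self, value):
        self.copies = [value, value, value]

    def _corrupt(self, i, value):
        # Test hook simulating a silent memory error in one copy.
        self.copies[i] = value

    def read(self):
        # Majority vote among the copies, then repair any disagreeing copy.
        winner = max(set(self.copies), key=self.copies.count)
        self.copies = [winner, winner, winner]
        return winner

cell = ResilientCell(42)
cell._corrupt(1, 7)    # one copy silently corrupted
print(cell.read())     # 42: the vote masks the fault and repairs copy 1
```

Hash- or checksum-based self-checks trade the threefold space cost of this scheme for detection without correction, which is why the text lists several techniques rather than one.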