Boris Dimitrov, and K. Mani Chandy
Caltech Department of Computer Science 256-80, Pasadena, CA 91125
In this paper, we describe three contributions for distributed resource allocation in scientific applications. First, we present an abstract model in which different resources are represented as tokens of different colors; in this model, processes acquire resources by acquiring tokens of appropriate colors. Second, we present distributed scheduling algorithms that allow multiple resource managers to determine custom policies to control allocation of the tokens representing their particular resources. These algorithms allow multiple resource managers, each with its own resource management policy, to collaborate in providing resources for the whole system. Third, we present an actual implementation of a distributed resource scheduling algorithm framework using our abstract model. This implementation uses Infospheres, which are Internet communication packages written in Java, and showcases the benefits of distributing the task of resource allocation to multiple resource managers.
Often in scientific computing, a user needs access to several distributed heterogeneous resources. For instance, consider a scientist conducting a distributed experiment  that requires a supercomputer in one location, a visualization unit in another location, and a special high quality printer in still another location. All three resources are essential to the experiment; so, the scientist needs to synchronously lock and use all three distributed resources for the same time period to complete the computing task. The distributed heterogeneous resources together form a networked virtual supercomputer or metacomputer . Furthermore, the scientist wants resources to be scheduled automatically as a service of the appropriate software, whether with or without the inclusion of specific supplemental information such as the times the user is available to perform the experiment.
Traditional metacomputing resource allocation [1, 10] uses a central authority for scheduling, usually for efficiency. As a simple example, the IBM SP2 employs a scheduling algorithm  that reduces the wait time of jobs requiring only a few nodes, if these can be scheduled without delaying more computationally intensive jobs.
By contrast, consider the computation needs of users requiring resources managed by different groups in different geographic regions. Scheduling in this case is more complicated because it is impractical for individual sites to ``know'' global information that would help them to do more efficient scheduling .
The owner of a set of resources may have resource management policies that are different from those of owners of other resource sets. Our challenge is two-fold: (i) to establish methods of cooperation among owners so that the collection of owners offers system-wide resources to users, and (ii) to make the algorithms scalable so that new providers of resources can enter the common resource pool quickly and semi-autonomously.
The infrastructure for reserving resources in a distributed system is required in many applications. Our research deals with designs and implementations of distributed resource management schemes that coordinate different resource management schemes for different sets of resources. Though this paper addresses resources used in metacomputing, our research deals with resources in many distributed applications.
A convenient abstraction for such applications represents each indivisible resource by an indivisible token of some color ; different types of resources have different colors. For instance, a node of an IBM SP2 can be represented as a token of the IBM SP color. Likewise, a room in a hotel can be represented by a token of the hotel color.
Our model deals with time explicitly. So, a reservation can be made for 64 nodes of an IBM SP2 for 10 contiguous hours, or a hotel for seven nights. We also deal with locality in the sense that requests can be made for collections of tokens that are ``near'' each other.
At this point, we do not deal with arbitrary relationships and constraints between token colors. For instance, we do not automatically handle the relationship that a 200 MHz Pentium Pro is 1.8 times as fast as a 100 MHz Pentium Pro. Each kind of machine is represented by a token of a different color, and we do not deal automatically with relationships between tokens of different colors. Methods for registering relationships between colors and for automatically exploiting these relationships is a subject for future research.
Figure 1: Two models are given for resource reservation. On the left, the client simply asks for specific machines. On the right is a more advanced request, in which the client asks the Resource Reservation System (RRS) for two SGIs and two Pentiums. The RRS connects to separate resource managers that schedule time on the two clusters (using, for example, our calendar-based algorithm).
The centralized IBM SP2 scheduling algorithm relies on knowledge of how many nodes each process needs to ``promote'' less computationally-intensive tasks as necessary. On the other hand, as illustrated in figure 1, if each node in a supercomputer was to be scheduled independently in a distributed way, efficient scheduling becomes much more difficult. We believe that as metacomputing applications use distributed heterogenous systems, they will need algorithms for efficient distributed resource scheduling. In addition, in less benign environments, negotiation protocols might need to leverage the notion of resources as economic currency, perhaps employing fine-grained electronic commerce protocols.
This paper presents a general framework for heterogeneous resource reservation. Within this framework, we present a simple object-oriented Java implementation using Infospheres . Specific contributions are:
In section 2, we discuss some simple first attempts at distributed resource allocation algorithms, describing how they fail. As a solution, we employ calendars, which are useful for efficient resource allocation. In section 3, we describe how the calendar metaphor builds on our resources-as-tokens metaphor. A simple application to safe metacomputer scheduling across distributed resource managers is presented in section 4, after which we discuss efficient scheduling when resources are specified by attribute. We conclude with some observations in section 5.
The problem of distributed resource reservation has several simple solutions that are correct but may nonetheless be insufficient. We discuss the inadequacies of two such solutions - central server and local clocks - and demonstrate how calendars provide a more scalable solution.
One approach to resource reservation is to attempt to lock all of the resources the application wants. If an application is unable to lock a resource, it enters a queue waiting for it based on the priority of a logical local clock timestamp . If an application with lower priority has the resource but is not yet using that resource, that application must relinquish the resource (or token), deferring to the higher priority. This method is robust, and fairly scalable, but can be inefficient. For example, as illustrated in figure 2, client application 1 can be using resource 1, while client 2 is waiting to use it. 2 has locked resource 2 and is not using it, but is still preventing client 3 from using 2 (which 3 could use since it requires no other resources to run its task).
Figure 2: This picture illustrates the failure of the local clocks solution to sometimes use available resources. Client 1 holds resource 1, and client 2 is next in line for both resources 1 and 2. Because client 2 is blocking it, client 3 must wait for client 2 to finish (2 has higher priority), and hence client 1 to finish, even though client 3 does not even use resource 1. By using calendars, we ensure that a client is not locked out of an idle resource.
As discussed in section 1, we could improve efficiency by using a central scheduling algorithm allows a resource manager such as the one used by the IBM SP2. In fact, we could envision a global network with a single central server, which because of complete information, makes efficient allocations. However, this is clearly impractical from a scalability standpoint. Our goal is to recover some or all of the efficiency of a worldwide central server while maintaining the scalability features of distributed resource-management servers. In such a system, each resource has a server that controls access to that resource, and each client has the responsibility to communicate with these distributed servers to reserve distributed resources.
Calendars allow a nice tradeoff between scalability of resource managers and efficient utilization of resources. Allowing an application to ``make appointments'' in a calendar for resource reservation, a resource cannot be blocked from use while sitting idle: if a resource is unused, no application has an appointment for it at that time. Thus, efficient resource allocation is possible without global information. This calendar model is easily extensible to general resource reservation.
Individual resources use a calendar metaphor for arranging their schedule; the basic calendar functionality our implementation provides includes the concepts of time slots and access lists.
A time slot consists of a time interval with a particular time unit grain. Every time slot can be in one of three states: locked, held or available. Locked slots exist when a client has committed to using a resource during that slot; as a result, locked slots can be read but not written. Held slots are slots that a particular client is considering locking, but has not yet committed to locking. Only that particular client can write to these slots, thereby locking them; they are read-only for other clients. However, unlike a locked slot, a held slot can be released, reverting to available. Available slots may be read or written and converted to held or locked status. Note that slots correspond to the tokens discussed in section 1; so, resource reservation is tantamount to collecting the proper tokens.
Each slot has an associated access list that keeps track of which processes can obtain a lock on that slot. For instance, a resource manager may provide access to an authorized user from the Center for Research on Parallel Computation, but not grant access to anyone else. Thus, some slots may be available to only one set of users, while others can be available to other sets of users. This approach uses information hiding of slots to individuals not on their respective access lists, making it different from traditional ``whiteboard scheduling'' models.
The reservation of a set of resources is determined when all of the resource managers (or servers) agree to lock the slots that correspond to the same time. We implement reservations atomically using a two-phase commit protocol ; we note that the action starting with resource-request initiation and ending with resource-reservation commitment corresponds to an Infosphere session .
Figure 3: Our model for scheduling a meeting starts when a client sends agents to the various resources it desires. Agents communicate both with the client, and with the resources. For more efficiency, groups of nearby agents can coordinate to avoid excessive message-passing to respective clients, who may be geographically distant. The resource managers or servers can also send agents to clients to request back the slots that the clients hold.
Our general paradigm for resource reservation, using client requests and brokering agents, provides a testbed on which effective algorithms can be developed for specific tasks. This model is illustrated in figure 3.
Resource reservation is initiated by a client application making a request; note that this is the only interaction a user needs to have with the system. A resource can be represented generally by a boolean function over all possible Cartesian products of resources and meeting times, with additional weights given to represent hints. For example, requiring one Pentium and one SGI on Monday at 10am is a possible request that assigns boolean true to all combinations of resources that include the desired Pentium and SGI. In addition, hints can help the system choose more appropriate scheduling agendas. Although the general framework is too complex to implement directly in some applications, for any particular application, a suitable subset can be implemented.
Like ambassadors sent to foreign countries, the client's system can send a small set of instructions in Java to any resource manager to request computing time. We call these programs agents . Several efficiency improvements make agent communication attractive:
Note that not only can clients send agents to servers, but servers can send agents to clients to request back slots that clients had on hold, upon request from a client that has higher priority. Our system requires that the agent recipient must within a certain time period lock the slot, or the slot will be automatically returned to the resource manager.
In endeavoring to schedule resources, agents communicate with resource managers on servers using query, lock, release, and wait messages, as illustrated in figure 4.
Figure 4: Agents are executed atomically and communicate with their server and the outside world through a receptionist class which provides only non-blocking send and receive methods. Agents run in the address space of the server, but privacy is still possible by use of the Java security manager.
To illustrate our framework, we discuss two applications: the scheduling of specific resources controlled by more than one resource manager, and scheduling by attribute. We give examples of efficient algorithms that could be easily implemented within our framework.
Consider the simple case of scheduling two or more specific resources, each controlled by a different resource manager. One solution is to use local clocks (discussed in section 2 to place on hold each individual desired resource's calendar before scheduling computing time by locking the appropriate slots. Note that has the efficiency problems discussed in section 2, but they will be smaller since we are using the algorithm only to schedule calendars, and the time for this scheduling is significantly less than the time for resource use. Thus, just introducing the calendar metaphor provides substantial savings.
We note that in this algorithm, when one user is reserving time on a given resource, all other users are excluded, while in reality we need mutual exclusion only on individual slots for safety. We can therefore improve on this algorithm's coarse-grained parallelism: for greater efficiency of scheduling, a client can use finer-grained adaptive control to place on hold only a small part of the resource manager's calendar at a time.
Figure 5: This graph shows the cost of polling n resources when only k will be used. Usually an optimal number for n ranges from k to twice k.
As indicated in section 1, we would like to offer resource reservation by attribute. For example, a scientist might want 3 SGIs and 2 Pentiums. This is easily integrated into our existing framework. The ``and'' clause signals resources that must be reserved together, and therefore these can be treated as specific resources themselves. We are thus reduced to a scheduling problem such as ``get 3 SGIs out of the 30 known to my system.'' Formally, we want n out of k homogeneous resources where (k > n). We can use the algorithm for multiple resource managers if we send agents to p out of the k resources, but then pick the ``best'' (or earliest) time at which n out of the p polled resources are available. Our problem then reduces to choosing p. Choosing (p = k) may not always be the best solution, due to message passing costs due to network lag. For this reason, we have formulated a simple mathematical model that allows us to choose the optimal p, and are currently studying better models for the Internet.
In our mathematical model, we define the delay as the expected delay in scheduling a job given the probability q that a given slot would be unavailable. We want to minimize the delay. A further term is added to cover message-passing that is proportional to the number of resources p that we poll. An objective function can be constructed by adding the delay to a term proportional to p. With this, graphs of cost versus p can be plotted, allowing us to find the optimal number of nodes to poll, p. A sample graph is shown in figure 5.
We have investigated generalizable metacomputer allocation algorithms for which desired resources can be specified by attribute only, and for which different resource managers can coordinate in a synchronized fashion. Our model builds on the concepts of resources-as-tokens and the metaphor of calendars for scheduling. To improve efficiency under high network latency, our implementation passes small Java programs as agents for coordination. Our design represents the first step toward the development of a robust scheduling infrastructure, layered above conventional schedulers currently available, for the next generation of virtual supercomputers - constructed from heterogeneous resources distributed over the Internet.
This work was supported under the Caltech Infospheres Project. The Infospheres Project is sponsored by the Air Force Office of Scientific Research under grant AFOSR F49620-94-1-0244, by the CISE directorate of the National Science Foundation under Problem Solving Environments grant CCR-9527130, by the NSF Center for Research on Parallel Computation under Cooperative Agreement Number CCR-9120008, and by Parasoft and Novell.
A General Resource Reservation Framework
for Scientific Computing
This document was generated using the LaTeX2HTML translator Version 96.1 (Feb 5, 1996) Copyright © 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
The command line arguments were:
latex2html -split 0 adam.tex.
The translation was initiated by
on Wed Jun 11 18:09:57 PDT 1997
This paper is also available in postscript format.