Expiration Timers¶
Expiration timers are set up on the same process as the agent and used from your script to deal with stuck workers. When you go into a code-block that has the potential to get stuck you can acquire an expiration timer, which instructs the timer server to kill the process if it does not release the timer by the self-imposed expiration deadline.
Usage:
import torchelastic.timer as timer
import torchelastic.agent.server as agent
def main():
start_method = "spawn"
message_queue = mp.get_context(start_method).Queue()
server = timer.LocalTimerServer(message, max_interval=0.01)
server.start() # non-blocking
spec = WorkerSpec(
fn=trainer_func,
args=(message_queue,),
...<OTHER_PARAMS...>)
agent = agent.LocalElasticAgent(spec, start_method)
agent.run()
def trainer_func(message_queue):
timer.configure(timer.LocalTimerClient(message_queue))
with timer.expires(after=60): # 60 second expiry
# do some work
In the example above if trainer_func
takes more than 60 seconds to
complete, then the worker process is killed and the agent retries the worker group.
Client Methods¶
-
torch.distributed.elastic.timer.
configure
(timer_client)[source]¶ Configures a timer client. Must be called before using
expires
.
-
torch.distributed.elastic.timer.
expires
(after, scope=None, client=None)[source]¶ Acquires a countdown timer that expires in
after
seconds from now, unless the code-block that it wraps is finished within the timeframe. When the timer expires, this worker is eligible to be reaped. The exact meaning of “reaped” depends on the client implementation. In most cases, reaping means to terminate the worker process. Note that the worker is NOT guaranteed to be reaped at exactlytime.now() + after
, but rather the worker is “eligible” for being reaped and theTimerServer
that the client talks to will ultimately make the decision when and how to reap the workers with expired timers.Usage:
torch.distributed.elastic.timer.configure(LocalTimerClient()) with expires(after=10): torch.distributed.all_reduce(...)
Server/Client Implementations¶
Below are the timer server and client pairs that are provided by torchelastic.
Note
Timer server and clients always have to be implemented and used in pairs since there is a messaging protocol between the server and client.
-
class
torch.distributed.elastic.timer.
LocalTimerServer
(mp_queue, max_interval=60, daemon=True)[source]¶ Server that works with
LocalTimerClient
. Clients are expected to be subprocesses to the parent process that is running this server. Each host in the job is expected to start its own timer server locally and each server instance manages timers for local workers (running on processes on the same host).
-
class
torch.distributed.elastic.timer.
LocalTimerClient
(mp_queue)[source]¶ Client side of
LocalTimerServer
. This client is meant to be used on the same host that theLocalTimerServer
is running on and uses pid to uniquely identify a worker. This is particularly useful in situations where one spawns a subprocess (trainer) per GPU on a host with multiple GPU devices.
Writing a custom timer server/client¶
To write your own timer server and client extend the
torch.distributed.elastic.timer.TimerServer
for the server and
torch.distributed.elastic.timer.TimerClient
for the client. The
TimerRequest
object is used to pass messages between
the server and client.
-
class
torch.distributed.elastic.timer.
TimerRequest
(worker_id, scope_id, expiration_time)[source]¶ Data object representing a countdown timer acquisition and release that is used between the
TimerClient
andTimerServer
. A negativeexpiration_time
should be interpreted as a “release” request.Note
the type of
worker_id
is implementation specific. It is whatever the TimerServer and TimerClient implementations have on to uniquely identify a worker.
-
class
torch.distributed.elastic.timer.
TimerServer
(request_queue, max_interval, daemon=True)[source]¶ Entity that monitors active timers and expires them in a timely fashion. This server is responsible for reaping workers that have expired timers.
-
class
torch.distributed.elastic.timer.
TimerClient
[source]¶ Client library to acquire and release countdown timers by communicating with the TimerServer.