Simulator Interface#
The simulator takes the form of a cooperative multi-agent system [11] implemented using the OpenAI Gym and Petting Zoo interfaces. The environment is advanced by calling the environment's step method:
obs, rewards, dones, infos = env.step(actions)
This method receives actions for all agents from the algorithm, and updates the internal state of the environment.
It then returns an observation of the environment (which is customizable by the participants) and a reward.
The variables obs, rewards, dones, infos, and actions are all dictionaries indexed by agent ID.
A list of agent IDs can be obtained from env.agents.
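As a brief illustration, the snippet below sketches a single call to step and shows how the returned dictionaries are indexed by agent ID. Here env is assumed to be an already constructed environment instance, and build_action is a placeholder for whatever logic the participant's policy uses to produce an action.

```python
# Sketch only: `env` is an already-constructed environment instance and
# `build_action` is a placeholder for the participant's policy logic.
obs = env.reset()

# One action per agent, keyed by agent ID.
actions = {agent_id: build_action(obs[agent_id]) for agent_id in env.agents}

obs, rewards, dones, infos = env.step(actions)

# Every returned value is a dictionary indexed by agent ID.
for agent_id in env.agents:
    print(agent_id, rewards[agent_id], dones[agent_id])
```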
Action Space#
The simulator incorporates aspects of the OpenAI Gym and Petting Zoo “Parallel Environment” reinforcement learning interfaces.
The action space is defined in Table 4. This space adheres to the OpenAI Gym space standard, and incorporates a custom-defined “List” space.
The actions are designed such that the agent does not need to provide fine-grained control of the airplane, but rather can provide a list of “orders” for the airplane to follow at each airport.
For example, by issuing a single action, the agent can order an airplane to (1) load cargo, (2) unload cargo, (3) set priority and (4) take off for a destination when done processing.
At subsequent steps, the airplane state machine will execute the action automatically, and the agent need not issue a new action until the airplane reaches its next destination (or a change in the environment requires an updated action plan).
As the state machine proceeds, it will automatically update the current action (e.g. emptying the cargo_to_load list once the cargo has started loading).
The airplane's priority affects the processing queue. The priority can be updated at every time step, but cannot be updated after an agent enters the queue for processing; once processing is complete, the priority can be updated again. The number of priorities allowed in the action space is equal to the total number of agents within the scenario.
| Key | Model Definition | Space Type | Description | Valid values from observation |
|---|---|---|---|---|
| priority | \(\priority{p}\) | | Integer indicating the priority of the airplane. Affects the ordering of the queue for processing. | 0 = Do not process. 1 = Highest priority. num_agents in scenario = Lowest priority. |
| cargo_to_load | \(\cargotoload{p}\) | | Cargo IDs to load onto the plane when processing. | Choose any subset of the cargo at the current airport (see the observation). |
| cargo_to_unload | \(\cargotounload{p}\) | | Cargo IDs to unload from the plane when processing. | Choose any subset of the cargo onboard the airplane (see the observation). |
| destination | \(destination\) | | ID of the airport the airplane will travel to next. | Choose a single airport ID from the airports reachable from the current airport (see the observation). |
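For example, a single order of the kind described above might be assembled as a dictionary along the following lines. This is a sketch: the key names mirror the model definitions in Table 4 but should be verified against the environment code, and the cargo and airport IDs are placeholders.

```python
# Hypothetical action for one airplane: load cargo 3 and 4, unload cargo 7,
# process with priority 1, then fly to airport 2 once processing is done.
# Key names and IDs are illustrative assumptions based on Table 4.
action = {
    "priority": 1,          # 1 = highest priority, 0 = do not process
    "cargo_to_load": [3, 4],
    "cargo_to_unload": [7],
    "destination": 2,
}

# Actions for all agents are submitted together, keyed by agent ID.
actions = {agent_id: action for agent_id in env.agents}
obs, rewards, dones, infos = env.step(actions)
```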
Observation Space#
Table 5 shows the observation space itself.
The observation contains information specific to an agent, whereas the “globalstate” entry contains information global to all agents (including the observations of all other agents).
Note that the “globalstate” value is the same as that returned by the environment's state() method.
Table 6 and Table 7 describe subspaces that are used in the observation, and we use a custom space named DiGraph which represents a NetworkX directed graph.
| Key | Model Definition | Space Type | Description |
|---|---|---|---|
| | \(\currentairport{p}{t}\) | | ID of the airport where the airplane is currently located, or an indicator value if the airplane is in flight. |
| | \(\CargoOnPlane{p}{t}\) | | IDs of cargo that are onboard the airplane. |
| | \(\weightcapacity{p}\) | | Maximum cargo weight that can be carried by the airplane. |
| | \(\sum_{c \in \CargoOnPlane{p}{t}} \cargoweight{c}\) | | Current cargo weight carried by the airplane. |
| | \(\CargoAtAirport{\currentairport{p}{t}}{t}\) | | IDs of cargo stored at the current airport. |
| | N/A | | Contains the agent state. See: list of airplane states and state machine definition. |
| | \(\{ a_2 \in \Airports \mid (\currentairport{p}{t}, a_2) \in \AvailableRoutes{t} \}\) | | IDs of airports that can be reached from the current airport. Routes which are disabled are not included in this list. |
| globalstate | N/A | State Space | The global state of the environment. |
| | N/A | Action Space (see Table 4) | Contains the last action issued by the agent. Entries will be modified by the state machine as transitions occur. |
| | N/A | | Contains information about the scenario. At the moment this only includes processing time. |
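The relationship between the per-agent "globalstate" entry and the environment's state() method can be sketched as follows, assuming obs is the observation dictionary returned by reset or step:

```python
# Sketch: each agent's "globalstate" observation entry carries the same
# information as the value returned by the environment's state() method.
global_state = env.state()

agent_id = env.agents[0]
agent_view = obs[agent_id]                 # observation specific to this agent
shared_view = agent_view["globalstate"]    # global information shared by all agents
```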
The state space uses the following named tuple to describe cargo:
| Key | Model Definition | Space Type | Description |
|---|---|---|---|
| | N/A | | ID of the cargo. |
| | \(\cargoloc{c}\) | | ID of the airport where the cargo is located, or an indicator value if the cargo is in transit. |
| | \(\cargodest{c}\) | | ID of the destination airport where the cargo is to be delivered. |
| | \(\cargoweight{c}\) | | Weight of the cargo. |
| | \(\softdeadline{c}\) | | Soft deadline for delivery. This is the target delivery time. |
| | \(\harddeadline{c}\) | | Hard deadline to deliver the cargo by. Considered missed if not met. |
| Key | Space Type | Description |
|---|---|---|
| | | List of active cargo (cargo not delivered or missed). |
| | | List of new cargo which was generated and added to the active cargo list in the last step. |
| | Dictionary of agent observations | Contains the observation for each agent. |
| | | Route map which contains information about airports and routes. This is a NetworkX DiGraph class. |
| Key | Space Type | Description |
|---|---|---|
| | | Processing time for all airports. |
Reward#
The reward signal is shown in Table 9, and follows directly from the objective function defined in (4).
| Condition | Reward Value |
|---|---|
| When cargo item \(c \in \Cargos\) becomes missed | \(-\alpha\) |
| For each time step where cargo item \(c \in \Cargos\) is late | \(-\beta\) |
| Each time step during which an airplane is in flight | \(-\gamma\) |
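Putting Table 9 together, the penalty accrued over a single time step can be sketched as below. This is an illustrative reconstruction, not the simulator's implementation; the predicate helpers and the way penalties are attributed to agents are assumptions.

```python
# Illustrative reconstruction of the per-step reward implied by Table 9.
# `became_missed`, `is_late`, and `in_flight` are assumed helper predicates.
def step_reward(cargo_items, airplane, alpha, beta, gamma):
    penalty = 0.0
    for cargo in cargo_items:
        if became_missed(cargo):   # hard deadline passed during this step
            penalty += alpha
        elif is_late(cargo):       # past the soft deadline but not yet missed
            penalty += beta
    if in_flight(airplane):        # cost of each step spent flying
        penalty += gamma
    return -penalty                # reward is the negative accumulated penalty
```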
Info Dictionary#
The step method returns an info dictionary with an entry for each agent.
The value for each agent is a dictionary with the following entries:
| Key | Space Type | Description |
|---|---|---|
| | | List of warnings resulting from the action (for example, trying to load cargo that is not at the current airport). |
| | | (Only returned by the local evaluator env_step method) Indicates whether there was a timeout during this step. This will occur when the solution policy takes too long to issue the next action. |
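As an illustration, the warnings might be surfaced after each step roughly as in the sketch below. The "warnings" key name is an assumption (the exact key is not listed above) and should be checked against the environment source.

```python
# Sketch of inspecting the per-agent info dictionaries after a step.
# The "warnings" key name is an assumption, not confirmed by the table above.
obs, rewards, dones, infos = env.step(actions)

for agent_id, info in infos.items():
    for warning in info.get("warnings", []):
        print(f"Agent {agent_id}: {warning}")
```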
Stopping criteria#
The episode ends when all cargo is either (1) delivered or (2) missed, and when there is no more cargo to be generated by the dynamic cargo generator.
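In terms of the interface, a rollout can simply continue until every agent reports done, as in the sketch below (again treating build_action as a placeholder for the policy, and assuming the episode end is signalled through the per-agent dones flags):

```python
# Sketch of running a full episode until the stopping criteria are met.
# `build_action` is a placeholder; episode end is assumed to be signalled
# through the per-agent dones flags returned by step.
obs = env.reset()
episode_over = False
while not episode_over:
    actions = {agent_id: build_action(obs[agent_id]) for agent_id in env.agents}
    obs, rewards, dones, infos = env.step(actions)
    episode_over = all(dones.values())
```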