Competition#

Evaluation#

The evaluation consists of a series of Tests that become progressively more difficult. Within each Test, a series of Levels increases the difficulty within the parameters of the Test. The Tests proceed until one of the following stopping criteria is met:

  • The percentage of missed deliveries, averaged over all Levels within a Test, exceeds a threshold of 30%.

  • An overall evaluation time limit of two hours is reached.

  • All Tests are completed.
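The stopping logic above can be sketched as follows. This is an illustrative sketch, not the competition harness: the function and parameter names are hypothetical, while the 30% threshold and two-hour limit come from the text.

```python
# Hypothetical sketch of the evaluation's stopping criteria.
MISSED_THRESHOLD = 0.30            # from the text: 30% missed deliveries
TIME_LIMIT_SECONDS = 2 * 60 * 60   # from the text: two-hour overall limit

def should_stop(missed_rates, elapsed_seconds, tests_remaining):
    """Return True if any stopping criterion is met after a Test.

    missed_rates: missed-delivery fraction for each Level in the Test.
    """
    # Criterion 1: missed deliveries, averaged over Levels, exceed 30%.
    avg_missed = sum(missed_rates) / len(missed_rates)
    if avg_missed > MISSED_THRESHOLD:
        return True
    # Criterion 2: overall evaluation time limit reached.
    if elapsed_seconds >= TIME_LIMIT_SECONDS:
        return True
    # Criterion 3: all Tests completed.
    if tests_remaining == 0:
        return True
    return False
```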

Scenarios#

Generative Models#

  • Map. We use a randomly generated environment as the basis for each episode. As bodies of water introduce constraints on where airports may be placed, they can become an important factor in air routing problems. For this reason, we generate maps using Perlin noise, which is commonly used in the creation of procedurally generated games [12].

  • Airports. Airports are placed uniformly at random within the map. A pickup zone is placed on the left side of the map and indicates where cargo are generated, whereas a dropoff zone is defined on the right side of the map indicating where the cargo must be dropped off.

  • Routes. Routes are generated based on the maximum range of aircraft. All routes are bidirectional. To limit the degree of the route network, only a subset of the possible routes is placed. Two airplane types are defined: a slower long-range aircraft with a larger cargo-carrying capacity, and a short-range aircraft with a smaller capacity. The long-range aircraft is limited in its ability to land at airports in the small pickup and dropoff zones, while short-range aircraft are only able to travel within a zone and to airports adjacent to the zones. In general, this means that a single airplane cannot complete a delivery on its own: a successful delivery can require loading the cargo onto up to three airplanes.

  • Cargo. Cargo are generated in the pickup zone uniformly at random. Each cargo's delivery deadline is scaled in relation to the shortest path between its location and destination, as well as the average shortest-path distance between all airports (to account for pickup time).

  • Dynamic Events.

    • Routes are randomly taken offline according to a Poisson process. The duration that a route remains offline is generated uniformly at random within a given range, which can vary across routes. The agent is able to observe both the expected duration for each route and the end time of the event once the route goes offline. See Table 7.

    • New cargo is generated during the episode according to a Poisson process.
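The deadline scaling described under "Cargo" can be sketched as below. This is a hypothetical illustration, not the competition's actual formula: the `slack` multiplier and the additive combination are assumptions, while the two distance terms come from the text.

```python
# Hypothetical sketch of cargo deadline scaling: the deadline grows with
# the cargo's own shortest-path distance, plus an allowance based on the
# network-wide average shortest path (to account for pickup time).
def delivery_deadline(shortest_path, avg_shortest_path, slack=2.0):
    """slack is an illustrative multiplier, not a competition parameter."""
    return slack * (shortest_path + avg_shortest_path)
```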
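Both dynamic events are driven by Poisson processes, which can be sampled by drawing exponentially distributed inter-arrival times. The sketch below illustrates this for route outages; the function name, parameters, and uniform duration range are assumptions, not the competition's actual generator.

```python
import random

def sample_outage_events(rate_per_step, horizon, duration_range, rng=None):
    """Sample route-outage events as a Poisson process (a sketch; the
    competition's actual generator and parameter names may differ).

    rate_per_step:  expected number of outages per time step (lambda).
    horizon:        number of time steps in the episode.
    duration_range: (low, high) bounds for the uniform outage duration.
    """
    rng = rng or random.Random()
    events = []
    t = 0.0
    while True:
        # Exponential inter-arrival times yield a Poisson process.
        t += rng.expovariate(rate_per_step)
        if t >= horizon:
            break
        duration = rng.uniform(*duration_range)
        events.append((t, duration))
    return events
```

The same construction applies to new-cargo generation, with the event payload being a cargo item rather than an outage duration.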

Scoring#

First, we assign an episode score. We use a large weight \(\missedcoeff\) to give missed deliveries the largest penalty, a smaller weight \(\latenesscoeff\) to penalize lateness, and \(\totalflightcostcoeff\) to penalize flight cost. The objective is then to minimize the episode score

(6)#\[\episodescore = \missedcoeff * \missed + \latenesscoeff * \scaledlateness + \totalflightcostcoeff * \scaledflightcost\]
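Equation (6) can be written directly in code. The coefficient values below are placeholders chosen only to reflect the relative ordering described above (missed deliveries penalized most, flight cost least); the actual weights are set by the organizers.

```python
# Episode score from Eq. (6). Coefficient values are illustrative
# placeholders, not the competition's actual weights.
MISSED_COEFF = 10.0       # largest penalty: missed deliveries
LATENESS_COEFF = 1.0      # smaller penalty: lateness
FLIGHT_COST_COEFF = 0.1   # smallest penalty: flight cost

def episode_score(missed, scaled_lateness, scaled_flight_cost):
    """Lower is better: the agent's objective is to minimize this."""
    return (MISSED_COEFF * missed
            + LATENESS_COEFF * scaled_lateness
            + FLIGHT_COST_COEFF * scaled_flight_cost)
```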

To determine an overall score for the submission, we find it useful to normalize against baseline scores and produce a score that increases with improved performance. For each scenario, we calculate episode scores \(\randomscore\) and \(\baselinescore\) for the Random Agent and the “shortest path” baseline algorithm, respectively. Then, given an episode score for a solution, we calculate a normalized score as

\[\normalizedescore = \frac{\randomscore - \episodescore} {\randomscore - \baselinescore}\]

A solution will receive a normalized score of 0 if it only performs as well as a random agent, and will receive a score of 1 if it performs as well as a simple “shortest path” baseline algorithm. Scores greater than 1 indicate that the algorithm is exceeding the performance of the baselines. Note that it is possible to obtain negative normalized scores if the agent performs worse than the random agent.

Finally, an overall score is calculated by summing the normalized score over all tests/levels.

\[\overallscore = \sum_{i = 1}^\textrm{# of completed tests} \sum_{\textrm{levels in test } i} \normalizedescore(i)\]
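The normalization and summation above can be sketched as follows (function names are illustrative; the score values in the usage notes below are made up for demonstration).

```python
def normalized_score(episode, random_score, baseline_score):
    """0 at random-agent level, 1 at baseline level, >1 when beating the
    baseline, negative when worse than the random agent."""
    return (random_score - episode) / (random_score - baseline_score)

def overall_score(normalized_scores):
    """Sum of normalized scores over all completed tests/levels."""
    return sum(normalized_scores)
```

For example, with a random-agent score of 100 and a baseline score of 50, an episode score of 30 normalizes to 1.4 (better than the baseline), while an episode score of 120 normalizes to a negative value (worse than random).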

Time Limits#

We impose a time limit on the algorithm for each time step (10 seconds), allowing extra time during the first step for pre-planning (10 minutes). We will ignore any action for a step where a timeout occurs (proceeding to operate based on the previous action). Subsequent steps will proceed normally.
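The timeout rule above can be sketched as follows. The function and argument names are hypothetical (the evaluation harness's internals are not specified here); the 10-second and 10-minute limits come from the text.

```python
# Hypothetical sketch of per-step timeout handling: a timed-out step
# keeps the previous action, and subsequent steps proceed normally.
STEP_TIMEOUT = 10.0          # seconds allowed per step
FIRST_STEP_TIMEOUT = 600.0   # 10 minutes for pre-planning on step 0

def resolve_action(step_index, new_action, previous_action, elapsed):
    limit = FIRST_STEP_TIMEOUT if step_index == 0 else STEP_TIMEOUT
    if elapsed > limit:
        # Timeout: ignore the late action and keep operating on the
        # previous one.
        return previous_action
    return new_action
```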

An overall time limit of 2 hours will be placed on each submission run. If the evaluation on the entire scenario set exceeds this time, the evaluation will be halted. The score will be calculated based on the tests run up to that point.