Recent algorithmic advances and hardware innovations have made it possible to train deep neural networks with billions of parameters. The networks’ performance, however, depends in part on hyperparameters such as the learning rate and the number and width of network layers.
Tuning hyperparameters is difficult and time-consuming, even for experts, and criteria like latency or cost often play a role in deciding the winning hyperparameter configuration. To make the latest deep-learning technology practical for nonexperts, it is essential to automate hyperparameter tuning.
At the first International Conference on Automated Machine Learning (AutoML), we presented Syne Tune, an open-source library for large-scale hyperparameter optimization (HPO) with an emphasis on enabling reproducible machine learning research. It simplifies, standardizes, and accelerates the evaluation of a wide variety of HPO algorithms.
These algorithms are implemented on top of common modules and aim to remove implementation bias to enable fair comparisons. By supporting different execution backends, the library also enables researchers and engineers to effortlessly move from simulation and small-scale experimentation to large-scale distributed tuning on the cloud.
In this post, we will give an overview of the execution backends supported in Syne Tune and benchmark state-of-the-art asynchronous HPO algorithms, including transfer learning baselines.
Supported execution backends
Syne Tune provides a general interface for backends and three implementations: one to evaluate trials on a local machine, one to evaluate trials in the cloud, and one to simulate tuning with tabulated benchmarks to reduce run time. Switching between different backends is a matter of simply passing a different trial_backend parameter to the tuner, as shown in the code examples below. The backend API has been kept lean on purpose, and adding new backends requires little effort.
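To make the idea of a swappable backend concrete, here is a minimal sketch in plain Python. It is not Syne Tune's actual API: the names TrialBackend, InProcessBackend, TableBackend, and tune are hypothetical stand-ins, and the "tuner" simply evaluates a fixed list of candidates. The point is only that the tuning loop stays the same while the object passed as trial_backend changes.

```python
from abc import ABC, abstractmethod


class TrialBackend(ABC):
    """Hypothetical minimal backend interface (an assumption, not Syne Tune's API)."""

    @abstractmethod
    def evaluate(self, config):
        """Run one trial for `config` and return its validation error."""


class InProcessBackend(TrialBackend):
    """Evaluates a trial by calling a Python function directly."""

    def __init__(self, objective):
        self.objective = objective

    def evaluate(self, config):
        return self.objective(config)


class TableBackend(TrialBackend):
    """Looks up a precomputed result instead of training anything."""

    def __init__(self, table):
        self.table = table

    def evaluate(self, config):
        return self.table[tuple(sorted(config.items()))]


def tune(trial_backend, candidates):
    """Toy tuning loop: evaluate every candidate and return the best one."""
    results = [(trial_backend.evaluate(config), config) for config in candidates]
    return min(results, key=lambda result: result[0])[1]


candidates = [{"lr": 0.1}, {"lr": 0.01}]
live = InProcessBackend(lambda config: abs(config["lr"] - 0.01))
tabulated = TableBackend({(("lr", 0.1),): 0.09, (("lr", 0.01),): 0.0})

# The tuning code is identical; only the backend argument differs.
best_live = tune(trial_backend=live, candidates=candidates)
best_tabulated = tune(trial_backend=tabulated, candidates=candidates)
```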
The local backend evaluates trials concurrently on a single machine by using subprocesses. We support rotating multiple GPUs on the machine, assigning the next trial to the least busy GPU (i.e., the GPU with the fewest trials currently running). Trial checkpoints and logs are stored to local files.
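The least-busy-GPU rule can be sketched in a few lines. This is an illustration under simple assumptions, not the library's implementation: the function names are hypothetical, ties are broken by the lowest GPU index, and trials never finish during the assignment loop.

```python
def least_busy_gpu(trials_per_gpu):
    """Return the GPU index with the fewest running trials (lowest index on ties)."""
    return min(trials_per_gpu, key=lambda gpu: (trials_per_gpu[gpu], gpu))


def assign_trials(num_gpus, trial_ids):
    """Assign each trial, in order, to whichever GPU is least busy at that moment."""
    running = {gpu: 0 for gpu in range(num_gpus)}
    assignment = {}
    for trial_id in trial_ids:
        gpu = least_busy_gpu(running)
        running[gpu] += 1
        assignment[trial_id] = gpu
    return assignment


assignment = assign_trials(num_gpus=2, trial_ids=[0, 1, 2])
```

In a real backend the counters would also be decremented when a trial stops or completes, so that freed GPUs become candidates again.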
Running on a single machine limits the number of trials that can run concurrently. Moreover, neural-network training may require multi-GPU devices or many GPUs, possibly distributed across several nodes. For those use cases, we provide an Amazon SageMaker backend that can run multiple trials in parallel.
A growing number of tabulated benchmarks are available for HPO and neural-architecture-search (NAS) research. The simulation backend allows the execution of realistic experiments with such benchmarks on a single CPU instance, spending real time only on decision-making.
To this end, we use a timekeeper to manage simulated time and a priority queue of time-stamped events (e.g., reporting metric values for running trials), which work together to ensure that interactions between trials and the scheduler happen in the right order, whatever the experimental setup may be. The simulator correctly handles any number of workers, and delay due to model-based decision-making is taken into account.
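The interplay of timekeeper and event queue can be illustrated with a small discrete-event simulation. This is a sketch under stated assumptions rather than the library's simulator: SimulatedClock and simulate are hypothetical names, per-epoch training times are assumed to come from a tabulated benchmark, every trial has at least one epoch, and the scheduler's decision time is modeled as a fixed delay added to simulated time whenever a new trial is started.

```python
import heapq
import itertools
from collections import deque


class SimulatedClock:
    """Timekeeper: tracks simulated time, which never moves backwards."""

    def __init__(self):
        self.now = 0.0

    def advance_to(self, timestamp):
        self.now = max(self.now, timestamp)

    def advance_by(self, delay):
        self.now += delay


def simulate(epoch_times, n_workers, decision_delay=0.0):
    """Replay trials on `n_workers` workers without actually training.

    epoch_times[i] lists the tabulated per-epoch durations of trial i.
    Returns the (timestamp, trial_id, epoch) reports in processing order.
    """
    clock = SimulatedClock()
    events = []  # heap of (timestamp, tie_breaker, trial_id, epoch)
    tie_breaker = itertools.count()
    pending = deque(range(len(epoch_times)))
    processed = []

    def start_next_trial():
        trial_id = pending.popleft()
        clock.advance_by(decision_delay)  # time the scheduler spent deciding
        timestamp = clock.now
        for epoch, duration in enumerate(epoch_times[trial_id], start=1):
            timestamp += duration
            heapq.heappush(events, (timestamp, next(tie_breaker), trial_id, epoch))

    for _ in range(min(n_workers, len(pending))):
        start_next_trial()

    while events:
        timestamp, _, trial_id, epoch = heapq.heappop(events)
        clock.advance_to(timestamp)
        processed.append((timestamp, trial_id, epoch))
        finished = epoch == len(epoch_times[trial_id])
        if finished and pending:
            start_next_trial()  # the freed worker picks up the next trial
    return processed


reports = simulate([[1.0, 1.0], [0.5, 0.5], [2.0]], n_workers=2)
```

Because the heap always yields the earliest event first, reports reach the scheduler in simulated-time order no matter how many workers are configured.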
Comparing asynchronous tuning algorithms
Syne Tune provides implementations of a broad range of synchronous and asynchronous HPO algorithms. In our experiments, we consider single-fidelity HPO algorithms, which require entire training runs to evaluate a candidate hyperparameter configuration. Random search (RS), regularized evolution for architecture search (REA), and Bayesian-optimization variants fall into this category; the variants we consider are Gaussian-process-based (GP) and density-ratio-based (BORE), of which TPE is a special case.
We also consider multi-fidelity HPO algorithms, which stop unpromising training runs early. The median stopping rule (MSR), asynchronous successive halving (ASHA), and asynchronous Bayesian-optimization variants (e.g., BOHB and MOB) are prominent examples.
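The two early-stopping ideas can be sketched as follows. These are simplified illustrations, not the implementations in the library: the function names are hypothetical, lower metric values are assumed to be better, the median rule follows the common formulation that compares a trial's best value so far with the median of other trials' running averages, and the successive-halving check is a stopping-style variant that keeps the top 1/eta of the results recorded at a rung so far.

```python
import statistics


def median_rule_should_stop(curve, other_curves):
    """Median stopping rule (simplified).

    curve: validation errors of the running trial, one per completed epoch.
    other_curves: error curves of other trials.
    Stop if the trial's best error so far is worse than the median of the
    other trials' running averages up to the same epoch.
    """
    step = len(curve)
    running_averages = [
        statistics.mean(other[:step]) for other in other_curves if len(other) >= step
    ]
    if not running_averages:
        return False  # nothing to compare against yet
    return min(curve) > statistics.median(running_averages)


def rung_levels(min_epochs, max_epochs, eta=3):
    """Epochs at which successive halving compares trials: r, r*eta, r*eta^2, ..."""
    levels = []
    level = min_epochs
    while level < max_epochs:
        levels.append(level)
        level *= eta
    return levels


def asha_should_continue(error, rung_errors, eta=3):
    """Asynchronous successive halving, stopping-style check at one rung.

    rung_errors: errors recorded at this rung by whichever trials have reached
    it so far. The trial continues only if it is in the top 1/eta.
    """
    all_errors = sorted(rung_errors + [error])
    keep = max(1, len(all_errors) // eta)
    return error <= all_errors[keep - 1]


others = [[0.5, 0.4, 0.3], [0.6, 0.5, 0.4], [0.7, 0.6, 0.5]]
stop_weak_trial = median_rule_should_stop([0.9, 0.8], others)
levels = rung_levels(min_epochs=1, max_epochs=27, eta=3)
```

The asynchronous character shows in asha_should_continue: the decision uses only the results that have already arrived at the rung, so no worker waits for a full cohort of trials to finish.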
The table below shows the normalized rank, averaged over wall-clock time, of these single- and multi-fidelity optimizers on three publicly available neural-architecture-search benchmarks: FCNet, from Klein and Hutter (2019); NAS201, from Dong and Yang (2020); and LCBench, from Zimmer et al. (2021).
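As a rough illustration of the metric, the sketch below computes a normalized rank averaged over time for a handful of methods. The exact procedure behind the reported numbers is not spelled out here, so the details are assumptions: each method is described by its best error so far on a shared time grid, ranks are scaled to lie between 0 (best) and 1 (worst), ties share the average rank, and the error values are made up purely for illustration.

```python
def normalized_ranks(values):
    """Rank methods at one time point; 0.0 is best, 1.0 is worst, ties share a rank."""
    count = len(values)
    ranks = {}
    for method, value in values.items():
        better = sum(1 for other in values.values() if other < value)
        tied = sum(1 for other in values.values() if other == value)  # includes itself
        rank = better + (tied - 1) / 2  # zero-based average rank
        ranks[method] = rank / (count - 1) if count > 1 else 0.0
    return ranks


def average_normalized_rank(curves):
    """Average each method's normalized rank over all points of a shared time grid.

    curves: dict mapping method name to its best-error-so-far at each time point.
    """
    num_points = len(next(iter(curves.values())))
    totals = {method: 0.0 for method in curves}
    for t in range(num_points):
        ranks = normalized_ranks({method: curve[t] for method, curve in curves.items()})
        for method in curves:
            totals[method] += ranks[method]
    return {method: total / num_points for method, total in totals.items()}


# Made-up curves, for illustration only (not results from the benchmarks).
curves = {
    "RS": [0.5, 0.4, 0.3, 0.3],
    "ASHA": [0.4, 0.3, 0.3, 0.2],
    "MSR": [0.6, 0.5, 0.4, 0.4],
}
scores = average_normalized_rank(curves)
```

A full comparison would additionally average these scores over random seeds and benchmark tasks.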
Multi-fidelity algorithms are in general superior to single-fidelity algorithms, which is expected, as they make more efficient use of the computational resources available to them. These results are also consistent with previous results reported in the literature. Notably, among the multi-fidelity algorithms, MSR is the only one not using successive halving, and it performs worst.
The table also shows the average normalized rank of transfer learning approaches. Hyperparameter transfer learning uses evaluation data from past HPO tasks in order to warmstart the current HPO task, which can result in significant speed-ups in practice.
Syne Tune supports transfer-learning-based HPO via an abstraction that maps a scheduler and transfer learning data to a warmstarted instance of the former. We consider the bounding-box and quantile-based ASHA, respectively referred to as ASHA-BB and ASHA-CTS. We also consider a zero-shot approach (ZS), which greedily selects hyperparameter configurations that complement previously considered ones, based on historical performances; and RUSH, which warmstarts ASHA with the best configurations found for previous tasks. As expected, we find that transfer learning approaches accelerate HPO.
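The bounding-box idea can be sketched as a function that shrinks the search space before a scheduler is created. This is a minimal sketch under assumptions, not the library's abstraction: bounding_box is a hypothetical name, all hyperparameters are numeric with (low, high) ranges, lower errors are better, and the past evaluations shown are invented for illustration.

```python
def bounding_box(config_space, past_evaluations, top_k=1):
    """Restrict each numeric range to the box spanned by the best past configurations.

    config_space: dict mapping hyperparameter name to (low, high).
    past_evaluations: dict mapping task name to a list of (config, error) pairs.
    top_k: number of best configurations taken from each past task.
    """
    best_configs = []
    for evaluations in past_evaluations.values():
        ranked = sorted(evaluations, key=lambda pair: pair[1])
        best_configs.extend(config for config, _ in ranked[:top_k])

    restricted = {}
    for name, (low, high) in config_space.items():
        values = [config[name] for config in best_configs]
        # Clip to the original range in case past data falls outside of it.
        restricted[name] = (max(low, min(values)), min(high, max(values)))
    return restricted


config_space = {"lr": (1e-5, 1.0), "layers": (1, 10)}
# Invented evaluations from two earlier tasks, for illustration only.
past_evaluations = {
    "task_a": [({"lr": 0.01, "layers": 4}, 0.2), ({"lr": 0.5, "layers": 9}, 0.6)],
    "task_b": [({"lr": 0.1, "layers": 2}, 0.3), ({"lr": 1e-4, "layers": 8}, 0.5)],
}
restricted_space = bounding_box(config_space, past_evaluations)
```

Any scheduler, such as ASHA, could then be run on the restricted space, which is what makes this form of transfer learning easy to combine with existing optimizers.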
Our experiments show that Syne Tune makes research on automated machine learning more efficient, reliable, and trustworthy. By making simulation on tabulated benchmarks a first-class citizen, it makes hyperparameter optimization accessible to researchers without massive computation budgets. By supporting advanced use cases, such as hyperparameter transfer learning, it allows better problem solving in practice.
To learn more about the library and contribute to it, please check out the paper and the documentation in our GitHub repo. We just released version 0.3, with new HPO algorithms, new benchmarks, TensorBoard visualization, and more.