Differential privacy for deep learning at GPT scale

Differential privacy (DP) is a formal guarantee that attackers can’t learn whether any given data sample was or was not used to train a machine learning model. Employing DP in deep learning typically means capping the contribution that each training sample makes to the model’s parameter adjustments, an approach known as per-sample gradient clipping.
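
To make the clipping mechanism concrete, here is a minimal sketch of one DP-SGD step with per-sample clipping, written in PyTorch. It is an illustration under assumptions, not an optimized implementation: the model, data, clipping threshold C, noise multiplier sigma, and learning rate are all placeholder values.

```python
import torch
from torch import nn

def dp_sgd_step(model, loss_fn, xs, ys, lr=0.1, C=1.0, sigma=1.0):
    """One DP-SGD step: clip each sample's gradient to norm at most C,
    sum the clipped gradients, add Gaussian noise, and update."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(xs, ys):  # explicit per-sample loop, slow but clear
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(C / (norm + 1e-12), max=1.0)  # cap this sample's contribution
        for s, g in zip(summed, grads):
            s.add_(scale * g)
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = sigma * C * torch.randn_like(s)  # noise calibrated to the clip norm
            p.add_(-(lr / len(xs)) * (s + noise))

# Toy usage
model = nn.Linear(4, 2)
xs, ys = torch.randn(8, 4), torch.randint(0, 2, (8,))
dp_sgd_step(model, nn.CrossEntropyLoss(), xs, ys)
```

The per-sample loop is exactly what makes DP training slow: instead of one backward pass per batch, each sample's gradient must be computed and clipped individually (or tracked with per-sample instrumentation that costs extra memory).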

Per-sample gradient clipping, however, makes deep learning much more time-consuming than it would be otherwise, impeding the development of large DP models, such as models at the scale of the GPT language models, with billions of parameters.

In 2022, in workshops at the International Conference on Machine Learning (ICML) and the Conference on Neural Information Processing Systems (NeurIPS), we presented two papers that advance DP deep learning. In the first paper, “Automatic clipping: Differentially private deep learning made easier and stronger”, we propose an automatic method that makes the gradient-clipping process roughly five to ten times more efficient.

Typically, gradient clipping involves an expensive ablation study to select a threshold above which a data sample’s contribution to the model’s parameter adjustments is cut off, or clipped. Our approach instead uses normalization, eliminating the need to tune a clipping threshold altogether.

In the second paper, “Differentially private bias-term only fine-tuning of foundation models” (DP-BiTFiT), which won the Best Paper Award at the NeurIPS Workshop on Trustworthy and Socially Responsible Machine Learning (TSRML), we study a novel parameter-efficient fine-tuning method that can make DP learning 2 to 30 times faster, while reducing memory use by 50% to 88% and incurring only 1/1,000 the communication cost in distributed environments.

Generally speaking, a neural network has two components: the weights, which constitute more than 99% of the parameters and encode most of the information learned from training data (and which, consequently, consume most of the memory and computation time), and the biases, which shift (offset) the model’s output. We show that privately fine-tuning the biases alone is enough to achieve high accuracy under DP constraints.

Together, these two techniques have made fine-tuning a DP-GPT2 even cheaper than fully fine-tuning a standard GPT2. We have made both methods publicly available, to encourage researchers to experiment with and benefit from faster DP deep learning.

Automatic clipping

The deep-learning process is always governed by a tunable hyperparameter called the learning rate, which controls the training samples’ aggregate contribution to the model’s parameter adjustments. The per-sample gradient clipping threshold, by contrast, is, as its name implies, a per-sample cap. The existing approach to DP training requires an ablation study to simultaneously tune the clipping threshold and the learning rate. If five different clipping thresholds are evaluated, this makes the model’s hyperparameter tuning five times as time-consuming.
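
To see why the costs multiply, note that the ablation is effectively a grid search over both hyperparameters at once. In the hypothetical sketch below, train_and_eval stands in for a full DP training run, and all the grid values are illustrative:

```python
def train_and_eval(lr, clip_threshold):
    # Hypothetical stand-in for an expensive end-to-end DP training job.
    ...

learning_rates = [0.01, 0.05, 0.1, 0.5, 1.0]  # illustrative values
clip_thresholds = [0.1, 0.5, 1.0, 5.0, 10.0]  # five thresholds -> 5x the runs

results = {(lr, C): train_and_eval(lr, C)
           for lr in learning_rates
           for C in clip_thresholds}  # 25 training runs instead of 5
```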

Two sample ablation studies, considering five different learning rates and five different per-sample gradient clipping thresholds. The different patterns of results illustrate the need to tune both hyperparameters simultaneously.

We propose automatic clipping, which replaces per-sample gradient clipping with gradient normalization, so as to (1) eliminate the clipping threshold, (2) enlarge the small gradients that thresholded clipping would leave untouched, and (3) do both in a way that provably optimizes performance. Specifically, our work is the first to introduce a nontrivial smoothing constant into the clipping, a requirement for proving the convergence of DP optimizers. Moreover, we show that our new DP-SGD has the same asymptotic convergence rate as standard SGD, even in the nonconvex-optimization setting in which deep-learning optimization takes place.
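
Concretely, automatic clipping rescales each per-sample gradient by 1 / (||g_i|| + gamma) instead of clamping its norm at a threshold. Here is a minimal sketch, assuming the per-sample gradients have already been computed; the value of gamma is illustrative:

```python
import torch

def automatic_clip(per_sample_grads, gamma=0.01):
    """Rescale each sample's gradients by 1 / (||g_i|| + gamma)."""
    normalized = []
    for grads in per_sample_grads:  # one list of gradient tensors per sample
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = 1.0 / (norm + gamma)  # no clipping threshold C to tune
        normalized.append([scale * g for g in grads])
    return normalized
```

Unlike thresholded clipping, which leaves small gradients untouched, this scales them up toward unit norm, which is the "enlarge the small gradients" effect described above; gamma is the smoothing constant that makes the convergence proof go through.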

We further evaluate our method from multiple perspectives. We show that automatic clipping achieves state-of-the-art DP accuracy across 11 datasets covering multiple computer-vision and language tasks. We also find that using our method to train a DP-GPT2-large model requires changing only one line of code in the existing library. Finally, we show that our method matches prior approaches in both the strength of its privacy guarantees and its algorithmic efficiency.

Test performance on E2E dataset with GPT2.

DP-BiTFiT

Our differentially private bias-term fine-tuning (DP-BiTFiT) is a unique approach that bridges the efficiency gap between DP and standard deep learning. The first advantage of DP-BiTFiT is that it’s model-agnostic: we can apply it to any model by simply freezing all the weights during fine-tuning and updating only the bias terms. In sharp contrast, existing alternatives such as low-rank adaptation (LoRA) and adapters are applicable exclusively to transformers and involve extra tuning of the adaptation layers’ weights.
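
A minimal PyTorch sketch of that freezing step follows (the name-based check on "bias" is a common convention used here for illustration, not our exact implementation):

```python
import torch
from torch import nn

def prepare_bitfit(model: nn.Module):
    """Freeze all weights; leave only bias terms trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = "bias" in name
    return [p for p in model.parameters() if p.requires_grad]

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(prepare_bitfit(model), lr=0.1)  # biases only
```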

Parameter efficiency of DP-BiTFiT.

The second advantage of DP-BiTFiT is its amazing parameter efficiency: we show that, across a range of foundation models, the bias terms constitute only around 0.1% of model parameters. This means that DP-BiTFiT provides large efficiency improvements in terms of training time, memory footprint, and, in the distributed-learning setting, communication cost.
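
That figure is easy to check for any PyTorch model with a short hypothetical helper like the one below; the toy block is for illustration only:

```python
from torch import nn

def bias_fraction(model: nn.Module) -> float:
    """Fraction of a model's parameters that are bias terms."""
    total = sum(p.numel() for p in model.parameters())
    bias = sum(p.numel() for n, p in model.named_parameters() if "bias" in n)
    return bias / total

# A single transformer-style MLP block already lands well under 1%.
block = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
print(f"bias fraction: {bias_fraction(block):.4%}")
```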

The third advantage of DP-BiTFiT is its computational edge over other parameter-efficient approaches, such as DP-LoRA. Even when both approaches fine-tune roughly the same number of parameters, DP-BiTFiT still enjoys a great advantage in memory savings, because, unlike approaches that compute weight gradients, it does not need to store and access activation tensors, a resource-intensive operation. In our paper, we verify this rigorously through the chain rule of back-propagation.
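
The intuition can be read off a single linear layer with output s = aW + b. Back-propagation gives dL/dW = a^T (dL/ds), which requires the stored activation a, while dL/db is computed from the output gradient alone. A small illustrative sketch (ours, not the paper's derivation code):

```python
import torch

B, d_in, d_out = 4, 8, 3
a = torch.randn(B, d_in)         # layer input (activation)
grad_s = torch.randn(B, d_out)   # dL/ds flowing back from later layers

grad_W = a.t() @ grad_s          # weight gradient: must store/access `a`
grad_b = grad_s.sum(dim=0)       # bias gradient: no activation needed
per_sample_grad_b = grad_s       # row i is sample i's bias gradient, for free
```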

The computation graph of back-propagation (black), with the modifications made by three different DP procedures (red). Because DP-BiTFiT (lower right) modifies only the model biases, it incurs far less computational overhead than prior approaches (GhostClip, left, and Opacus, top right).

Empirically, we observe a substantial boost in efficiency when switching from DP full fine-tuning to DP-BiTFiT, while still maintaining state-of-the-art accuracy on large foundation models such as GPT2-large, ResNet-152, RoBERTa-large, and vision transformers.

Maximum throughput and batch size by different fine-tuning methods. The left two plots are for the E2E dataset and three different sizes of GPT2 model; the right two are for a dataset of 50,000 images and three different sizes of ResNet model.

Acknowledgements: Sheng Zha, George Karypis




