Question-answering (QA) models sometimes need to retrieve information from tables, which use an entirely different set of semantic cues than free-form text. Historically, most work on table-based QA has concentrated on extracting the contents of a single table cell as the answer to a question.
But sometimes, the questioner needs more context to make sense of an answer, so recent work on table-based QA has investigated the possibility of embedding tabular data in sentences or sequences of sentences. So far, the most successful models have been end-to-end neural models, which take a question and a table as input and output a free-form answer to the question.
At this year’s meeting of the Association for the Advancement of Artificial Intelligence (AAAI), my colleagues and I are presenting a new approach to training table-based, free-form QA models in which the model is pretrained on synthetic data before being fine-tuned on a real QA dataset. We call our model GenTaP, for generation-focused table-based intermediate pretraining.
We pretrain the model on two objectives simultaneously: one is a sentence-style answer to the question, and the other is an answer extracted from a single table cell, often a name or number. In experiments, we compared our model to four previous end-to-end models, on five different metrics, and ours was the top performer across the board, improving on the best prior model by 14%, according to the BLEU metric.
The key to our approach is generating synthetic training data with no human involvement, to make the pretraining pipeline efficient.
To produce the long-form training examples, we identify online documents that include tables. From those documents, we extract sentences that contain at least two cell values that share a row in the table. Then, using a separate machine learning model, we convert the sentences into questions.
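The sentence filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the naive sentence splitter, and the case-insensitive substring matching are all hypothetical choices, not details of our actual pipeline.

```python
import re


def split_sentences(text):
    # Naive splitter on ., ! or ? followed by whitespace (an assumption;
    # a production pipeline would likely use a proper sentence tokenizer).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def matching_rows(sentence, table):
    """Return (row_index, matched_cells) for each row that has at least
    two cell values appearing in the sentence."""
    lowered = sentence.lower()
    hits = []
    for i, row in enumerate(table):
        # Substring matching is a simplification: a short cell such as "1"
        # would also match inside "1998".
        matched = [cell for cell in row if cell and cell.lower() in lowered]
        if len(matched) >= 2:
            hits.append((i, matched))
    return hits


def extract_candidates(text, table):
    """Collect sentences that mention two or more cells from a single row."""
    candidates = []
    for sentence in split_sentences(text):
        for row_index, cells in matching_rows(sentence, table):
            candidates.append(
                {"sentence": sentence, "row": row_index, "cells": cells}
            )
    return candidates
```

Given a document and a table represented as a list of rows of strings, `extract_candidates` keeps only the sentences that are anchored to one row by at least two of its values.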
As input, the question generation model takes a sentence and the corresponding entries from the table. To train the model, we used an existing dataset for training reading comprehension models, which consists of questions and document excerpts that provide the information needed to answer them. We inverted the relationship between inputs and outputs, however, so that the model learns to produce the question rather than to answer it.
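One way to picture the inversion is as a simple remapping of fields in each reading comprehension example. The sketch below is an assumption-laden illustration: the field names (`context`, `question`, `answer`) and the `answer: ... context: ...` input format are hypothetical, and we assume the answer span plays the role that the table entries play at generation time.

```python
def invert_example(example):
    """Turn a reading comprehension example (excerpt + question -> answer)
    into a question generation example (excerpt + answer -> question)."""
    # The "answer: ... context: ..." layout is an illustrative convention,
    # not the format used by any particular dataset or model.
    source = f"answer: {example['answer']} context: {example['context']}"
    return {"input": source, "target": example["question"]}


def invert_dataset(examples):
    # Apply the same inversion to every example in the dataset.
    return [invert_example(ex) for ex in examples]
```

After inversion, the question becomes the training target, so a model trained on the result generates questions instead of answering them.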
The question generator’s outputs give us sets of data triplets — table, question, and answer — that we can use to pretrain the QA system. The tables are converted into strings with special characters separating rows and appended to the questions as inputs. The QA model then learns to predict the answers.
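To make the input format concrete, here is a minimal sketch of flattening a table into a string and appending it to a question. The separator strings (`[ROW]`, `[TABLE]`, and `|`) and the function names are hypothetical; the text above says only that special characters separate the rows.

```python
ROW_SEP = " [ROW] "   # hypothetical row separator
CELL_SEP = " | "      # hypothetical cell separator


def linearize_table(header, rows):
    """Flatten a table into one string, header first, one segment per row."""
    lines = [CELL_SEP.join(header)]
    lines += [CELL_SEP.join(str(cell) for cell in row) for row in rows]
    return ROW_SEP.join(lines)


def build_input(question, header, rows):
    """Append the flattened table to the question to form the model input."""
    return f"{question} [TABLE] {linearize_table(header, rows)}"
```

For example, `build_input("Who won in 1903?", ["Name", "Year"], [["Marie Curie", 1903]])` yields `Who won in 1903? [TABLE] Name | Year [ROW] Marie Curie | 1903`.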
In addition to long-form answers, we also train the model on automatically generated question-answer pairs in which each answer consists of a single cell value from the table. We generate these pairs using a simple grammar — a set of phrase and sentence templates that randomly sample data from the tables to produce new sentences.
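A template-based generator of this kind might look like the sketch below. The two templates, the choice of the first column as the row's identifying key, and the function name are illustrative assumptions rather than the grammar we actually used.

```python
import random

# Hypothetical question templates; a real grammar would be richer.
TEMPLATES = [
    "What is the {target} of {key_value}?",
    "Which {target} corresponds to {key_value}?",
]


def generate_short_form(header, rows, rng, key_column=0):
    """Sample one (question, answer) pair whose answer is a single cell value."""
    row = rng.choice(rows)
    # Pick any column other than the key column as the answer column.
    target_idx = rng.choice([i for i in range(len(header)) if i != key_column])
    template = rng.choice(TEMPLATES)
    question = template.format(
        target=header[target_idx].lower(), key_value=row[key_column]
    )
    return question, str(row[target_idx])
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible, which is convenient when regenerating a pretraining set.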
During pretraining, we use equal numbers of long-form and short-form examples. The idea is that the long-form targets improve the coherence of the QA model’s outputs, while the short-form targets improve its factual accuracy. Our experiments showed that omitting the short-form targets during pretraining does slightly diminish the model’s performance on the test set.
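Balancing the two kinds of examples can be sketched as follows. Downsampling the larger pool to the size of the smaller one is our assumption for this illustration; the text above specifies only that the counts are equal.

```python
import random


def mix_equal(long_form, short_form, rng):
    """Build a shuffled pretraining set with equal numbers of each type."""
    # Assumption: downsample the larger pool to match the smaller one.
    n = min(len(long_form), len(short_form))
    mixed = rng.sample(long_form, n) + rng.sample(short_form, n)
    rng.shuffle(mixed)
    return mixed
```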
The model itself has an encoder-decoder architecture with two decoders, one for each of the two output objectives.
After pretraining our model on the synthetic data, we ran two experiments on it using a hand-annotated QA dataset. In the first, we simply tested the pretrained model on the dataset’s test examples, without further fine-tuning — a zero-shot experiment. In the second experiment, we first fine-tuned our model on the dataset’s training set and then retested it.
As benchmarks, we used four models based on the T5 language model and a fifth model based on the BART language model. We used five different evaluation metrics: BLEU, which measures n-gram overlap between the model’s output and the target output in the hand-annotated dataset; three ROUGE metrics (ROUGE-1, ROUGE-2, and ROUGE-L), which measure overlap in single words, word pairs, and longest common subsequences, respectively; and METEOR, which factors in synonyms and shared roots when assessing sentence matches.
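To give a feel for what the overlap metrics compute, here is a simplified sketch of two members of the ROUGE family: one based on shared single words and one based on the longest common subsequence. It assumes whitespace tokenization, lowercasing, and a plain F1 score; standard evaluation toolkits add stemming and other refinements, so their numbers will differ from this sketch.

```python
from collections import Counter


def _f1(overlap, n_candidate, n_reference):
    # Harmonic mean of precision and recall; zero when nothing overlaps.
    if overlap == 0:
        return 0.0
    precision = overlap / n_candidate
    recall = overlap / n_reference
    return 2 * precision * recall / (precision + recall)


def rouge_1(candidate, reference):
    """Unigram-overlap F1 between a candidate and a reference sentence."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return _f1(overlap, len(cand), len(ref))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Longest-common-subsequence F1 between candidate and reference."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return _f1(lcs_length(cand, ref), len(cand), len(ref))
```

Because the subsequence-based score rewards words that appear in the same order, it is more sensitive to sentence structure than the single-word score, which ignores order entirely.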
Our model performed best across the board, with a BLEU score improvement of 14% over the second-best model (the one based on BART) and improvements of 5% to 10% on the other four metrics.
Our zero-shot model outperformed the benchmark built atop the small version of the T5 language model — even though the T5 benchmark was trained on the dataset’s full training set. And the zero-shot model fell just a little short of the benchmark built atop the base T5 model (also trained on the full training set).
We also tested our pretrained model on a different task: generating domain-specific sentences (not question answers) based on tabular data, with limited numbers of training examples (50 to 500). On that task, our model outperformed two benchmarks based on the GPT language model, indicating that our approach may adapt well to other applications.