Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (2020)

August 5, 2023

The authors of Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer performed experiments with the Text-to-Text Transfer Transformer (T5), a unified framework for NLP.
The basic idea underlying T5 is to treat various NLP problems as taking text as input and producing new text as output.
Their goal is not to propose new methods but to explore general language learning abilities.
They are interested in exploring the limits of transfer learning for NLP by scaling up models and data sets beyond what has previously been considered.
To perform experiments at this scale, they created the Colossal Clean Crawled Corpus (C4), a dataset of hundreds of gigabytes of clean English text scraped from the web.
C4 is derived from Common Crawl: they applied heuristics to clean up the web-extracted text and used langdetect to filter out any pages that were not classified as English.
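The full cleaning pipeline is described in the paper; the sketch below, assuming the langdetect Python package, only illustrates the language-filtering step. The helper name looks_english is an illustrative choice, and min_prob=0.99 is meant to mirror the high-confidence English filter the paper describes.

```python
# Minimal sketch of English-language filtering with langdetect.
from langdetect import detect_langs
from langdetect.lang_detect_exception import LangDetectException

def looks_english(page_text, min_prob=0.99):
    """Keep a page only if langdetect classifies it as English with high confidence."""
    try:
        guesses = detect_langs(page_text)  # list of (language, probability) guesses
    except LangDetectException:
        return False  # e.g. pages with no detectable linguistic content
    return any(g.lang == "en" and g.prob >= min_prob for g in guesses)

print(looks_english("Thank you for inviting me to your party last week."))  # True
```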
To train a single model on a diverse set of downstream tasks, including machine translation, question answering, abstractive summarization, and text classification, they cast all tasks into a “text-to-text” format: a task-specific text prefix is added to the original input sequence before it is fed to the model. For example, to ask the model to translate a sentence from English to German, the input sequence becomes “translate English to German: That is good.” and the model is trained to output “Das ist gut.”
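As an illustration of this interface, here is a minimal sketch assuming the Hugging Face transformers library and the public t5-small checkpoint (neither is prescribed by the paper itself); the task is selected purely by the text prefix:

```python
# Minimal sketch of the text-to-text interface with a task prefix.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The same model handles translation, summarization, etc.; only the prefix changes.
inputs = tokenizer("translate English to German: That is good.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # expected: "Das ist gut."
```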
They compared three pre-training objectives. The first is prefix language modeling, which splits the text into two components: one used as input to the model and the other as a target sequence to be predicted. The second is the “masked language modeling” (MLM) objective used in BERT, which takes a span of text and corrupts 15% of the tokens. In the encoder-decoder case, they use the entire uncorrupted sequence as the target, while their baseline objective uses only the corrupted tokens as targets. The third is a deshuffling objective, which takes a sequence of tokens, shuffles it, and trains the model to reconstruct the original sequence. The MLM-style denoising objective performed best.
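As a concrete illustration of the baseline denoising objective (corrupting tokens and using only the corrupted tokens as targets), here is a toy Python sketch; the function name, sentinel format, and span-merging details are simplifications for illustration, not the paper's actual preprocessing code:

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, seed=0):
    """Toy denoising objective: drop ~15% of tokens, replace each dropped span in the
    input with a sentinel, and emit the dropped spans (with sentinels) as the target."""
    rng = random.Random(seed)
    n_drop = max(1, int(len(tokens) * corruption_rate))
    dropped = set(rng.sample(range(len(tokens)), n_drop))
    inputs, targets, sentinel = [], [], 0
    i = 0
    while i < len(tokens):
        if i in dropped:
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            while i < len(tokens) and i in dropped:  # merge consecutive drops into one span
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

inp, tgt = span_corrupt("Thank you for inviting me to your party last week .".split())
print(inp)  # corrupted input with sentinel placeholders
print(tgt)  # target: the dropped tokens, delimited by the same sentinels
```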
The models they study in the experiments are roughly equivalent to the original Transformer, except that they remove the Layer Norm bias, place the layer normalization outside the residual path, and use a different position embedding scheme: while the original Transformer used fixed sinusoidal embeddings for each position, they use relative position embeddings.
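For the simplified layer normalization (rescaling only, with no additive bias and no mean subtraction), here is a minimal PyTorch sketch; the class name T5StyleLayerNorm is just an illustrative label:

```python
import torch
import torch.nn as nn

class T5StyleLayerNorm(nn.Module):
    """Layer norm variant that only rescales activations: no bias, no mean subtraction."""
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))  # learned scale only
        self.eps = eps

    def forward(self, x):
        # Normalize by the root mean square of the activations along the hidden dimension.
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)

x = torch.randn(2, 5, 8)             # (batch, sequence, hidden)
print(T5StyleLayerNorm(8)(x).shape)  # torch.Size([2, 5, 8])
```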
Training a larger model and using an ensemble of models both yielded better results than a single small model. A larger model trained for fewer steps often outperformed a smaller model trained on more data. Ensembling models that were fine-tuned from the same base pre-trained model performed worse than pre-training and fine-tuning all models separately.