Distributionally Robust Optimization For Language Modeling
Background
Datasets for training language models (LMs) are typically sampled from a mixture of many domains. For example, the Pile, a large publicly available dataset, is composed of 24% web data...