Machine translation quality estimation (MTQE) is the task of predicting the quality of a machine translation output without access to a human reference. MTQE is useful for various applications, such as selecting the best translation among the outputs of different systems, filtering out low-quality translations, or deciding whether human post-editing is needed.
In this blog post, we will explain the main challenges and methods of MTQE, as well as some of the evaluation metrics and datasets that are commonly used in this field. We will also discuss some of the recent advances and open problems in MTQE research.
MTQE Challenges and Methods
One of the main challenges of MTQE is that quality is a subjective and multidimensional concept that depends on factors such as fluency, adequacy, style, and terminology. Moreover, quality may vary depending on the purpose and the target audience of the translation. Therefore, defining and measuring quality is not a trivial task, and different MTQE methods may adopt different definitions and assumptions.
Another challenge of MTQE is that it requires a large amount of annotated data, i.e., translations with quality labels or scores. However, obtaining such data is costly and time-consuming, as it involves human experts or crowd workers. Moreover, the annotations may be noisy or inconsistent, as different annotators may have different criteria or preferences. Therefore, MTQE methods should be able to deal with data scarcity and noise.
A third challenge of MTQE is that it should handle different types and levels of granularity of quality estimation. For example, MTQE can be performed at the word level, the sentence level, or the document level. Moreover, MTQE can provide different types of outputs, such as binary labels (good/bad), numerical scores (e.g., 0-100), or error types (e.g., word order, spelling, or grammar). Therefore, MTQE methods should be flexible and adaptable to different scenarios and user needs.
There are two main types of MTQE methods: model-based and feature-based. Model-based methods use machine learning models, such as neural networks or regression models, to learn a direct mapping from the source text and its translation to a quality score or label. Feature-based methods first represent the source-translation pair with hand-crafted or automatically extracted features, such as lexical, syntactic, semantic, or cross-lingual features, and then apply a classifier or regressor to predict the quality score or label.
Both types of methods have advantages and disadvantages. Model-based methods can capture complex and non-linear patterns in the data, but they require a large amount of annotated data and computational resources. Feature-based methods can leverage linguistic knowledge and external resources, but they require domain expertise and feature engineering. Moreover, both types of methods may suffer from data sparsity or domain mismatch issues when applied to new languages or domains.
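To make the feature-based approach concrete, here is a minimal sketch using scikit-learn. The features, the toy training pairs, and the choice of a ridge regressor are all illustrative assumptions rather than a recommended recipe; a real system would use richer features and thousands of annotated pairs.

```python
# Minimal feature-based QE sketch (illustrative features and toy data only).
import numpy as np
from sklearn.linear_model import Ridge

def extract_features(source: str, translation: str) -> list:
    """Represent a source-translation pair as a small feature vector."""
    src_tokens = source.split()
    mt_tokens = translation.split()
    return [
        len(src_tokens),                            # source length
        len(mt_tokens),                             # translation length
        len(mt_tokens) / max(len(src_tokens), 1),   # length ratio
        sum(len(t) for t in mt_tokens) / max(len(mt_tokens), 1),  # avg token length
    ]

# Hypothetical annotated pairs: (source, MT output, human quality score in [0, 1]).
train = [
    ("the cat sat on the mat", "die Katze sass auf der Matte", 0.9),
    ("he kicked the bucket", "er trat den Eimer", 0.3),   # idiom rendered literally
    ("good morning", "guten Morgen", 1.0),
]
X = np.array([extract_features(s, t) for s, t, _ in train])
y = np.array([score for _, _, score in train])

regressor = Ridge().fit(X, y)
print(regressor.predict([extract_features("the dog barked", "der Hund bellte")]))
```

Classic feature-based toolkits such as QuEst++ use many more such indicators; the point here is only the pipeline shape: features in, score out.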
MTQE Evaluation Metrics and Datasets
To evaluate the performance of MTQE methods, several metrics and datasets have been proposed. The most common metrics are listed below, followed by a short sketch that computes all four:
– Pearson correlation coefficient (PCC): measures the linear relationship between the predicted and the reference quality scores.
– Mean absolute error (MAE): measures the average absolute difference between the predicted and the reference quality scores.
– F1-score: measures the harmonic mean of precision and recall for binary classification tasks.
– Matthews correlation coefficient (MCC): measures the correlation between the predicted and the reference binary labels.
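As a quick illustration of these metrics, the sketch below computes all four on a handful of made-up scores using scipy and scikit-learn; the threshold used to binarize the scores into good/bad labels is an arbitrary assumption.

```python
# Computing the four common QE evaluation metrics on toy data.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import f1_score, matthews_corrcoef, mean_absolute_error

# Made-up sentence-level quality scores in [0, 100].
reference = np.array([80, 45, 92, 30, 67])
predicted = np.array([75, 50, 88, 40, 60])

pcc, _ = pearsonr(predicted, reference)          # linear correlation
mae = mean_absolute_error(reference, predicted)  # average absolute difference

# Binarize with an arbitrary threshold to get good/bad labels.
ref_labels = (reference >= 50).astype(int)
pred_labels = (predicted >= 50).astype(int)
f1 = f1_score(ref_labels, pred_labels)
mcc = matthews_corrcoef(ref_labels, pred_labels)

print(f"PCC={pcc:.3f}  MAE={mae:.1f}  F1={f1:.3f}  MCC={mcc:.3f}")
```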
The most common datasets are:
– WMT Quality Estimation shared tasks: a series of annual tasks organized by the Workshop on Machine Translation (WMT) since 2012. The tasks provide data for different languages, domains, granularities, and output types of MTQE. The data consist of machine translation outputs annotated by humans, for example with post-edits, quality labels, or quality scores.
– QT21 Quality Estimation dataset: a dataset created by the QT21 project, which focuses on improving machine translation for challenging languages and domains. The dataset contains English-German translations from the IT domain with human annotations at the word and sentence levels.
– APE-QUEST dataset: a dataset created by the APE-QUEST project, which aims to combine automatic post-editing (APE) and quality estimation (QE) for machine translation. The dataset contains translations from English into Dutch, French, and Portuguese in the legal domain, with human post-edits and quality annotations.
Recent Advances and Open Problems
In recent years, MTQE research has made significant progress thanks to the development of new methods, models, features, resources, and applications. Some of the recent advances are:
– Neural network models: deep learning models, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), or transformers, have been widely used for MTQE tasks, as they can learn powerful representations from raw data without feature engineering. Moreover, neural network models can be combined with other techniques, such as attention mechanisms, multi-task learning, or transfer learning, to enhance their performance (a minimal sketch of a transformer-based QE model follows this list).
– Cross-lingual embeddings: word embeddings that map words from different languages into a common vector space have been used as features for MTQE tasks, as they can capture semantic and syntactic similarities and differences across languages. Moreover, cross-lingual embeddings can be used to perform zero-shot or few-shot MTQE, i.e., to estimate the quality of translations for languages or domains that have little or no annotated data (see the embedding-similarity sketch after this list).
– Quality estimation as a reinforcement learning problem: quality estimation can be formulated as a reinforcement learning problem, where the agent is the MTQE model, the state is the translation input and output, the action is the predicted quality score or label, and the reward is the agreement with the human annotation. This formulation can enable the MTQE model to learn from its own feedback and to adapt to dynamic environments (a toy sketch appears after this list).
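To ground the neural-network bullet above, here is a minimal sentence-level QE model: a pretrained multilingual encoder with a small regression head on top. The encoder name, the pooling choice, and the overall shape are illustrative assumptions in the spirit of predictor-estimator style systems, not a specific published architecture.

```python
# Sketch of a transformer-based sentence-level QE regressor.
# Assumes the HuggingFace transformers and torch libraries.
import torch
from transformers import AutoModel, AutoTokenizer

class SentenceQE(torch.nn.Module):
    def __init__(self, encoder_name: str = "xlm-roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, **inputs):
        # Encode source and translation jointly; use the first token's
        # representation as a pooled summary of the pair.
        hidden = self.encoder(**inputs).last_hidden_state[:, 0]
        return self.head(hidden).squeeze(-1)  # one quality score per pair

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = SentenceQE()

batch = tokenizer(
    ["the cat sat on the mat"], ["die Katze sass auf der Matte"],
    return_tensors="pt", padding=True, truncation=True,
)
scores = model(**batch)  # train against human scores with e.g. an MSE loss
```

A real system would fine-tune the whole model on annotated (source, translation, score) triples; here the head is randomly initialized, so the printed scores are meaningless until training.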
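The cross-lingual embeddings bullet can be illustrated with an embedding-similarity signal: if the source and its translation land far apart in a shared multilingual vector space, quality is suspect. The sketch below assumes the sentence-transformers library and one particular multilingual model; both are illustrative choices, and this signal alone is a weak, zero-shot proxy for quality rather than a full QE system.

```python
# Zero-shot QE signal from cross-lingual sentence embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def qe_similarity(source: str, translation: str) -> float:
    """Cosine similarity between source and translation embeddings."""
    src, mt = model.encode([source, translation])
    return float(np.dot(src, mt) / (np.linalg.norm(src) * np.linalg.norm(mt)))

# An adequate translation should score higher than an unrelated sentence.
print(qe_similarity("the cat sat on the mat", "die Katze sass auf der Matte"))
print(qe_similarity("the cat sat on the mat", "der Aktienmarkt fiel heute"))
```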
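Finally, a toy sketch of the reinforcement learning view: a sigmoid policy predicts a binary quality label (the action) from a feature vector (the state) and is rewarded when it agrees with the human annotation, with parameters updated via REINFORCE. The synthetic data, features, and hyperparameters are all made up for illustration.

```python
# Toy REINFORCE sketch of the RL formulation of QE.
import numpy as np

rng = np.random.default_rng(0)

# States: tiny feature vectors standing in for (source, translation) pairs;
# the human annotation is the label the agent is rewarded for matching.
features = rng.normal(size=(100, 4))
human_labels = (features @ np.array([1.0, -2.0, 0.5, 0.0]) > 0).astype(int)

w = np.zeros(4)   # parameters of the policy P(action = "good" | state)
lr = 0.1

for epoch in range(50):
    for x, y in zip(features, human_labels):
        p_good = 1.0 / (1.0 + np.exp(-w @ x))    # sigmoid policy
        action = int(rng.random() < p_good)      # sample a quality label
        reward = 1.0 if action == y else -1.0    # agreement with annotation
        # REINFORCE update: reward * grad log pi(action | state)
        w += lr * reward * (action - p_good) * x

greedy = (features @ w > 0).astype(int)          # deterministic policy
print(f"agreement with human labels: {np.mean(greedy == human_labels):.2f}")
```

On this separable toy data the policy converges to near-perfect agreement; the interest of the formulation lies in settings where rewards arrive online and the data distribution shifts.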
However, there are still many open problems and challenges in MTQE research, such as:
– How to define and measure quality in a more comprehensive and consistent way, taking into account different dimensions, perspectives, and contexts of quality?
– How to obtain more and better annotated data for MTQE tasks, using methods such as active learning, crowdsourcing, or gamification?
– How to improve the generalization and robustness of MTQE models, using methods such as domain adaptation, data augmentation, or adversarial training?
– How to integrate MTQE with other machine translation tasks, such as automatic post-editing, machine translation evaluation, or machine translation ranking?
– How to make MTQE more explainable and interpretable, providing feedback and suggestions to the users or the translators on how to improve the quality of the translations?
Conclusion
Machine translation quality estimation is an important and challenging task that aims to predict the quality of a machine translation output without access to a human reference. MTQE has various applications and benefits for machine translation users and developers. However, MTQE also faces many difficulties and limitations, such as data scarcity, annotation noise, and the subjectivity, multidimensionality, and varying granularity of quality. Therefore, MTQE research requires more effort and collaboration across fields and disciplines to overcome these challenges and advance the state of the art.