Jia Xu, IIIS, Tsinghua University
Abstract: Human language is composed of sequences of meaningful units. These sequences can be words, phrases, sentences, or even whole articles, and they serve as basic elements of communication and as components for computational modeling. Automatically finding sequence boundaries and alignments in bilingual text is a fundamental problem for many natural language processing tasks. Here we evaluate our approach on machine translation and propose a joint word segmentation and alignment model for phrase-based translation. We aim to provide a unified model in which prior knowledge is incorporated through the Chinese restaurant process. A dynamic word lexicon is learned from the training data in a fully automatic and consistent fashion. Furthermore, the model can be extended to the phrasal level using the hierarchical Pitman-Yor process to capture context information. We develop an efficient training method based on Markov chain Monte Carlo (MCMC) sampling and achieve improvements in translation quality.
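For readers unfamiliar with the Chinese restaurant process mentioned above, the following is a minimal generic sketch of its seating rule, not the model described in this abstract. The function names `crp_probabilities` and `sample_crp` and the concentration parameter `alpha` are assumptions introduced only for illustration. Loosely, each table can be thought of as an entry in a growing lexicon: an entry that has been used often is more likely to be reused, while a new entry is created with probability proportional to `alpha`.

```python
import random


def crp_probabilities(counts, alpha):
    """Seating probabilities for the next customer under a CRP.

    counts: number of customers already seated at each existing table.
    alpha:  concentration parameter (must be > 0); illustrative assumption.
    Returns one probability per existing table, plus a final entry
    for opening a new table.
    """
    n = sum(counts)
    probs = [c / (n + alpha) for c in counts]
    probs.append(alpha / (n + alpha))  # probability of a new table
    return probs


def sample_crp(num_customers, alpha, seed=0):
    """Seat customers one by one and return the final table sizes."""
    rng = random.Random(seed)
    counts = []
    for _ in range(num_customers):
        probs = crp_probabilities(counts, alpha)
        table = rng.choices(range(len(probs)), weights=probs)[0]
        if table == len(counts):
            counts.append(1)      # open a new table
        else:
            counts[table] += 1    # join an existing table
    return counts


if __name__ == "__main__":
    print(crp_probabilities([3, 1], alpha=1.0))
    print(sample_crp(20, alpha=1.0))
```

With table sizes 3 and 1 and `alpha = 1`, the next customer joins the first table with probability 3/5, the second with probability 1/5, and opens a new table with probability 1/5. This "rich get richer" behavior is what lets such a prior favor a compact, reusable set of units.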