Assignment 2 (IBM Model 1)
Alex Fraser
Sample data and pseudo-code from Philipp Koehn's book.
Implement the EM algorithm for IBM Model 1 in your favorite programming language or a package like Matlab. You can work in pairs or alone.
Either: 1) start with the small toy set and then work on your choice of German/English or French/English (these are .tar.gz files)
or 2) PREFERRED: use your 10 correct sentences from assignment 1 as the toy set, and then work on data for the same language pair as you used for assignment 1. Download the data from the Joshua Indian corpora page (google: joshua indian corpora). Ideally, use only a small number of short sentences at first, so that things run quickly.
Your program should output two different things:
- A table containing the word translation probabilities that were learned (note: later you should think of an efficient data structure for such a sparse matrix)
- The most likely alignment (the Viterbi alignment) for each sentence pair in the training data. If you have Java, you can use the format of the TestAlign8 tool. You may also want to concatenate those sentences to the beginning of the data you are working with; after you align them, you can take the first 20 lines and look at them in the tool (put them in the "hyp" file; they will be shown in red). A sketch of how the Viterbi alignment can be extracted follows this list.
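For the Viterbi alignment output, note that under Model 1 each English word can be linked independently to the foreign word that maximizes t(e|f), so no search over whole alignments is needed. The following is a minimal Python sketch, not part of the assignment specification: it assumes the translation table is a plain dict keyed by (e, f) word pairs (one simple sparse representation, since only co-occurring pairs are stored) and links written in "i-j" format; adapt it to whatever structures and output format you use.

def viterbi_alignment(e_sent, f_sent, t):
    """Model 1 Viterbi alignment: link each English word e_j to the foreign
    word f_i with the highest t(e_j|f_i); ties go to the first such word."""
    links = []
    for j, e in enumerate(e_sent):
        best_i = max(range(len(f_sent)), key=lambda i: t.get((e, f_sent[i]), 0.0))
        links.append((best_i, j))
    return links

# Example: one output line per sentence pair, links written as "i-j".
# print(" ".join(f"{i}-{j}" for i, j in viterbi_alignment(e_sent, f_sent, t)))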
Pseudo-code of EM for IBM Model 1:
initialize t(e|f) uniformly
do until convergence
  set count(e|f) to 0 for all e,f
  set total(f) to 0 for all f
  for all sentence pairs (e_s, f_s)
    set total_s(e) = 0 for all e
    for all words e in e_s
      for all words f in f_s
        total_s(e) += t(e|f)
    for all words e in e_s
      for all words f in f_s
        count(e|f) += t(e|f) / total_s(e)
        total(f) += t(e|f) / total_s(e)
  for all f
    for all e
      t(e|f) = count(e|f) / total(f)
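As a cross-check against the pseudo-code, here is one way it could be turned into Python (a sketch under the same simplifications as the pseudo-code: no NULL word, and a fixed number of iterations standing in for a convergence test). The (e, f)-keyed dict for t and the corpus format, a list of (e_sentence, f_sentence) token-list pairs, are choices made for this sketch, not requirements.

from collections import defaultdict

def train_model1(corpus, iterations=10):
    """corpus: list of (e_sentence, f_sentence) pairs, each a list of tokens."""
    # Initialize t(e|f) uniformly for every co-occurring (e, f) pair.
    e_vocab = {e for e_s, _ in corpus for e in e_s}
    t = {}
    for e_s, f_s in corpus:
        for e in e_s:
            for f in f_s:
                t[(e, f)] = 1.0 / len(e_vocab)

    for _ in range(iterations):          # stand-in for "do until convergence"
        count = defaultdict(float)       # count(e|f)
        total = defaultdict(float)       # total(f)
        for e_s, f_s in corpus:
            # compute normalization: total_s(e) = sum over f in f_s of t(e|f)
            total_s = {e: sum(t[(e, f)] for f in f_s) for e in e_s}
            # collect fractional counts
            for e in e_s:
                for f in f_s:
                    c = t[(e, f)] / total_s[e]
                    count[(e, f)] += c
                    total[f] += c
        # re-estimate: t(e|f) = count(e|f) / total(f)
        t = {(e, f): count[(e, f)] / total[f] for (e, f) in count}
    return t

With this representation, the table you print for the first output is simply the final t dict.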
Basic Exercise
Before you implement Model 1, start by convincing yourself that the very simple estimation you get by running the main loop of the pseudo-code once gives the same results as explicitly enumerating the alignments on slide 41 (the slide where we calculated counts by enumerating each of the four alignment functions). To do this, start with the t values on slide 41 and apply them to just the pair of two-word sentences on that slide. Please turn this in.
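If it helps, the enumeration side of this check can be scripted as below. The sentence pair and starting t values in this sketch are placeholders, since slide 41's numbers are not reproduced in this handout; substitute the actual values from the slide. The resulting counts should match what one pass of the pseudo-code's main loop produces for the same sentence pair.

from itertools import product

# Placeholder sentence pair and starting values; replace with slide 41's.
t = {("the", "das"): 0.25, ("house", "das"): 0.25,
     ("the", "Haus"): 0.25, ("house", "Haus"): 0.25}
e_s = ["the", "house"]
f_s = ["das", "Haus"]

# Enumerate every alignment function a: English position j -> foreign position a[j].
alignments = list(product(range(len(f_s)), repeat=len(e_s)))

# Unnormalized probability of each alignment: product over j of t(e_j | f_{a[j]}).
p_a = [1.0] * len(alignments)
for k, a in enumerate(alignments):
    for j, e in enumerate(e_s):
        p_a[k] *= t[(e, f_s[a[j]])]
z = sum(p_a)

# Expected counts: each alignment contributes its posterior p_a/z to every
# (e, f) pair it links.
counts = {}
for k, a in enumerate(alignments):
    for j, e in enumerate(e_s):
        pair = (e, f_s[a[j]])
        counts[pair] = counts.get(pair, 0.0) + p_a[k] / z

print(counts)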
Basic Questions
- Word alignments are usually calculated over lowercased data. Compare your alignments on mixed-case versus lowercased data. Do you see an improvement in the first part of the corpus? Give three examples of changes and explain why they occurred.
- How are non-compositional phrases aligned? Have you seen any problems? Give three examples.
- Generate an alignment in the opposite direction (i.e., swap the English and SAR-language/French/German (Other Language) files and generate another alignment). Judging from the first part of the corpus, does one direction seem to work better for you?
- What are the longest English and Other Language tokens? Look at the alignment of the longest English token, and then at the longest Other Language token. Are they aligned well? Why?
Advanced Questions (Optional)
- What data structure did you use to implement the sparse matrix, and why?
- Are cognates at the beginning of the corpus aligned correctly? How could we force cognates to be aligned correctly? (Warning, this is a trick question)
- Implement union and intersection of the two alignments you have generated (see the sketch after this list). What are the differences between them? Consider the longest tokens again: is there an improvement?
- The Porter stemmer is a simple, widely available tool for reducing English morphology, e.g., mapping the plural and singular forms of a word to a single token (among other transformations). Download a Porter stemmer package from the web and apply it to the English side. Then compare an alignment with Porter stemming to one without. What are the differences?
- Calculate p(e|f) for the first three sentences.
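Two of these questions have short mechanical cores, sketched below under the conventions used earlier in this handout (alignments as sets of (i, j) link pairs, t as a dict keyed by (e, f) word pairs); these are sketches, not required interfaces. For p(e|f), Model 1 without a NULL word gives p(e|f) = epsilon / l_f^l_e * prod_j sum_i t(e_j|f_i), where l_e and l_f are the sentence lengths and epsilon is a length normalization constant; with a NULL word the denominator becomes (l_f + 1)^l_e.

def union_alignment(a1, a2):
    """Union of two alignments, each given as a set of (i, j) link pairs.
    Links from the reverse-direction alignment must be flipped to (i, j) first."""
    return a1 | a2

def intersect_alignment(a1, a2):
    """Intersection of two alignments given as sets of (i, j) link pairs."""
    return a1 & a2

def sentence_prob(e_sent, f_sent, t, epsilon=1.0):
    """Model 1 p(e|f) without a NULL word:
    epsilon / l_f**l_e * product over j of (sum over i of t(e_j | f_i))."""
    p = epsilon / (len(f_sent) ** len(e_sent))
    for e in e_sent:
        p *= sum(t.get((e, f), 0.0) for f in f_sent)
    return p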