There are also Ubuntu packages of mkcls and GIZA++ at http://cl.aist-nara.ac.jp/~eric-n/ubuntu-nlp/dists/dapper/nlp/. However, I want to use giza-pp (including mkcls) 1.0.1 from http://code.google.com/p/giza-pp/, so I packaged it myself. The resulting files are as follows:
gizapp_1.0.1-1ubuntu5.diff.gz
gizapp_1.0.1-1ubuntu5.dsc
gizapp_1.0.1-1ubuntu5_i386.build
gizapp_1.0.1-1ubuntu5_i386.changes
gizapp_1.0.1.orig.tar.gz
giza++-static_1.0.1-1ubuntu5_i386.deb
mkcls_1.0.1-1ubuntu5_i386.deb
Change notes: the command names have changed as follows:
GIZA++ changed to giza++.
snt2plain.out changed to snt2plain.
plain2snt.out changed to plain2snt.
snt2cooc.out changed to snt2cooc.
trainGIZA++ changed to train-giza++.
Lintian reported many warnings but I still don’t know how to fix them :-P.
Update: To pass Lintian tests, man pages are needed.
Usage example
Suppose there are two parallel plain-text files, one in English and one in Thai.
eng.txt:
a dog eat a chicken
a chichken eat a fish
tha.txt:
หมา กิน ไก่
ไก่ กิน ปลา
To align these texts, we run the following commands:
$ plain2snt eng.txt tha.txt
w1:eng w2:tha
eng -> eng
tha -> tha
$ train-giza++ eng.vcb tha.vcb eng_tha.snt
END.
The result will then be in GIZA++.A3.final:
$ cat GIZA++.A3.final
# Sentence pair (1) source length 5 target length 3 alignment score : 0.0373314
หมา กิน ไก่
NULL ({ }) a ({ }) dog ({ }) eat ({ }) a ({ 2 }) chicken ({ 1 3 })
# Sentence pair (2) source length 5 target length 3 alignment score : 0.0373315
ไก่ กิน ปลา
NULL ({ }) a ({ }) chichken ({ }) eat ({ }) a ({ 2 }) fish ({ 1 3 })
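To use this output in a program, note that each sentence pair takes three lines: a comment header, the Thai sentence, and then the English words, each followed by the 1-based positions of the Thai words aligned to it. Below is a minimal Python sketch of a parser. It is inferred only from the sample output above, and parse_a3 and SAMPLE are names I made up for illustration; they are not part of GIZA++.

```python
import re

# Matches one "word ({ i j ... })" group on the third line of each record.
TOKEN_RE = re.compile(r"(\S+) \(\{([\d ]*)\}\)")

# Sample copied from the GIZA++.A3.final output shown above.
SAMPLE = """\
# Sentence pair (1) source length 5 target length 3 alignment score : 0.0373314
หมา กิน ไก่
NULL ({ }) a ({ }) dog ({ }) eat ({ }) a ({ 2 }) chicken ({ 1 3 })
# Sentence pair (2) source length 5 target length 3 alignment score : 0.0373315
ไก่ กิน ปลา
NULL ({ }) a ({ }) chichken ({ }) eat ({ }) a ({ 2 }) fish ({ 1 3 })
"""


def parse_a3(lines):
    """Yield (thai_words, links) for each three-line record.

    links is a list of (english_word, thai_word) pairs.
    Assumes the three-line layout seen in the sample; blank lines are skipped.
    """
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(lines) - 2, 3):
        header, thai_line, align_line = lines[i:i + 3]
        if not header.startswith("# Sentence pair"):
            raise ValueError("unexpected header: " + header)
        thai_words = thai_line.split()
        links = []
        for word, positions in TOKEN_RE.findall(align_line):
            for pos in positions.split():
                # Positions are 1-based indexes into the Thai sentence.
                links.append((word, thai_words[int(pos) - 1]))
        yield thai_words, links


if __name__ == "__main__":
    for thai_words, links in parse_a3(SAMPLE.splitlines()):
        for english, thai in links:
            print(english, "->", thai)
```

To parse the real file instead of the sample, pass an open file object, e.g. parse_a3(open("GIZA++.A3.final", encoding="utf-8")).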
P.S. I built these packages on Ubuntu 7.10.