本次CS代写的主要涉及如下领域: Python代写,Algorithm代写,Natural Language Processing代写,北美程序代写,加拿大程序代写,CS341代写,University of Waterloo代写
CS 341 Natural Language Processing
Viterbi Decoding for POS Tagging
Professor Heather Pon-Barry
May be submitted up until April 24th
General Instructions:
- This assignment is to implement the Viterbi algorithm and apply it to the problem of part-of-speech tagging, where the hidden states are the POS tags and the observations are the words. In this problem, the Viterbi lattice should be fully connected.
- The data for this assignment is available on Piazza, in the filepos-viterbi-data.zip.
- Starter code in python is provided. It makes use of NLTK, the toolkit used in lab last week. If you want to install nltk on your own computer, instructions are at: http://www.nltk.org/install.html. You’ll need to use the nltk downloader to download thetreebankcorpus.
- The assignment provides you with python code for learning transition and emission probabilities for an HMM. By default, it learns these parameters using tagged sen- tences from NLTK’streebankcoprus. You don’t need to modify this part of the code.
Getting Started
The starter code includes two python files:hmm.pyandhmmtrain.py. Inhmm.pywe define a class calledhmm. As soon as you instantiate this class, the various parameters of the HMM (transition probabilities, emission probabilities etc.) are automatically computed using the firstN(defaultN= 5000) sentences of thenltk.corpus.treebankcorpus as the training data (using functions defined inhmmtrain.py). The following five parameters are available to each instance of thehmmclass:
- transitions:The probability of transitioning from one state to another. To get the probability of going to states2from states1, useself.transitions[s1].prob(s2). Unlike the HMM for ASR, every state-to-state transition has a different probability.
- emissions: The probability of emitting/observing a particular output symbol from a particular state. To get the probability of emitting a word w in state s, use self.emissions[s].prob(w).
1
- priors: The probability of starting in a particular state. To get the probability that the HMM starts in states, useself.priors.prob(s).^1
- states:The states (tags) in the hidden state layer of the trained HMM.^2
- symbols:The output symbols (words) in the trained HMM.
The probability values that are calculated byhmmtrain are going to be extremely
small in scale. Multiplying two very small numbers can lead to loss of precision. Use
the log of the probabilities (logprobs) instead of raw probabilities. For example, use
self.transitions[s1].logprob(s2). Seehttp://nltk.org/_modules/nltk/probability.
htmlfor details of the nltk probability distribution interfaces.
This assignment requires a two-dimensional chart. Rather than using Python lists to
implement such a chart, I recommend using thenumpypackage’sarrayobject and the
float32datatype.
Problem 1: Viterbi Decoding for POS Tagging
(a) Add adecodemethod to thehmmclass that performs Viterbi decoding to find the most
likely tag sequence for a given word sequence.
(b) Add a tagViterbimethod to thehmm class that takes a file with one (tokenized) sentence per line as input and tags the words in the sentence using Viterbi decoding. It should output tagged sentences of the form: This/DT is/VBZ a/DT sentence/NN.
(c) According to yourtagViterbimethod, what is the most likely tag sequence for each
of the five sentences in the filesentences.txt? Include your tagged sentences in your
write-up.
(d) Do you observe any errors in the tags assigned? If so, note them.
(e) Is every word in every sentence assigned some tag? If not, speculate about why this is
the case.
Deliverables
- Save your write-up as a PDF.
- Put your python code in a directory calledcode. If you have any additional scripts, include aREADMEexplaining the contents of each file. I may refer to your code, so be sure to comment your code so I can read it.
- Zip these files and upload to moodle.
(^1) This initial probability distribution is equivalent to using a special stateq 0 , i.e.,prior(q 1 ) =P(q 1 |q 0 ) (^2) -LRB-and-RRB-are the tags used for the left parenthesis tag ‘(’ and the right parenthesis tag, ‘)’.