You can manually add the allowable tags for certain "domain-specific" words to the tagging model.
You must edit the files word.voc and tagdict in the project directory.
word.voc contains the common words of the training set, where a word is common if it occurs at least 5 times in the training data.
tagdict describes the list of allowable tags (induced from training data) for each common word.
In word.voc, add each new word in the format:
(The "0" is necessary for compatibility reasons).
Next, edit the file tagdict by adding each new word and a list of its allowable tags:
<word> <tag_1> <tag_2> ... <tag_N>
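The two edits above could be scripted roughly as follows. This is only a sketch: the helper names (format_voc_line, format_tagdict_line, add_domain_words) are hypothetical, the tagdict line follows the format shown above, and the word.voc line is ASSUMED to be the word followed by a space and "0". Check both against the existing entries in your own project directory, and back up the files first.

```python
from pathlib import Path
from typing import Dict, List


def format_voc_line(word: str) -> str:
    # ASSUMPTION: a word.voc entry is "<word> 0"; compare with existing lines.
    return f"{word} 0"


def format_tagdict_line(word: str, tags: List[str]) -> str:
    # Follows the tagdict format: <word> <tag_1> <tag_2> ... <tag_N>
    if not tags:
        raise ValueError(f"no allowable tags given for {word!r}")
    return " ".join([word, *tags])


def add_domain_words(project_dir: str, entries: Dict[str, List[str]]) -> None:
    """Append each word to word.voc and tagdict in the project directory.

    Assumes both files already end with a newline.
    """
    project = Path(project_dir)
    with open(project / "word.voc", "a") as voc, \
            open(project / "tagdict", "a") as tagdict:
        for word, tags in entries.items():
            voc.write(format_voc_line(word) + "\n")
            tagdict.write(format_tagdict_line(word, tags) + "\n")


# Hypothetical usage: allow "angioplasty" to be tagged only as a noun.
# add_domain_words("myproject", {"angioplasty": ["NN"]})
```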
MXPOST relies critically on the assumption that input sentences are tokenized according to the Penn Treebank conventions.
The correct tokenization for the above sentence is:
The `` stock '' rose $ .50 -LRB- to $ 5.00 -RRB- .
MXPOST's (correct) answer is then:
The_DT ``_`` stock_NN ''_'' rose_VBD $_$ .50_CD -LRB-_-LRB- to_TO $_$ 5.00_CD -RRB-_-RRB- ._.
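If you need the tagged output in a program, each token can be split into its word and tag. The sketch below assumes the output format shown above (tokens separated by whitespace, word and tag joined by an underscore); parse_tagged is a hypothetical helper, not part of MXPOST. It splits on the last underscore so that words which themselves contain underscores are kept whole.

```python
from typing import List, Tuple


def parse_tagged(line: str) -> List[Tuple[str, str]]:
    """Split one line of word_TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST underscore: the tag never contains one,
        # but the word might.
        word, tag = token.rsplit("_", 1)
        pairs.append((word, tag))
    return pairs


tagged = ("The_DT ``_`` stock_NN ''_'' rose_VBD $_$ .50_CD "
          "-LRB-_-LRB- to_TO $_$ 5.00_CD -RRB-_-RRB- ._.")
for word, tag in parse_tagged(tagged):
    print(word, tag)
```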
The Penn Treebank web page has a sample sed script that should serve as a starting point for your tokenization problems.
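To give a feel for what such a script does, here is a very partial sketch in Python. It is NOT the Penn Treebank sed script and covers only the conventions visible in the example above: double quotes, parentheses, dollar signs, and the sentence-final period. The function name ptb_tokenize and the raw input sentence are assumptions made for illustration. Contractions, commas, and other punctuation are not handled.

```python
import re


def ptb_tokenize(sentence: str) -> str:
    """Apply a small subset of Penn Treebank tokenization to one sentence."""
    s = sentence
    # Opening double quotes: at the start, or after a space or open bracket.
    s = re.sub(r'^"', "`` ", s)
    s = re.sub(r'([ (\[{<])"', r"\1 `` ", s)
    # Any double quote still left is treated as a closing quote.
    s = s.replace('"', " '' ")
    # Parentheses become -LRB- and -RRB-.
    s = s.replace("(", " -LRB- ").replace(")", " -RRB- ")
    # Separate the dollar sign from the amount that follows it.
    s = s.replace("$", " $ ")
    # Split off the sentence-final period only (not decimal points).
    s = re.sub(r"\.\s*$", " .", s)
    # Collapse the extra spaces introduced above.
    return " ".join(s.split())


# Assumed raw input, chosen to match the tokenized example above.
print(ptb_tokenize('The "stock" rose $.50 (to $5.00).'))
```

For real data, start from the sed script and treat a sketch like this only as a way to check that your input resembles the tokenized form before tagging.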