Multilingual Parsing to Universal Dependencies

2/9/2017

This week I created a basic parsing program trained on a small sample of English data. I did this by taking a base program, feeding it English sentences, and telling it the right way to parse each one. From these examples the program learns how to parse an English sentence correctly. The particular parser I created was trained on about 2,000 sentences, and the training took over an hour. After this, the program could parse English sentences into the Universal Dependencies format, which lists each word's position in the sentence, the actual word, the lemma (the word's base form), the part of speech, the language-specific part of speech, any other features of the word, which word "points" to it, and how the "pointing" word relates to it. A word that points to another word is called its head. The program is designed to give every word a head; the main verb gets its head from an imaginary first word, root. From there the verb points to any helping verbs, the subject, and the predicate. For example, running the program on the sentence "The boy walked across the street." gives the following result:

1   The   the   DET   DT   Definite=Def|PronType=Art   2   det   _   _
2   boy   boy   NOUN   NN   Number=Sing   3   nsubj   _   _
3   walked   walke   VERB   VBD   Mood=Ind|Tense=Past|VerbForm=Fin   0   root   _   _
4   across   across   ADP   IN   _   6   case   _   _
5   the   the   DET   DT   Definite=Def|PronType=Art   6   det   _   _
6   street   street   NOUN   NN   Number=Sing   3   nmod   _   SpaceAfter=No
7   .   .   PUNCT   .   _   3   punct   _   _

This raw output doesn't look very nice, but it has all the information I mentioned before. The first column is the position of each word in the sentence and the second is the actual word. Comparing the second column with the third, which holds the lemma, we can identify two words that aren't in their base form, "The" and "walked", and the program correctly picks out both. The first "The" is capitalized, so its lemma is just "the", but the lemma for "walked" should be "walk", not "walke". Since the program only had limited training data, it wasn't able to correctly identify the lemma of "walked", but other than this small mistake the sentence was correctly parsed into the Universal Dependencies format.
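To make the columns easier to read, here is a minimal Python sketch that splits each row of the output above into named fields, prints which word points to which, and lists the words whose lemma differs from the word itself. It is only an illustration and not part of the parser: the names `Token`, `parse_rows`, `describe_arcs`, and `changed_lemmas` are my own, the rows are split on whitespace because that is how they appear in this post (real CoNLL-U files separate columns with tabs), and it assumes every row has a plain integer position.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    """One row of the parser output (hypothetical helper type)."""
    index: int    # position of the word in the sentence
    form: str     # the actual word
    lemma: str    # base form predicted by the parser
    upos: str     # part of speech
    xpos: str     # language-specific part of speech
    feats: str    # other features of the word
    head: int     # position of the word that points to this one (0 = root)
    deprel: str   # how the head relates to this word
    deps: str     # unused in this example ("_")
    misc: str     # anything else, e.g. SpaceAfter=No


# The rows exactly as the parser produced them, including the "walke" mistake.
OUTPUT = """
1 The the DET DT Definite=Def|PronType=Art 2 det _ _
2 boy boy NOUN NN Number=Sing 3 nsubj _ _
3 walked walke VERB VBD Mood=Ind|Tense=Past|VerbForm=Fin 0 root _ _
4 across across ADP IN _ 6 case _ _
5 the the DET DT Definite=Def|PronType=Art 6 det _ _
6 street street NOUN NN Number=Sing 3 nmod _ SpaceAfter=No
7 . . PUNCT . _ 3 punct _ _
"""


def parse_rows(text: str) -> List[Token]:
    """Turn whitespace-separated rows into Token records, skipping odd rows."""
    tokens = []
    for line in text.strip().splitlines():
        cols = line.split()
        if len(cols) != 10:
            continue  # not a ten-column token row
        idx, form, lemma, upos, xpos, feats, head, deprel, deps, misc = cols
        tokens.append(Token(int(idx), form, lemma, upos, xpos,
                            feats, int(head), deprel, deps, misc))
    return tokens


def describe_arcs(tokens: List[Token]) -> List[str]:
    """Describe each word as 'head --relation--> word'."""
    forms = {t.index: t.form for t in tokens}
    forms[0] = "ROOT"  # the imaginary first word
    return [f"{forms[t.head]} --{t.deprel}--> {t.form}" for t in tokens]


def changed_lemmas(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Return (word, lemma) pairs where the lemma differs from the word."""
    return [(t.form, t.lemma) for t in tokens if t.form != t.lemma]


tokens = parse_rows(OUTPUT)
for arc in describe_arcs(tokens):
    print(arc)
print(changed_lemmas(tokens))
```

Run on the rows above, the last line lists "The" and "walked" as the only two words whose lemma differs from the word, which matches what can be read off the second and third columns by eye.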


Basic Parsing