This week I started working on grouping the languages based on part-of-speech (POS) sequences. I began by building a list of every sequence of up to five POS labels found in the training data. From this list I was able to create a matrix holding the probability of one sequence over another for every language we have data for. Next I ran a program to perform a cosine similarity measurement. To do this, the computer treats every row of the matrix as a vector, similar to the ones you would see in a math or physics class, just with many more components. It then computes the cosine of the angle between each pair of vectors to see which ones are most similar and groups those together. Unfortunately, I was unable to get an actual grouping this week because the cosine similarity program failed with an error when I ran it. My adviser and I believe the cause is the size of the matrix: when sequences can be up to five POS labels long, it grows from several thousand columns to over a million. We believe this larger data set is too much for the computer to handle, so we decided to use a shorter maximum sequence length next week.
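To make the steps above more concrete, here is a minimal Python sketch of the general idea. It is not the program I actually ran: the function names, the toy tag sequences, and the `lang_a`/`lang_b`/`lang_c` labels are all made up for illustration, and turning counts into probabilities by dividing by the total is only one simple assumption about how the matrix could be filled. Each language's vector is stored as a dictionary containing only the sequences that actually occur.

```python
import math
from collections import Counter


def pos_ngram_counts(sentences, max_n=5):
    """Count every POS sequence of length 1..max_n in a list of tag lists."""
    counts = Counter()
    for tags in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tags) - n + 1):
                counts[tuple(tags[i:i + n])] += 1
    return counts


def to_probabilities(counts):
    """Turn raw counts into relative frequencies (an assumed, simple estimate)."""
    total = sum(counts.values())
    return {seq: c / total for seq, c in counts.items()}


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector has no direction; treat as dissimilar
    return dot / (norm_u * norm_v)


# Hypothetical toy data: two short tagged sentences per made-up language.
corpora = {
    "lang_a": [["DET", "NOUN", "VERB"], ["DET", "ADJ", "NOUN", "VERB"]],
    "lang_b": [["DET", "NOUN", "VERB"], ["DET", "NOUN", "VERB", "ADV"]],
    "lang_c": [["VERB", "NOUN", "DET"], ["ADV", "VERB", "NOUN"]],
}

# A shorter maximum sequence length (3 here) keeps the number of columns down.
vectors = {
    lang: to_probabilities(pos_ngram_counts(sents, max_n=3))
    for lang, sents in corpora.items()
}

langs = sorted(vectors)
for i, a in enumerate(langs):
    for b in langs[i + 1:]:
        print(a, b, round(cosine_similarity(vectors[a], vectors[b]), 3))
```

Because each dictionary holds only the sequences observed for that language, memory use depends on what actually appears in the data rather than on the full number of possible columns. Whether that would be enough to avoid the error I hit is something I have not tested.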