By Miranda Louwerse (Contributor)
Print Edition: February 25, 2015
Math and romance were the topics of a recent lecture by Thomas Levi, a senior data scientist at PlentyOfFish, the world’s largest free dating site. The event, hosted by the math club, drew students and faculty from the CIS and math departments as well as a variety of other disciplines. Levi spoke about how PlentyOfFish developed the system it uses to match users and described how that system works, offering UFV students the chance to see applications of the mathematics they learn.
PlentyOfFish allows millions of users to enter their interests as “free text” rather than through a questionnaire, which is easy for users but complicates the programming. Free text is hard to gather and group consistently because of spelling mistakes, synonyms, and closely related interests. Rather than matching on exact words, the system aims to create an archetype of each user so that they can be matched to similar users.
Levi described the process and general method of Latent Dirichlet Allocation (LDA). LDA is a model that explains why some parts of a data set are similar by organizing the data into groups that are not explicitly present in it. In this application, the words users enter as interests are sorted into groups that are not observed in the data set but that a human might create, such as outdoor sports or TV shows.
The process begins by taking every word entered by users, which Levi describes as a several-hundred-thousand-word vocabulary. After preprocessing the data to fix spelling mistakes, remove common words, and truncate longer words, the program sets a fixed number of topics for the words to be fit into. Then LDA is run to find the word distributions within each topic. Levi showed an example output from LDA that demonstrated its accuracy. One group, which Levi labelled sports, assigned high probability to words like baseball and hockey, and another group, labelled TV, assigned high probability to Big Bang Theory and Game of Thrones, while words like “shopping” or “flowers” would likely have a lower probability of appearing in either of these groups.
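The steps described above can be sketched in a few lines of Python. This is only an illustration, not the system Levi presented: the stop-word list, the truncation length, the hyperparameters `alpha` and `beta`, and the use of collapsed Gibbs sampling to fit the model are all assumptions, and spelling correction is left out. It also treats each word separately, so a title like Game of Thrones is split into individual words.

```python
import random

# Illustrative assumptions, not values from the talk.
STOP_WORDS = {"i", "and", "the", "a", "of", "to", "like", "love"}
MAX_LEN = 8  # assumed cut-off for truncating longer words


def preprocess(text):
    """Lowercase the text, drop common words, and truncate long words."""
    words = text.lower().replace(",", " ").split()
    return [w[:MAX_LEN] for w in words if w not in STOP_WORDS]


def fit_lda(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    Returns (vocab, topic_word_probs, doc_topic_probs).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]              # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                            # total words assigned to topic k
    assign = []                                               # current topic of every word

    # Start from a random topic assignment for every word.
    for d, doc in enumerate(docs):
        topics = []
        for w in doc:
            k = rng.randrange(num_topics)
            topics.append(k)
            doc_topic[d][k] += 1
            topic_word[k][index[w]] += 1
            topic_total[k] += 1
        assign.append(topics)

    # Repeatedly resample each word's topic given all the other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi, k = index[w], assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][wi] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wi] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wi] += 1
                topic_total[k] += 1

    # Turn the final counts into smoothed probability distributions.
    topic_word_probs = [
        [(topic_word[t][v] + beta) / (topic_total[t] + vocab_size * beta)
         for v in range(vocab_size)]
        for t in range(num_topics)
    ]
    doc_topic_probs = [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return vocab, topic_word_probs, doc_topic_probs


# Made-up example profiles.
profiles = [
    "I love hockey and baseball",
    "baseball, hockey, hiking",
    "Game of Thrones and Big Bang Theory",
    "Big Bang Theory, Game of Thrones",
]
docs = [preprocess(p) for p in profiles]
vocab, topic_word_probs, doc_topic_probs = fit_lda(docs, num_topics=2)
```

Each row of `topic_word_probs` is one topic's distribution over the vocabulary, and each row of `doc_topic_probs` says how much of a profile falls into each topic. On a data set this small the groups may not separate cleanly; a real system would use far more text and a tuned library implementation.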
So how does this work for users? When a user inputs their interests, the program creates a topic mixture vector, which describes how much of the user’s interests fall into each topic. For example, a user could be typified as 30 per cent outdoor sports, 10 per cent culture, 40 per cent nerdy TV, and 20 per cent intellectual, all based on the interests that they input. The user is then matched to others who share similar archetypes by finding the smallest difference between topic vectors. Users can also search for certain interests in others. For example, a user can search “Game of Thrones” to find people who like “Game of Thrones,” or even other similar TV shows.
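As a rough illustration of the matching step, the sketch below ranks other users by how far their topic mixture vectors are from the example user's. The article does not say how the difference between vectors is measured, so Euclidean distance is an assumption here, and the names and numbers for the other users are made up.

```python
import math


def distance(u, v):
    """Euclidean distance between two topic mixture vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def closest_matches(user, others, top_n=2):
    """Return the names of the top_n users whose vectors are nearest to user."""
    ranked = sorted(others, key=lambda name: distance(user, others[name]))
    return ranked[:top_n]


# Topic order: outdoor sports, culture, nerdy TV, intellectual.
user = [0.30, 0.10, 0.40, 0.20]

# Hypothetical users for illustration only.
others = {
    "alex": [0.25, 0.15, 0.45, 0.15],
    "blake": [0.70, 0.10, 0.10, 0.10],
    "casey": [0.10, 0.50, 0.10, 0.30],
}

matches = closest_matches(user, others)
print(matches)  # ['alex', 'blake']
```

Here "alex" comes first because that mixture differs from the user's by only five percentage points in each topic, while "casey", whose interests lean heavily toward culture, is ranked last and left out of the top two.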
Such a program recognizes that people are complex and typically have a wide variety of interests. This is an astounding illustration of the complexity to which computer programming and mathematics have risen. Using LDA to solve the complex problem of matching two people based on its own determination of their interests shows the power of mathematics and computing in the real world, and hints at the progress that is to come.
The math club is hosting another talk on Thursday, March 19. Details will be posted on the math club web page.