Tuesday, April 26, 2005

5 Must Watch Movies - Thanks Orkut!



Summary

In this article, I describe how I generated a list of most cited books and movies from the profiles of the people who are connected to me at Orkut, Google's online social-networking service. Orkut does not provide any Web-Services at this time that allow for writing such a tool, so I wrote a script in Perl that used the WWW::Mechanize module available from CPAN and parsed the HTML of the webpages that contained the required data. The results were very satisfactory, and I strongly believe that there is a whole wealth of information that can be discovered by aggregating the user-data available in social-networks. Perhaps, Google should consider providing statistics to each user in Orkut, about the most popular books and movies among his/her friends, and link each of them to Amazon, or another shopping website that exposes Web-Services.


The Idea

People generally mention only the movies they enjoyed watching most in their profiles. So for a given set of people, a movie that is mentioned the most by people in their profiles would definitly be more worth watching than the one that is mentioned comparatively lesser number of times.

Hence, I felt that a great way to determine the movies that I must watch would be to aggregate information about the movies from the profile pages of all my friends at Orkut, and all the other people that I am connected to through them. If I can further arrange the obtained set of movies in descending order of the number of times each movie is mentioned, I would be able to easily decide which movie I must go to the DVD store and ask about next!

So, what did I do?

Well, armed with Perl and WWW::Mechanize I just wrote one such hack-toy. So, ladies and gentlemen, spiders and scrappers, lo and behold... I give you the movies with the Top-10 most cited movies in the profiles of my friends and friends-thereof at Orkut.
  1. Life is Beautiful (20)
  2. Matrix (20)
  3. DDLJ (A Bollywood Movie) (15)
  4. Shawshank Redemption (15)
  5. Sholay (Another Bollywood Movie) (12)
  6. Forrest Gump (12)
  7. Gladiator (11)
  8. Anand (Yet Another Bollywood Movie) (11)
  9. Braveheart (9)
  10. Titanic (8)


How it works

The algorithm to do this is quite simple. You, your friends, friends of your friends, and so-on, can be arranged in a directed graph structure such that the vertices of the graph correspond to the users while the edges correspond to the function has-friend. So, if B is a friend of A, simply draw a directed edge from A to B. Consider the following graph structure:


(Enlarge Image)


The graph depicts that I have a friend called A, and A also considers me a friend. A has a friend called C who considers me as a friend. A has another friend called B, who has a friend called D and D considers me as a friend too.

The problem now is to simply visit all the vertices in the graph and build a hash of movie-names and the citation counts by processing the profile page of the user corresponding to the vertex currently being processed.

But hey, there is a slight problem!

Orkut on its home-page tells me that:
You are connected to 5283820 people through 21 friends.

Assuming that I am able to fetch and analyze the profile of each person in 1.5 seconds (Slow Network IO. Update on 10/01/2013: Woah! My internet connection was slooow back then), it would take a total of about 92 days to analyze all the persons that I am connected to at Orkut. 92 days!

Hence, I would want to start at me in the grapth and be able to restrict the analysis to people who are at a maximum distance max_distance from me. Additionaly, I should be able to provide max_distance to the algorithm.

The Algorithm

We can visit all the vertices in the graph by adopting the depth-first-retrieval technique starting at me in the graph, but with a slight modification. We would like to stop the DFS if the distance of the current vertex (user) in the graph is greater than or equal to max_distance (I am at a distance of zero from myself). Hence, we can have a recursive algorithm that would look something like:

sub find_movies {
    my $uid              = shift;
    my $current_distance = shift;
    my $max_distance     = shift;

    process_user($uid);
    $visited{$uid} = 1;

    # Process friends of current-user, only if his distance is less
    # than the maximum allowed distance from the start-user.

    if ($current_distance < $max_distance) {
        foreach (find_friends_of($uid)) {
            unless ($visited{$_}) {
                find_movies($_, $current_distance + 1, $max_distance);
            }
        }
    }
}

The process_user function processes the profile page (Profile.aspx) corresponding to the specified user and then updates the hash of movies. The find_friends_of function process the network of friends mentioned at Orkut (FriendsNet.aspx) and returns a list containing the identifiers of the users that are listed as friends of the specified user at Orkut.

find_movies will be initially invoked as find_movies($your_user_id, 0, $max_distance) and the visited hash would be initially empty. After the find_movies function call returns, we will have to sort the values in the movies hash that would contain all the movies mapped to their citation counts and print them out.

Further Work

The next most important thing is to figure out a way to do an approximate comparison of movie names because different people tend to spell the names of non English language movies differently. The text is mostly free-form which brings a number of issues into the picture. E.g. "The Banana", "Banana", "Banana...", "Banana...etc." are all different strings but could be the same movie.

Most Popular Books

I used the same technique to determine the names of the books cited most by my friends and friends thereof in their Orkut profiles. And, here're the Top-5 most-cited books:

  1. Fountain Head (21)
  2. Da Vinci Code (16)
  3. The Godfather (15)
  4. Catch 22 (13)
  5. Harry Potter Series (12)

The Cloud of Books

Here is a longer list of books in a cloud-like fashion:




Want an even longer list of movies and books?

A longer list of most-cited books and movies is available here.

Conclusion

There is a lot of precious information that can be discovered by aggregating the information in the profiles of users. This is just one example. I have some other ideas too, so keep watching - Carry On Coding! :)
Perhaps, Google could give thought to providing statistics about the most popular books and movies among the friends of a given user, and link each of them to Amazon, or another shopping site that exposes Web-Services.

Source Code?!

I have intentionally left out the gory details of my implementation, and not provided the source-code of the script, because a number of people using the script could possibly generate a lot of load on Orkut's servers and I don't want the Orkut people coming after me! :-p

I hope you liked reading this write up...