Tuesday, April 26, 2005

5 Must Watch Movies - Thanks Orkut!



Summary

In this article, I describe how I generated a list of most cited books and movies from the profiles of the people who are connected to me at Orkut, Google's online social-networking service. Orkut does not provide any Web-Services at this time that allow for writing such a tool, so I wrote a script in Perl that used the WWW::Mechanize module available from CPAN and parsed the HTML of the webpages that contained the required data. The results were very satisfactory, and I strongly believe that there is a whole wealth of information that can be discovered by aggregating the user-data available in social-networks. Perhaps, Google should consider providing statistics to each user in Orkut, about the most popular books and movies among his/her friends, and link each of them to Amazon, or another shopping website that exposes Web-Services.


The Idea

People generally mention only the movies they enjoyed watching most in their profiles. So for a given set of people, a movie that is mentioned the most by people in their profiles would definitly be more worth watching than the one that is mentioned comparatively lesser number of times.

Hence, I felt that a great way to determine the movies that I must watch would be to aggregate information about the movies from the profile pages of all my friends at Orkut, and all the other people that I am connected to through them. If I can further arrange the obtained set of movies in descending order of the number of times each movie is mentioned, I would be able to easily decide which movie I must go to the DVD store and ask about next!

So, what did I do?

Well, armed with Perl and WWW::Mechanize I just wrote one such hack-toy. So, ladies and gentlemen, spiders and scrappers, lo and behold... I give you the movies with the Top-10 most cited movies in the profiles of my friends and friends-thereof at Orkut.
  1. Life is Beautiful (20)
  2. Matrix (20)
  3. DDLJ (A Bollywood Movie) (15)
  4. Shawshank Redemption (15)
  5. Sholay (Another Bollywood Movie) (12)
  6. Forrest Gump (12)
  7. Gladiator (11)
  8. Anand (Yet Another Bollywood Movie) (11)
  9. Braveheart (9)
  10. Titanic (8)


How it works

The algorithm to do this is quite simple. You, your friends, friends of your friends, and so-on, can be arranged in a directed graph structure such that the vertices of the graph correspond to the users while the edges correspond to the function has-friend. So, if B is a friend of A, simply draw a directed edge from A to B. Consider the following graph structure:


(Enlarge Image)


The graph depicts that I have a friend called A, and A also considers me a friend. A has a friend called C who considers me as a friend. A has another friend called B, who has a friend called D and D considers me as a friend too.

The problem now is to simply visit all the vertices in the graph and build a hash of movie-names and the citation counts by processing the profile page of the user corresponding to the vertex currently being processed.

But hey, there is a slight problem!

Orkut on its home-page tells me that:
You are connected to 5283820 people through 21 friends.

Assuming that I am able to fetch and analyze the profile of each person in 1.5 seconds (Slow Network IO. Update on 10/01/2013: Woah! My internet connection was slooow back then), it would take a total of about 92 days to analyze all the persons that I am connected to at Orkut. 92 days!

Hence, I would want to start at me in the grapth and be able to restrict the analysis to people who are at a maximum distance max_distance from me. Additionaly, I should be able to provide max_distance to the algorithm.

The Algorithm

We can visit all the vertices in the graph by adopting the depth-first-retrieval technique starting at me in the graph, but with a slight modification. We would like to stop the DFS if the distance of the current vertex (user) in the graph is greater than or equal to max_distance (I am at a distance of zero from myself). Hence, we can have a recursive algorithm that would look something like:

sub find_movies {
    my $uid              = shift;
    my $current_distance = shift;
    my $max_distance     = shift;

    process_user($uid);
    $visited{$uid} = 1;

    # Process friends of current-user, only if his distance is less
    # than the maximum allowed distance from the start-user.

    if ($current_distance < $max_distance) {
        foreach (find_friends_of($uid)) {
            unless ($visited{$_}) {
                find_movies($_, $current_distance + 1, $max_distance);
            }
        }
    }
}

The process_user function processes the profile page (Profile.aspx) corresponding to the specified user and then updates the hash of movies. The find_friends_of function process the network of friends mentioned at Orkut (FriendsNet.aspx) and returns a list containing the identifiers of the users that are listed as friends of the specified user at Orkut.

find_movies will be initially invoked as find_movies($your_user_id, 0, $max_distance) and the visited hash would be initially empty. After the find_movies function call returns, we will have to sort the values in the movies hash that would contain all the movies mapped to their citation counts and print them out.

Further Work

The next most important thing is to figure out a way to do an approximate comparison of movie names because different people tend to spell the names of non English language movies differently. The text is mostly free-form which brings a number of issues into the picture. E.g. "The Banana", "Banana", "Banana...", "Banana...etc." are all different strings but could be the same movie.

Most Popular Books

I used the same technique to determine the names of the books cited most by my friends and friends thereof in their Orkut profiles. And, here're the Top-5 most-cited books:

  1. Fountain Head (21)
  2. Da Vinci Code (16)
  3. The Godfather (15)
  4. Catch 22 (13)
  5. Harry Potter Series (12)

The Cloud of Books

Here is a longer list of books in a cloud-like fashion:




Want an even longer list of movies and books?

A longer list of most-cited books and movies is available here.

Conclusion

There is a lot of precious information that can be discovered by aggregating the information in the profiles of users. This is just one example. I have some other ideas too, so keep watching - Carry On Coding! :)
Perhaps, Google could give thought to providing statistics about the most popular books and movies among the friends of a given user, and link each of them to Amazon, or another shopping site that exposes Web-Services.

Source Code?!

I have intentionally left out the gory details of my implementation, and not provided the source-code of the script, because a number of people using the script could possibly generate a lot of load on Orkut's servers and I don't want the Orkut people coming after me! :-p

I hope you liked reading this write up...

16 comments:

bheema v said...

Nice, fast work.

And now send me the code.

Sid said...

I just applied the same technique to determine the names of the books most cited by my friends and friends-thereof at Orkut. Check the post for results!

I trust you... I'll send you the code! :)

Alpha0 said...

Great. Great. Great.

Can you provide an exhaustive list of movies and books?

Few Questions:
1. How did you login using script?
2. How did you make the matches while doing the aggregation of books and movies because people might write 'The Matrix' or 'Matrix'. Some just put "....." and "all the movies of Arnold".
How did you tacle this human created chaos?

Sid said...

Great. Great. Great.

Thanks. Thanks. Thanks! :)

Can you provide an exhaustive list of movies and books?

Check the post for the lists of Top-5 movies and books mentioned in the profiles of my friends and friends-thereof.

Few Questions:
1. How did you login using script?


I used WWW::Mechanize Perl module and logged in by filling out the credentials on the Login.aspx page before accessing any other pages. WWW::Mechanize makes all of it very easy!

2. How did you make the matches while doing the aggregation of books and movies because people might write 'The Matrix' or 'Matrix'. Some just put "....." and "all the movies of Arnold".
How did you tacle this human created chaos?


I realized these problems and finding solutions to them falls under "Future Work". I have some ideas and hope to solve them reasonably in the next version of the script.

Alpha0 said...

I mean all books and movies.

Mechanize real bas...d..

Suggestions for matching:
1. remove "^The "
2.
foreach(word in name1){
if(name2.contains(word))
count1++;
totalNumWords1++;
}

foreach(word in name2){
if(name1.contains(word))
count2++;
totalNumWords2++;
}

matchdegree = (count1+count2)/(totalNumWords1 + totalNumWords2).

Sid said...

BTW I implemented this hack at around 3 AM.

I just updated this post with the section entitled - But hey, there is a slight problem!. And while I was at it, also added the customary Conclusion section.

I just generated a list of all books cited by all the people who are connected to me at Orkut at a maximum distance of 2.

The script generated a CSV file containing the names of the books along with their citation counts and I would mail the 20KB file out to you.

BTW, Get a decent handle "AlphaQ" :)

Alpha0 said...

Cool.
I was just going through some website displaying the top 10 books. And strangely enough "Fountain Head" was nowhere.
Just check it out: http://www.abc.net.au/myfavouritebook/top10/default.htm

Sid said...

I was just going through some website displaying the top 10 books. And strangely enough "Fountain Head" was nowhere.

Well, all the results mean is that "Fountain Head" appeared the most in the list of books mentioned by people in their Orkut profiles who were at a maximum distance of 2 from me. The citation count is the only criteria that decided the order in which my script listed the books, which is to say that, you could arrange them in another order based on another criteria.

The idea is that if most people have mentioned a certain book in their profiles, and those people are linked to me via some friends, those books certainly must be worth reading.

My script just helps you find the set of books listed by people in their profiles, and the number of times each book was cited. If a book is cited more times by another, it must certainly be more worth reading than the other.

I am sure you would be able to find many other lists around where "Fountain Head" would appear first in the list of top books.

Neel said...

Sid..

nice piece of ideas & code followed by good effort.. :)

Regarding the problem of slowness:

This program can be written in some beautiful language like java and you can keep assigning threads to get adjacent vertices list of a given vertex and then assign another set of threads in parallel to get the movie information from the profile of various vertices..

now you can have a separate jobs to convert the above movie names into meaning full "key name" (explained below) -> count value pair or even above jobs can do tht..


Now regarding the problem of movie name written differently or converting various style / versions of same movie name to a same "Key name":

this is very standard problem and on web there is lot of junk available for this..

basically you need to
1. do the translation to english if reqd...
2. remove stop words like a, the etc.
3. convert the words to their morphological roots or do stemming. there is standard implementation of porter algo on net for this..

anyway I am just throwing light not discussing exact details..

Though it will interesting to crawl the whole orkut and get some meaningful information out of it sometime :)

hey, nice work again :)

Anonymous said...

Hi Sid,

I am a non-techi, but need your help on a forum at orkut- http://www.orkut.com/CommTopics.aspx?cmm=11344550

lemme know if I could buzz you sometime.

MedRite
medrite@gmail.com

Anonymous said...

Hi Sid,

I am a non-techi, but need your help on a forum at orkut- http://www.orkut.com/CommTopics.aspx?cmm=11344550

lemme know if I could buzz you sometime.

MedRite
medrite@gmail.com

Anonymous said...

nice work

mitanshu said...

Nice work dude!. Check this post : http://mitanshu.blogspot.com/2006/06/indias-growth-in-orkut.html

Vili Maunula said...

An interesting idea, and one that I have not thought about. I have, however, lately approached the issue somewhat differently by monitoring blogs, websites, news media and what have you. You can see the results at http://www.moviesay.com/ .

I hope you don't consider this as spamming your blog. The subject matter simply seemed like such a perfect match to my website that I couldn't resist mentioning it. :)

ramya said...

hi..

when i tried to use the module WWW::Mechanize in my program i go tthe following error

Can't locate object method "decoded_content" via package "HTTP::Headers" at (eval 14) line 1.

are there any modules that i need to have for WWW::Mechanize to work?

Anonymous said...

> ramya,

you require a newer version of HTTP::Message, just get source from CPAN and install locally.