Saturday, May 9, 2009

Analyzing Movable Type search logs

The OutSystems blog uses Movable Type. I was curious to see what type of searches were being made on the blog, so I downloaded the logs from MT's back-office and started thinking how I could analyze them.

I wanted to generate CSV files, so I could open them in Excel or Numbers. At first I considered Python, because it has good CSV support. But most of the work would consist in parsing the MT log file, so then I considered Perl. The problem with Perl is that I can never remember the syntax, so in the end I decided to just use the Unix command line.

I started out by checking unique sentences in the file:

grep "Search: query for" logfile.csv | cut -d \' -f 2 | tr '[A-Z]' '[a-z]' | sort | uniq -c | sort -nr | sed 's/ *\([0-9]*\) \(.*\)/\2, \1/' > sentences.csv

Here's what this does:
  1. grep "Search: query for" logfile.csv - Get all the lines from the log that are searches
  2. cut -d \' -f 2 - Extract the content of the search. This might not work if the content has ' on it, so be advised.
  3. tr '[A-Z]' '[a-z]' - Turn everything to lowercase.
  4. sort - Group the sentences (for uniq -c to work).
  5. uniq -c - Count the unique occurrences of each sentence.
  6. sort -nr - Sort by numbers, in reverse order.
  7. sed 's/ *\([0-9]*\) \(.*\)/\2, \1/' - Transform the result into a CSV file.
  8. > sentences.csv - Save to sentences.csv.
Turns out this wasn't very useful, because there are a lot of different sentences. One of them kind of stand out, but it accounted for about 5% of the searches. So I've added a bit of awk to do the same for the words, instead of the sentences:

grep "Search: query for" logfile.csv | cut -d \' -f 2 | tr '[A-Z]' '[a-z]' | awk '{for (i=1;i<=NF;i++) print $i}' | sort | uniq -c | sort -nr | sed 's/ *\([0-9]*\) \(.*\)/\2, \1/' > words.csv

The awk script splits the sentences into words. This issue a more interesting result: 26% of the searches include the word "Agile"! Note that this counts words like "and" and "the", but it's easy enough to remove them.

There's some more fun stuff you can do with this! Use Wordle to create a word cloud:

Or you can check where the accesses come from. If you have an IP geolocation DB, you can try this:

grep "Search: query for" $1 | cut -d , -f 2 | sort | uniq | awk '{print "SELECT country_name FROM ip_group_country where ip_start <= INET_ATON(\""$1"\") order by ip_start desc limit 1;"}' | mysql --skip-column-names -B -uGeoDBUser -pGeoDBPwd GeoDB | sort | uniq -c | sort -nr | sed 's/ *\([0-9]*\) \(.*\)/\2, \1/' > geo.csv

You can then open the file and make a nice chart! Have fun!