Automatic Tagging

September 23rd, 2013

Challenge

Your challenge, should you choose to accept it, is to implement an efficient and effective supervised algorithm that automatically tags short text documents (each comprising a title and a paragraph or two of body text). That is, devise a method that assigns one or more tags from the set \{T_1, T_2, ..., T_M\} to each of the given documents \{D_1, D_2, ..., D_N\}.

To make the challenge more interesting, the training set is pretty large:

  • N ~ 6 million.
  • M ~ 40,000.
  • 17M tag assignments in total.
  • Each document (a question) has 1-5 tags, typically 2-3.
  • The most common tag is used ~400k times.
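As a point of reference, here is a minimal sketch of a naive baseline, not the approach this post goes on to describe. It assumes each training example is a `(text, tags)` pair, counts how often every word co-occurs with every tag, and lets each word in a new document vote for tags in proportion to those counts. The function names, the tokenizer, and the toy data are all hypothetical, chosen purely for illustration.

```python
from collections import Counter, defaultdict
import re


def tokenize(text):
    # Crude tokenizer (an assumption): lowercase runs of letters, digits, '+' and '#'.
    return re.findall(r"[a-z0-9+#]+", text.lower())


def train(docs):
    """Count, for every word, how often it appears alongside each tag."""
    word_tag = defaultdict(Counter)
    for text, tags in docs:
        for word in set(tokenize(text)):
            word_tag[word].update(tags)
    return word_tag


def predict(word_tag, text, max_tags=3):
    """Each known word votes for tags with weight count(word, tag) / count(word, any tag)."""
    scores = Counter()
    for word in set(tokenize(text)):
        counts = word_tag.get(word)
        if not counts:
            continue
        total = sum(counts.values())
        for tag, count in counts.items():
            scores[tag] += count / total
    return [tag for tag, _ in scores.most_common(max_tags)]


# Hypothetical toy training set, standing in for the ~6 million real documents.
train_docs = [
    ("How do I sort a list in Python", ["python", "sorting"]),
    ("Python dictionary comprehension syntax", ["python"]),
    ("Segmentation fault when freeing pointer in C", ["c", "pointers"]),
    ("Pointer arithmetic on arrays in C", ["c", "pointers", "arrays"]),
]
model = train(train_docs)
print(predict(model, "Why does my Python list sort fail", max_tags=1))
```

A baseline like this ignores word order and does nothing special for rare tags, so it is best read as a floor to beat rather than a serious entry. At the scale quoted above, the word-tag table would also need pruning or sparse storage.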


Startling word discovery

May 25th, 2012

My eight-year-old daughter came home the other day, wrote the word 'startling' on a piece of paper and put it in front of my face. I have come to learn that this is her way of saying,

"Daddy, let's play this word game that I learnt at school!
Oh, and by the way, let's skip the boring part where I tell you the rules.
Instead let's just play and when you make a mistake I'll tell you."

I'll be a bit kinder and explain the rules at the start: you have to pick a letter to remove so that the remaining letters, kept in the same order, still form a word. (So in the first round, I could have picked the letter 't' or 'l', which would result in the words 'starling' or 'starting', respectively.)

You win the game if you correctly identify a letter to eliminate at each stage, so that your last valid word is only a single letter long. For example, here is one way you can produce a winning sequence:

startling, starting, staring, string, sting, sing, sin, in, I.

She then said that her teacher had told the class that this was the longest word where you can go all the way down to a single letter. Being as curious as I am, I thought I would try to verify this. Furthermore, I was intrigued as to what other words had this characteristic.
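The rules translate quite directly into a recursive search. The following is a minimal sketch under stated assumptions, not necessarily how the check was actually done: a word is "winnable" if it is in the dictionary and is either one letter long or becomes a winnable word after deleting one letter. The `WORDS` set is a hypothetical toy dictionary; a real check of the "longest word" claim would need a full word list.

```python
from functools import lru_cache

# Hypothetical toy dictionary; swap in a full word list for a real check.
WORDS = {
    "startling", "starting", "starling", "staring", "string",
    "sting", "sing", "sin", "in", "i", "a",
}


def make_solver(words):
    """Return a function mapping a word to one winning sequence, or None."""
    words = frozenset(w.lower() for w in words)

    @lru_cache(maxsize=None)
    def solve(word):
        if word not in words:
            return None
        if len(word) == 1:
            return (word,)
        # Try deleting the letter at each position in turn.
        for i in range(len(word)):
            rest = solve(word[:i] + word[i + 1:])
            if rest is not None:
                return (word,) + rest
        return None

    return solve


solve = make_solver(WORDS)
chain = solve("startling")
print(", ".join(chain))
```

The cache means each candidate word is examined only once, so running `solve` over every entry of a large dictionary and keeping the longest word with a non-`None` result stays cheap. Note that the search returns the first winning sequence it finds, which may differ from the one listed above.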
