SteamMusings is a Twitter bot written in C# which makes (occasionally nonsensical) quips about the PC games available on Valve's digital distribution service, Steam.
HOW DOES IT WORK?
The concept behind SteamMusings is simple: pick a random game available on Steam, gather user reviews for that game, extract significant terms from those reviews, and craft a tweet using one of those terms. But this simple algorithm belies a lot of underlying complexity. Let's examine each step in more detail.
PICKING A GAME & GATHERING REVIEWS
Picking a game is reasonably straightforward. SteamMusings uses Steam's web interface to scrape a list of all titles available on Steam and then picks a random game with 50 or more user reviews from that list. The more reviews a game has, the easier it becomes to determine which terms are significant.
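A minimal sketch of the selection step, written in Python for illustration rather than taken from the bot's C# source. The `games` list, its field names, and the `pick_game` helper are all assumptions; the real list comes from scraping Steam.

```python
import random

MIN_REVIEWS = 50  # the threshold described above


def pick_game(games, rng=random):
    """Return a random game that has at least MIN_REVIEWS user reviews."""
    eligible = [g for g in games if g["review_count"] >= MIN_REVIEWS]
    if not eligible:
        raise ValueError("no game has enough reviews")
    return rng.choice(eligible)


# Hypothetical sample data standing in for the scraped list of titles.
games = [
    {"name": "Game A", "review_count": 12},
    {"name": "Game B", "review_count": 50},
    {"name": "Game C", "review_count": 4310},
]
chosen = pick_game(games)
```

Filtering before choosing keeps the pick uniform across eligible games, which would not be the case if the code retried random picks until one passed the threshold and gave up after a few attempts.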
Once a game has been selected, SteamMusings uses Steam's (undocumented) reviews API to gather several dozen user reviews for that game. It then produces a word count for each word which appears in those reviews. But this is where things start to get interesting...
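The counting step might look roughly like the following Python sketch. The tokenizing rule (lowercase letters with an optional apostrophe) and the `count_words` name are assumptions; the original does not say how it splits reviews into words.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of letters, optionally with one inner apostrophe.
WORD_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def count_words(reviews):
    """Count how many times each lowercase word appears across all reviews."""
    counts = Counter()
    for text in reviews:
        counts.update(WORD_PATTERN.findall(text.lower()))
    return counts


# Hypothetical review texts standing in for the API results.
reviews = ["The graphics are great.", "Great story, the best!"]
counts = count_words(reviews)
```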
DETERMINING SIGNIFICANT TERMS
Relying on raw word counts to determine which terms are significant produces very bad results. The most common words in Steam reviews tend to be the most common words in the English language ("the", "a", "of", "to", etc.). What we really want is to find the words which occur most often in our game's reviews relative to all other reviews on Steam. There's actually a statistic used to measure this value; it's called the "term frequency–inverse document frequency" statistic, or tf-idf for short. But in order to calculate the tf-idf for a given word, you need to know how frequently that word occurs in a given body of text. That is to say, we need to know how often any given word occurs in a typical Steam review.
So, in order to determine the inverse document frequency statistic, I wrote a separate program to gather reviews from every single game on Steam. Running this program takes 4-5 hours! But when the program is finished, it produces a very handy chart containing the frequency with which a given English word occurs in Steam reviews. (Did you know that the word "graphics" appears in approximately 12% of Steam reviews? It's true!)
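As a rough illustration of what such a chart contains, the Python sketch below computes, for each word, the fraction of reviews in which it appears at least once. The `document_frequency` name, the simple tokenizer, and the four-review corpus are assumptions made for the example; the real program works over reviews from the whole Steam catalog.

```python
import re
from collections import Counter


def document_frequency(reviews):
    """Map each word to the fraction of reviews it appears in at least once."""
    doc_counts = Counter()
    for text in reviews:
        # set() so a word repeated inside one review is only counted once
        doc_counts.update(set(re.findall(r"[a-z]+", text.lower())))
    total = len(reviews)
    return {word: n / total for word, n in doc_counts.items()}


# Hypothetical corpus standing in for reviews gathered across Steam.
corpus = [
    "Great graphics and great music",
    "The story is short",
    "The graphics are dated",
    "Fun with friends",
]
df = document_frequency(corpus)
```

In this toy corpus "graphics" appears in two of four reviews, so its document frequency is 0.5, while "great" is 0.25 even though it occurs twice in the first review.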
By dividing the term frequency (the word count) by the document frequency (the rate at which a word appears in the "average" Steam review), it becomes possible to identify terms which are significant for reviews of our chosen game. SteamMusings then picks one of the most-significant terms at random to use in its tweet. But we're not finished yet...
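The division described above can be sketched in Python as follows. This follows the plain count-divided-by-frequency description in this write-up, not the textbook tf-idf formula, which usually takes a logarithm of the inverse document frequency. The `significant_terms` name, the `floor` used for words missing from the chart, the `top_n` cutoff, and all sample numbers except the 12% figure for "graphics" are assumptions.

```python
import random


def significant_terms(counts, doc_freq, top_n=10, floor=0.001):
    """Rank words by word count divided by document frequency, highest first."""
    scores = {
        # floor avoids dividing by zero for words absent from the chart
        word: count / max(doc_freq.get(word, floor), floor)
        for word, count in counts.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]


# Hypothetical counts for one game and document frequencies for all of Steam.
counts = {"the": 40, "zombies": 9, "graphics": 6}
doc_freq = {"the": 0.9, "zombies": 0.01, "graphics": 0.12}

top = significant_terms(counts, doc_freq, top_n=2)
term = random.choice(top)  # pick one of the most significant terms at random
```

Here "the" has by far the highest raw count but the lowest score, because it appears in nearly every review, so it drops out of the top two.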
CRAFTING A TWEET
We have the name of the game, and a word associated with it. But how do we combine the two? The answer is that it depends on what kind of word we're dealing with. Adjectives must be treated differently from nouns, etc. I tried several different part-of-speech APIs to classify words, but I ran into difficulty with words such as "good" which can be used in many different ways. Ultimately, I settled on using Dictionary.com's web interface, which works pretty well even though it was not intended for this purpose.
Once we've classified our word, it's simply a matter of slotting it into a mad-libs style phrase and posting it to Twitter using TweetSharp. For example, "[Game Name] sure is [Adjective]!"
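A minimal Python sketch of the template step, assuming the part of speech has already been determined. Only the adjective phrase comes from the example above; the noun phrase, the `TEMPLATES` table, and the `craft_tweet` helper are invented for illustration, and posting through TweetSharp is left out.

```python
import random

# One list of mad-libs phrases per part of speech.
TEMPLATES = {
    "adjective": ["{game} sure is {word}!"],
    # Hypothetical noun phrase, not taken from the bot.
    "noun": ["I hear {game} has a lot to say about {word}."],
}


def craft_tweet(game, word, part_of_speech, rng=random):
    """Fill a random template for this part of speech, or return None."""
    templates = TEMPLATES.get(part_of_speech)
    if templates is None:
        return None  # no phrase available for this kind of word
    return rng.choice(templates).format(game=game, word=word)


tweet = craft_tweet("Game A", "spooky", "adjective")
```

Returning None for an unsupported part of speech lets the caller fall back to another significant term instead of posting an ungrammatical tweet.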
Special thanks to Daniel Crenna who wrote the open-source TweetSharp library which SteamMusings uses to post to Twitter.