Beer Recommender

The data

For this example, we'll use data from Beer Advocate, a community of beer enthusiasts and industry professionals dedicated to supporting and promoting beer. You can download the data here.

Each record contains a beer's name, brewery, and metadata such as style and ABV, along with ratings provided by reviewers. Beers are graded on appearance, aroma, palate, and taste, and users also provide an "overall" grade. All ratings are on a scale from 1 to 5, with 5 being the best.

Formatting the Data

First we'll read in our beer_reviews.csv file:

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import pprint as pp
import os

filename = os.path.join("beer_reviews.csv")
df = pd.read_csv(filename)
# let's limit things to the top 250
n = 250
top_n = df.beer_name.value_counts().index[:n]
df = df[df.beer_name.isin(top_n)]
And the equivalent in R:

library(reshape2)
library(plyr)

df <- read.csv("./data/beer_reviews.csv")
n <- 250
beer_counts <- table(df$beer_name)
beers <- names(beer_counts[beer_counts > n])
df <- df[df$beer_name %in% beers,]

Calling head() returns the following:

df.head()
      brewery_id             brewery_name  review_time  review_overall  \
798         1075  Caldera Brewing Company   1212201268             4.5
1559       11715  Destiny Brewing Company   1137124057             4.0
1560       11715  Destiny Brewing Company   1129504403             4.0
1563       11715  Destiny Brewing Company   1137125989             3.5
1564       11715  Destiny Brewing Company   1130936611             3.0

      review_aroma  review_appearance review_profilename  \
798            4.5                  4             grumpy
1559           3.5                  4    blitheringidiot
1560           2.5                  4        NeroFiddled
1563           3.0                  4    blitheringidiot
1564           3.0                  3             Gavage

                            beer_style  review_palate  review_taste  \
798   American Double / Imperial Stout            4.0           4.5
1559           American Pale Ale (APA)            3.5           3.5
1560           American Pale Ale (APA)            4.0           3.5
1563                      American IPA            4.0           4.0
1564                      American IPA            4.0           3.5

           beer_name  beer_abv  beer_beerid
798   Imperial Stout       NaN        42964
1559        Pale Ale       4.5        26420
1560        Pale Ale       4.5        26420
1563             IPA       NaN        26132
1564             IPA       NaN        26132
And in R:

head(df)

    brewery_id            brewery_name review_time review_overall review_aroma review_appearance review_profilename   beer_style review_palate review_taste
11         163  Amstel Brouwerij B. V.  1010963392            3.0            2                 3            fodeeoz  Light Lager           2.5          2.5
19         163  Amstel Brouwerij B. V.  1010861086            2.5            3                 3             jdhilt  Light Lager           2.0          2.0
31         163  Amstel Brouwerij B. V.  1002109880            3.0            2                 2          xXTequila  Light Lager           2.0          3.0
41         163  Amstel Brouwerij B. V.   988202869            3.0            3                 3              Brent  Light Lager           2.0          2.0
258       1075 Caldera Brewing Company  1272945129            4.0            4                 4              Akfan American IPA           4.0          4.5
266       1075 Caldera Brewing Company  1324238653            4.0            4                 4          coldriver American IPA           4.0          4.5
       beer_name beer_abv beer_beerid
11  Amstel Light      3.5         436
19  Amstel Light      3.5         436
31  Amstel Light      3.5         436
41  Amstel Light      3.5         436
258  Caldera IPA      6.1       10784
266  Caldera IPA      6.1       10784

Let's continue by aggregating our data.

print("melting...")
df_wide = pd.pivot_table(df, values=["review_overall"],
                         index=["beer_name", "review_profilename"],
                         aggfunc="mean").unstack()

# any cells that are missing data (i.e. a user didn't review a particular beer)
# we're going to set to 0
df_wide = df_wide.fillna(0)
And in R:

df.wide <- dcast(df, beer_name ~ review_profilename,
            value.var='review_overall', mean, fill=0)
head(df.wide)

dists <- dist(df.wide[,-1], method="euclidean")
dists <- as.data.frame(as.matrix(dists))
colnames(dists) <- df.wide$beer_name
dists$beer_name <- df.wide$beer_name
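If the pivot step is unfamiliar, here's a minimal sketch on made-up data (the beer and reviewer names are invented for illustration) showing how pivot_table turns long-format reviews into a beer × reviewer matrix:

```python
import pandas as pd

# made-up long-format reviews, for illustration only
reviews = pd.DataFrame({
    "beer_name": ["IPA", "IPA", "Stout"],
    "review_profilename": ["alice", "bob", "alice"],
    "review_overall": [4.5, 4.0, 3.5],
})

# one row per beer, one column per reviewer; cells where a
# reviewer never rated that beer become 0 after fillna
wide = pd.pivot_table(reviews, values="review_overall",
                      index="beer_name",
                      columns="review_profilename",
                      aggfunc="mean").fillna(0)
print(wide)
# wide.loc["IPA", "alice"] is 4.5; wide.loc["Stout", "bob"] is 0.0
```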

Calculating the Distance

The goal of our system is for a user to give us a beer they know and love, and for us to recommend a new beer they might like. To accomplish this, we're going to use cosine similarity.
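To build some intuition first, here's a toy example (the rating vectors are made up, not from the dataset) showing how cosine_similarity compares beers by their user-rating vectors:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# toy example: each row is a beer, each column a reviewer's overall rating
# (0 means that reviewer never rated the beer)
ratings = np.array([
    [4.5, 4.0, 0.0, 5.0],   # beer A
    [4.0, 4.5, 0.0, 4.5],   # beer B
    [0.0, 1.0, 5.0, 0.0],   # beer C
])

sims = cosine_similarity(ratings)
# beers A and B are rated similarly by the same users, so their cosine
# similarity is close to 1; beer C is liked by a different crowd, so its
# similarity to A and B is much lower
print(np.round(sims, 2))
```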

# this is the key. we're going to use cosine_similarity from scikit-learn
# to compute the pairwise similarity between all beers
print("calculating similarity")
dists = cosine_similarity(df_wide)

# stuff the similarity matrix into a dataframe so it's easier to operate on
dists = pd.DataFrame(dists, columns=df_wide.index)

# give the indices (equivalent to rownames in R) the beer names as well
dists.index = dists.columns

def get_sims(beers):
    """
    get_sims takes a list of beer names and, using the similarity matrix
    computed above, returns the other beers ranked by their summed
    similarity to the beers provided.

    beers - a list of beer names
    """
    p = dists[beers].sum(axis=1)
    p = p.sort_values(ascending=False)
    return p.index[~p.index.isin(beers)]

get_sims(["Sierra Nevada Pale Ale", "120 Minute IPA", "Coors Light"])
# Index([u'Samuel Adams Boston Lager', u'Sierra Nevada Celebration Ale', u'90 Minute IPA', u'Arrogant Bastard Ale', u'Stone IPA (India Pale Ale)', u'60 Minute IPA', u'HopDevil Ale', u'Stone Ruination IPA', u'Sierra Nevada Bigfoot Barleywine Style Ale', u'Storm King Stout', u'Samuel Adams Winter Lager', u'Samuel Adams Summer Ale', u'Prima Pils', u'Anchor Steam Beer', u'Old Rasputin Russian Imperial Stout', u'Samuel Adams Octoberfest', ...], dtype='object')
And in R:

getSimilarBeers <- function(beers_i_like) {
  beers_i_like <- as.character(beers_i_like)
  cols <- c("beer_name", beers_i_like)
  best.beers <- dists[,cols]
  if (ncol(best.beers) > 2) {
    best.beers <- data.frame(beer_name=best.beers$beer_name, V1=rowSums(best.beers[,-1]))
  }
  results <- best.beers[order(best.beers[,-1]),]
  names(results) <- c("beer_name", "similarity")
  results[! results$beer_name %in% beers_i_like,]
}
getSimilarBeers(c("Coors Light"))

Wrap it in ŷhat

Now that we have a model, we're going to use the ŷhat client to deploy it as a web service.

Defining our code

In this example, we're going to invoke our model and then format the response into a dictionary. The incoming request is formatted as:

{
    "beers": [
      "Sierra Nevada Pale Ale",
      "120 Minute IPA",
      "Stone Ruination IPA"
    ]
}

Add in the ScienceOps part:

Structure your model so that it will handle incoming data and then pass it to the appropriate function for making a recommendation.

from yhat import Yhat, YhatModel, preprocess

class BeerRecommender(YhatModel):
    @preprocess(in_type=dict, out_type=dict)
    def execute(self, data):
        beers = data.get("beers", [])
        # beer names may contain non-ASCII characters (e.g. "Rauch Ür Bock");
        # Python 3 strings handle these natively, so no encoding step is needed
        suggested_beers = get_sims(beers)
        result = [{"beer": beer} for beer in suggested_beers]
        # out_type is dict, so wrap the list in a dictionary
        return {"beers": result}
And in R:

model.predict <- function(df) {
  getSimilarBeers(df$beers)
}

Test your model locally

We recommend you test your model locally before deploying. After all, if it doesn't work locally, it won't work on ScienceOps!

BeerRecommender().execute({"beers": ["Sierra Nevada Pale Ale","120 Minute IPA", "Stone Ruination IPA"]})
And in R:

testdata <- list(beers=c("Sierra Nevada Pale Ale","120 Minute IPA", "Stone Ruination IPA"))

model.predict(testdata)

Deploy your model to ScienceOps

The hard part is over. Now that we have our own model, it's time to deploy it! Create a connection to the ScienceOps server, then execute the deploy function.

yh = Yhat("USERNAME", "APIKEY", "https://sandbox.yhathq.com")
yh.deploy("BeerRecommender", BeerRecommender, globals())
# {"status": "success"}
And in R:

library(yhatr)
yhat.config <- c(
    username="USERNAME",
    apikey="API_KEY",
    env="https://sandbox.yhathq.com/"
)
yhat.deploy("BeerRecommender")

Get Predictions via the REST API

You can request predictions from the model in the languages below, as well as in others covered on the REST API page.

yh.predict("BeerRecommender",{"beers": ["Sierra Nevada Pale Ale","120 Minute IPA", "Stone Ruination IPA"]})
And in R:

testdata <- list(beers=c("Sierra Nevada Pale Ale","120 Minute IPA", "Stone Ruination IPA"))
yhat.predict("BeerRecommender", testdata)
yhat.predict("BeerRecommender", testdata)
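You can also hit the endpoint directly over HTTP. Below is a sketch using only Python's standard library; note that the endpoint path is an assumption patterned on the sandbox URL from the deploy step, so consult the REST API page for the exact format before using it:

```python
import base64
import json
from urllib import request

# hypothetical credentials; replace with your own
username, apikey = "USERNAME", "APIKEY"
# assumed URL pattern -- verify against the REST API page
url = "https://sandbox.yhathq.com/%s/models/BeerRecommender/" % username

payload = json.dumps({
    "beers": ["Sierra Nevada Pale Ale", "120 Minute IPA", "Stone Ruination IPA"]
}).encode("utf8")

# HTTP basic auth with your username and API key
auth = base64.b64encode(("%s:%s" % (username, apikey)).encode("utf8")).decode()
req = request.Request(url, data=payload, headers={
    "Content-Type": "application/json",
    "Authorization": "Basic " + auth,
})

# uncomment to actually send the request:
# resp = request.urlopen(req)
# print(json.loads(resp.read()))
```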
