Recommendation
Analytics Zoo provides three Recommenders: the Wide and Deep (WND) model, the Neural network-based Collaborative Filtering (NCF) model and the Session Recommender model. They are easy-to-use Keras-style models that provide compile and fit methods for training. Alternatively, they can be fed into NNFrames or a BigDL Optimizer.
WND and NCF recommenders can handle either explicit or implicit feedback, given corresponding features.
We also provide user-friendly APIs to predict user-item pairs, and to recommend items (users) for users (items). See here for more details.
Wide and Deep
Wide and Deep Learning, proposed by Google in 2016, is a DNN-Linear mixed model that combines the strengths of memorization and generalization. It is useful for generic large-scale regression and classification problems with sparse input features (e.g., categorical features with a large number of possible feature values), and has been used in the Google Play store for app recommendation.
After training the model, users can use the model to do prediction and recommendation.
Scala
val wideAndDeep = WideAndDeep(modelType = "wide_n_deep", numClasses, columnInfo, hiddenLayers = Array(40, 20, 10))
modelType
: String. "wide", "deep" and "wide_n_deep" are supported. Default is "wide_n_deep".
numClasses
: The number of classes. Positive integer.
columnInfo
: An instance of ColumnFeatureInfo.
hiddenLayers
: Units of hidden layers for the deep model. Array of positive integers. Default is Array(40, 20, 10).
See here for the Scala example that trains the WideAndDeep model on MovieLens 1M dataset and uses the model to do prediction and recommendation.
Python
wide_and_deep = WideAndDeep(class_num, column_info, model_type="wide_n_deep", hidden_layers=(40, 20, 10))
class_num
: The number of classes. Positive int.
column_info
: An instance of ColumnFeatureInfo.
model_type
: String. 'wide', 'deep' and 'wide_n_deep' are supported. Default is 'wide_n_deep'.
hidden_layers
: Units of hidden layers for the deep model. Tuple of positive int. Default is (40, 20, 10).
See here for the Python notebook that trains the WideAndDeep model on MovieLens 1M dataset and uses the model to do prediction and recommendation.
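Conceptually, the "wide_n_deep" mode sums the logits of a linear model over the wide features and an MLP over the deep features before the final softmax. The following is a minimal NumPy sketch of that forward pass; the toy dimensions and random weights are illustrative assumptions, not the library's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions for illustration only).
wide_dim, deep_in, num_classes = 8, 6, 2
hidden = [40, 20, 10]   # the hiddenLayers default from the docs

x_wide = rng.random((4, wide_dim))   # sparse cross-product features, densified here
x_deep = rng.random((4, deep_in))    # embedded/continuous features for the deep part

def relu(z):
    return np.maximum(z, 0.0)

# Wide part: a single linear layer over the wide features.
W_wide = rng.standard_normal((wide_dim, num_classes)) * 0.1
wide_logits = x_wide @ W_wide

# Deep part: an MLP with the (40, 20, 10) hidden units.
h = x_deep
for units in hidden:
    W = rng.standard_normal((h.shape[1], units)) * 0.1
    h = relu(h @ W)
W_out = rng.standard_normal((hidden[-1], num_classes)) * 0.1
deep_logits = h @ W_out

# "wide_n_deep": the two branches are summed before the softmax.
logits = wide_logits + deep_logits
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(probs.shape)  # (4, 2)
```

The "wide" and "deep" model types correspond to using only one of the two branches.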
Neural network-based Collaborative Filtering
NCF (He, 2015) leverages a multi-layer perceptron to learn the user-item interaction function. At the same time, NCF can express and generalize matrix factorization under its framework. An includeMF option (Boolean) is provided for users to build a NeuralCF model with or without matrix factorization.
After training the model, users can use the model to do prediction and recommendation.
Scala
val ncf = NeuralCF(userCount, itemCount, numClasses, userEmbed = 20, itemEmbed = 20, hiddenLayers = Array(40, 20, 10), includeMF = true, mfEmbed = 20)
userCount
: The number of users. Positive integer.
itemCount
: The number of items. Positive integer.
numClasses
: The number of classes. Positive integer.
userEmbed
: Units of user embedding. Positive integer. Default is 20.
itemEmbed
: Units of item embedding. Positive integer. Default is 20.
hiddenLayers
: Units of hidden layers for MLP. Array of positive integers. Default is Array(40, 20, 10).
includeMF
: Whether to include Matrix Factorization. Boolean. Default is true.
mfEmbed
: Units of matrix factorization embedding. Positive integer. Default is 20.
See here for the Scala example that trains the NeuralCF model on MovieLens 1M dataset and uses the model to do prediction and recommendation.
Python
ncf = NeuralCF(user_count, item_count, class_num, user_embed=20, item_embed=20, hidden_layers=(40, 20, 10), include_mf=True, mf_embed=20)
user_count
: The number of users. Positive int.
item_count
: The number of items. Positive int.
class_num
: The number of classes. Positive int.
user_embed
: Units of user embedding. Positive int. Default is 20.
item_embed
: Units of item embedding. Positive int. Default is 20.
hidden_layers
: Units of hidden layers for MLP. Tuple of positive int. Default is (40, 20, 10).
include_mf
: Whether to include Matrix Factorization. Boolean. Default is True.
mf_embed
: Units of matrix factorization embedding. Positive int. Default is 20.
See here for the Python notebook that trains the NeuralCF model on MovieLens 1M dataset and uses the model to do prediction and recommendation.
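Conceptually, NeuralCF concatenates user and item embeddings for an MLP branch and, when includeMF is true, appends the element-wise product of a separate pair of MF embeddings before the output layer. Below is a minimal NumPy sketch of that forward pass; the random weights and toy sizes are assumptions for illustration, not the library's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

user_count, item_count = 100, 50
user_embed = item_embed = mf_embed = 20
hidden = [40, 20, 10]
num_classes = 5
include_mf = True

# Embedding tables (randomly initialised for the sketch).
U_mlp = rng.standard_normal((user_count, user_embed)) * 0.1
I_mlp = rng.standard_normal((item_count, item_embed)) * 0.1
U_mf = rng.standard_normal((user_count, mf_embed)) * 0.1
I_mf = rng.standard_normal((item_count, mf_embed)) * 0.1

users = np.array([0, 1, 2])
items = np.array([3, 4, 5])

def relu(z):
    return np.maximum(z, 0.0)

# MLP branch: concatenate user and item embeddings, pass through hidden layers.
h = np.concatenate([U_mlp[users], I_mlp[items]], axis=1)
for units in hidden:
    W = rng.standard_normal((h.shape[1], units)) * 0.1
    h = relu(h @ W)

# Optional MF branch (includeMF): element-wise product of MF embeddings.
if include_mf:
    h = np.concatenate([h, U_mf[users] * I_mf[items]], axis=1)

W_out = rng.standard_normal((h.shape[1], num_classes)) * 0.1
logits = h @ W_out
print(logits.shape)  # (3, 5)
```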
Session Recommender
Session Recommender (Hidasi, 2015) uses an RNN-based approach for session-based recommendations. The model was enhanced at NetEase (Wu, 2016) by adding multiple layers to model users' purchase history. In Analytics Zoo, an includeHistory option (Boolean) is provided for users to build a SessionRecommender model with or without history.
After training the model, users can use the model to do prediction and recommendation.
Scala
val sessionRecommender = SessionRecommender(itemCount, itemEmbed, sessionLength, includeHistory, mlpHiddenLayers, historyLength)
itemCount
: The number of distinct items. Positive integer.
itemEmbed
: The output size of the embedding layer. Positive integer.
sessionLength
: The max number of items in the sequence of a session. Positive integer.
rnnHiddenLayers
: Units of hidden layers for the RNN model. Array of positive integers.
includeHistory
: Whether to include purchase history. Boolean. Default is true.
mlpHiddenLayers
: Units of hidden layers for the MLP model. Array of positive integers.
historyLength
: The max number of items in the sequence of historical purchases. Positive integer.
See here for the Scala example that trains the SessionRecommender model on an ecommerce dataset provided by OfficeDepot and uses the model to do prediction and recommendation.
Python
session_recommender=SessionRecommender(item_count, item_embed, rnn_hidden_layers=[40, 20], session_length=10, include_history=True, mlp_hidden_layers=[40, 20], history_length=5)
item_count
: The number of distinct items. Positive int.
item_embed
: The output size of the embedding layer. Positive int.
rnn_hidden_layers
: Units of hidden layers for the RNN model. List of positive int.
session_length
: The max number of items in the sequence of a session. Positive int.
include_history
: Whether to include purchase history. Boolean. Default is True.
mlp_hidden_layers
: Units of hidden layers for the MLP model. List of positive int.
history_length
: The max number of items in the sequence of historical purchases. Positive int.
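To give a feel for the data flow, the sketch below scores a padded session against all items and picks a top-k. A simple mean-pooled embedding stands in for the real model's RNN state, and all sizes and ids are made up for illustration; this is not the SessionRecommender implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

item_count, item_embed = 20, 8

# Embedding table; row 0 is reserved for padding.
E = rng.standard_normal((item_count + 1, item_embed)) * 0.1

# A padded session of length 10: the items a user recently interacted with
# (1-based ids, 0 = padding).
session = np.array([3, 7, 7, 12, 0, 0, 0, 0, 0, 0])

# Stand-in for the RNN: mean-pool the embeddings of the non-padding items.
mask = session > 0
state = E[session][mask].mean(axis=0)

# Score every real item against the session state and take the top 3.
scores = E[1:] @ state                      # one score per item
top = np.argsort(scores)[::-1][:3] + 1      # back to 1-based item ids
print(top)
```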
Prediction and Recommendation
Predict for user-item pairs
Give a prediction for each user-item pair. Return RDD of UserItemPrediction.
Scala
predictUserItemPair(featureRdd)
Python
predict_user_item_pair(feature_rdd)
Parameters:
featureRdd
: RDD of UserItemFeature.
Recommend for users
Recommend a number of items for each user. Return RDD of UserItemPrediction. Only works for WND and NCF.
Scala
recommendForUser(featureRdd, maxItems)
Python
recommend_for_user(feature_rdd, max_items)
Parameters:
featureRdd
: RDD of UserItemFeature.
maxItems
: The number of items to be recommended to each user. Positive integer.
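The behaviour of recommendForUser can be sketched in plain Python: group the per-pair predictions by user, then keep the maxItems highest-probability items for each. This is an illustrative re-implementation, not the library's RDD-based code:

```python
from collections import defaultdict

# Toy predictions: (user_id, item_id, probability), one per user-item pair.
predictions = [
    (1, 10, 0.9), (1, 11, 0.4), (1, 12, 0.7),
    (2, 10, 0.2), (2, 13, 0.8),
]

def recommend_for_user(predictions, max_items):
    """Group predictions by user and keep the max_items highest-probability items."""
    by_user = defaultdict(list)
    for user, item, prob in predictions:
        by_user[user].append((item, prob))
    return {
        user: sorted(pairs, key=lambda p: p[1], reverse=True)[:max_items]
        for user, pairs in by_user.items()
    }

print(recommend_for_user(predictions, 2))
# {1: [(10, 0.9), (12, 0.7)], 2: [(13, 0.8), (10, 0.2)]}
```

recommendForItem is symmetric, grouping by item and ranking users instead.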
Recommend for items
Recommend a number of users for each item. Return RDD of UserItemPrediction. Only works for WND and NCF.
Scala
recommendForItem(featureRdd, maxUsers)
Python
recommend_for_item(feature_rdd, max_users)
Parameters:
featureRdd
: RDD of UserItemFeature.
maxUsers
: The number of users to be recommended to each item. Positive integer.
Recommend for sessions
Recommend a number of items for each sequence. Return the corresponding recommendations, each of which contains a sequence of (item, probability) pairs. Only works for Session Recommender.
Scala
recommendForSession(sessions, maxItems, zeroBasedLabel)
Python
recommend_for_session(sessions, max_items, zero_based_label)
Parameters:
sessions
: RDD or Array of samples.
maxItems
: Number of items to be recommended to each user. Positive integer.
zeroBasedLabel
: True if item ids in the data start from 0, False if they start from 1.
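The zeroBasedLabel flag only affects how the model's zero-based output indices map back to item ids. A hypothetical helper illustrating that offset (not part of the library API):

```python
def to_item_ids(top_indices, zero_based_label):
    """Map zero-based model output indices back to item ids.

    If item ids in the data start from 1 (zero_based_label=False), the
    model's output indices are shifted up by one; otherwise they are
    already the item ids.
    """
    offset = 0 if zero_based_label else 1
    return [idx + offset for idx in top_indices]

print(to_item_ids([0, 4, 2], zero_based_label=False))  # [1, 5, 3]
print(to_item_ids([0, 4, 2], zero_based_label=True))   # [0, 4, 2]
```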
Model Save
After building and training a WideAndDeep, NeuralCF or SessionRecommender model, you can save it for future use.
Scala
wideAndDeep.saveModel(path, weightPath = null, overWrite = false)
ncf.saveModel(path, weightPath = null, overWrite = false)
sessionRecommender.saveModel(path, weightPath = null, overWrite = false)
path
: The path to save the model. Local file system, HDFS and Amazon S3 are supported. HDFS path should be like "hdfs://[host]:[port]/xxx". Amazon S3 path should be like "s3a://bucket/xxx".
weightPath
: The path to save weights. Default is null.
overWrite
: Whether to overwrite the file if it already exists. Default is false.
Python
wide_and_deep.save_model(path, weight_path=None, over_write=False)
ncf.save_model(path, weight_path=None, over_write=False)
session_recommender.save_model(path, weight_path=None, over_write=False)
path
: The path to save the model. Local file system, HDFS and Amazon S3 are supported. HDFS path should be like 'hdfs://[host]:[port]/xxx'. Amazon S3 path should be like 's3a://bucket/xxx'.
weight_path
: The path to save weights. Default is None.
over_write
: Whether to overwrite the file if it already exists. Default is False.
Model Load
To load a WideAndDeep, NeuralCF or SessionRecommender model (with weights) saved above:
Scala
WideAndDeep.loadModel[Float](path, weightPath = null)
NeuralCF.loadModel[Float](path, weightPath = null)
SessionRecommender.loadModel[Float](path, weightPath = null)
path
: The path for the pre-defined model. Local file system, HDFS and Amazon S3 are supported. HDFS path should be like "hdfs://[host]:[port]/xxx". Amazon S3 path should be like "s3a://bucket/xxx".
weightPath
: The path for pre-trained weights if any. Default is null.
Python
WideAndDeep.load_model(path, weight_path=None)
NeuralCF.load_model(path, weight_path=None)
SessionRecommender.load_model(path, weight_path=None)
path
: The path for the pre-defined model. Local file system, HDFS and Amazon S3 are supported. HDFS path should be like 'hdfs://[host]:[port]/xxx'. Amazon S3 path should be like 's3a://bucket/xxx'.
weight_path
: The path for pre-trained weights if any. Default is None.
UserItemFeature
Represent records of user-item pairs with features.
Each record should contain the following fields:
userId
: Positive integer.
itemId
: Positive integer.
sample
: A Sample which consists of feature(s) and label(s).
Scala
UserItemFeature(userId, itemId, sample)
Python
UserItemFeature(user_id, item_id, sample)
UserItemPrediction
Represent the prediction results of user-item pairs.
Each prediction record will contain the following information:
userId
: Positive integer.
itemId
: Positive integer.
prediction
: The prediction (rating) for the user on the item.
probability
: The probability for the prediction.
Scala
UserItemPrediction(userId, itemId, prediction, probability)
Python
UserItemPrediction(user_id, item_id, prediction, probability)
ColumnFeatureInfo
An instance of ColumnFeatureInfo
contains the same data information shared by the WideAndDeep
model and its feature generation part.
You can choose to include the following information for feature engineering and the WideAndDeep
model:
wideBaseCols
: Data of wideBaseCols together with wideCrossCols will be fed into the wide model.
wideBaseDims
: Dimensions of wideBaseCols. The dimensions of the data in wideBaseCols should be within the range of wideBaseDims.
wideCrossCols
: Data of wideCrossCols will be fed into the wide model.
wideCrossDims
: Dimensions of wideCrossCols. The dimensions of the data in wideCrossCols should be within the range of wideCrossDims.
indicatorCols
: Data of indicatorCols will be fed into the deep model as multi-hot vectors.
indicatorDims
: Dimensions of indicatorCols. The dimensions of the data in indicatorCols should be within the range of indicatorDims.
embedCols
: Data of embedCols will be fed into the deep model as embeddings.
embedInDims
: Input dimensions of the data in embedCols. The dimensions of the data in embedCols should be within the range of embedInDims.
embedOutDims
: The dimensions of embeddings for embedCols.
continuousCols
: Data of continuousCols will be treated as continuous values for the deep model.
label
: The name of the label column. String. Default is "label".
Remark:
Fields that involve Cols
should be an array of String (Scala) or a list of String (Python) indicating the name of the columns in the data.
Fields that involve Dims
should be an array of integers (Scala) or a list of integers (Python) indicating the dimensions of the corresponding columns.
If any field is not specified, it will default to an empty array (Scala) or an empty list (Python).
Scala
ColumnFeatureInfo(
wideBaseCols = Array[String](),
wideBaseDims = Array[Int](),
wideCrossCols = Array[String](),
wideCrossDims = Array[Int](),
indicatorCols = Array[String](),
indicatorDims = Array[Int](),
embedCols = Array[String](),
embedInDims = Array[Int](),
embedOutDims = Array[Int](),
continuousCols = Array[String](),
label = "label")
Python
ColumnFeatureInfo(
wide_base_cols=None,
wide_base_dims=None,
wide_cross_cols=None,
wide_cross_dims=None,
indicator_cols=None,
indicator_dims=None,
embed_cols=None,
embed_in_dims=None,
embed_out_dims=None,
continuous_cols=None,
label="label")
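To make the Cols/Dims pairing concrete: values of an indicator column are turned into multi-hot vectors of size indicatorDims, while embed columns index an embedding table of shape (embedInDims, embedOutDims). A small NumPy sketch with made-up column names and sizes (not the library's feature-generation code):

```python
import numpy as np

# Hypothetical setup: one indicator column "genre" with dimension 4,
# and one embed column "userId" with input dimension 100, output dimension 8.
indicator_dims = {"genre": 4}
embed_in_dims = {"userId": 100}
embed_out_dims = {"userId": 8}

def indicator_vector(value, dim):
    """Multi-hot (here one-hot) encoding for an indicator column value."""
    assert 0 <= value < dim, "indicator values must stay within indicatorDims"
    v = np.zeros(dim)
    v[value] = 1.0
    return v

rng = np.random.default_rng(0)
embedding_table = rng.standard_normal(
    (embed_in_dims["userId"], embed_out_dims["userId"]))

genre_vec = indicator_vector(2, indicator_dims["genre"])  # fed to the deep model directly
user_vec = embedding_table[42]                            # embedCols go through a lookup

print(genre_vec)         # [0. 0. 1. 0.]
print(user_vec.shape)    # (8,)
```

This also shows why each value must stay within the declared dimension: it is used directly as an index.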