FAQ and Known Issues

Py4JJavaError: An error occurred while calling z:org.apache.spark.bigdl.api.python.BigDLSerDe.loads.
: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)

You may need to check whether your input arguments involve NumPy types (such as numpy.int64). See here for the related issue.

For example, invoking np.min, np.max, np.unique, etc. returns values of type numpy.int64. One way to solve this is to use int() to convert a numpy.int64 value to a Python int.
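A minimal sketch of the problem and the fix: NumPy reductions return NumPy scalar types rather than plain Python numbers, and wrapping the result in int() is enough to get a serializable Python int.

```python
import numpy as np

values = np.array([3, 1, 2], dtype=np.int64)

m = np.max(values)   # a numpy.int64, not a Python int
print(type(m))       # <class 'numpy.int64'>

m_py = int(m)        # convert before passing the value to BigDL APIs
print(type(m_py))    # <class 'int'>
```

The same int() (or float()) conversion applies to any scalar produced by np.min, np.unique, indexing into an ndarray, and similar operations.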

For example, one of our customers changed the NCF recommender's preprocessing to match their data:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("xxx.csv")  # The header is user, item, label; the types are string, string, int.
user_indexer = StringIndexer(inputCol='user', outputCol='user_index', handleInvalid="skip")
item_indexer = StringIndexer(inputCol='item', outputCol='item_index', handleInvalid="skip")
pipe = Pipeline(stages=[user_indexer, item_indexer])
pipe_fit = pipe.fit(df)
df = pipe_fit.transform(df)
train_data = df.select('user_index', 'item_index', 'label')

Then they use a map to transform train_data into an RDD[Sample]:

import numpy as np

def build_sample(user_id, item_id, rating):
    sample = Sample.from_ndarray(np.array([user_id, item_id]), np.array([rating]))
    return UserItemFeature(user_id, item_id, sample)

pairFeatureRdds = train_data.rdd.map(lambda x: build_sample(x[0], x[1], x[2] - 1))

Executing pairFeatureRdds.count() took about 12 hours on a dataset of 6,000,000 records.
This appears to be a bug related to StringIndexer, but there is a good workaround: before transforming user_index, item_index, and label into UserItemFeature, cast the Double indices to Float, like this:

from pyspark.sql.types import FloatType

train_data = train_data.withColumn("user_index", train_data["user_index"].cast(FloatType()))
train_data = train_data.withColumn("item_index", train_data["item_index"].cast(FloatType()))

The job then finishes in about 30 seconds, and the heavy GC activity disappears as well.