Run


You need to first install analytics-zoo, either from pip or without pip.

NOTE: We have tested on Python 3.6 and Python 3.7. Support for Python 2.7 has been removed due to its end of life.


Run after pip install

Important:

  1. Installing analytics-zoo from pip will automatically install pyspark. To avoid possible conflicts, it is highly recommended that you unset SPARK_HOME if it exists in your environment (a shell sketch follows the code snippet below).

  2. After pip install, always call init_nncontext() at the very beginning of your code. This will create a SparkContext with an optimized performance configuration and initialize the BigDL engine.

from zoo.common.nncontext import *
sc = init_nncontext()
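
For reference, a minimal shell sketch of the pip-based setup described in point 1 above (assuming you install the analytics-zoo package from PyPI):

# Install Analytics Zoo from pip; this pulls in pyspark as a dependency
pip install analytics-zoo

# Unset SPARK_HOME to avoid conflicts with the pip-installed pyspark
unset SPARK_HOME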

Use an Interactive Shell

Type python in the command line to start the Python interactive shell, then run the example code at the end of this page for verification.

Use Jupyter Notebook

Start Jupyter Notebook as you normally would, e.g.

jupyter notebook --notebook-dir=./ --ip=* --no-browser

Configurations

You can adjust the configurations according to your cluster settings.

To increase the driver memory:

export SPARK_DRIVER_MEMORY=20g

To add extra jars or Python packages, set the environment variables BIGDL_JARS and BIGDL_PACKAGES BEFORE creating SparkContext:

export BIGDL_JARS=...
export BIGDL_PACKAGES=...
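
For illustration only, a hypothetical example (these paths are made up; replace them with your actual extra jars and Python package locations):

# Hypothetical paths to extra dependencies
export BIGDL_JARS=/opt/libs/extra-dependency.jar
export BIGDL_PACKAGES=/opt/libs/extra_python_packages.zip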

Run on Yarn after pip install

You should use init_spark_on_yarn rather than init_nncontext() here to create a SparkContext on Yarn.

Start python and then execute the following code:

from zoo import init_spark_on_yarn

sc = init_spark_on_yarn(
    hadoop_conf="path to the yarn configuration folder",
    conda_name="zoo", # The name of the created conda-env
    num_executors=2,
    executor_cores=4,
    executor_memory="8g",
    driver_memory="2g",
    driver_cores=4,
    extra_executor_memory_for_ray="10g")

Run without pip install

Set SPARK_HOME and ANALYTICS_ZOO_HOME

If you downloaded the prebuilt Analytics Zoo release package:

export SPARK_HOME=the root directory of Spark
export ANALYTICS_ZOO_HOME=the folder where you extracted the analytics-zoo package

If you built Analytics Zoo from source:

export SPARK_HOME=the root directory of Spark
export ANALYTICS_ZOO_HOME=the dist directory of Analytics Zoo

Update spark-analytics-zoo.conf (Optional)

If you have customized properties that you would otherwise pass to spark-submit/pyspark via the --properties-file option, you can add them to ${ANALYTICS_ZOO_HOME}/conf/spark-analytics-zoo.conf instead.
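
For example, a minimal sketch of adding one property (spark.driver.maxResultSize and its value are purely illustrative):

# Append a custom Spark property to the Analytics Zoo conf file
echo "spark.driver.maxResultSize 4g" >> ${ANALYTICS_ZOO_HOME}/conf/spark-analytics-zoo.conf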


Run with pyspark

${ANALYTICS_ZOO_HOME}/bin/pyspark-shell-with-zoo.sh --master local[*]

You can also specify other options available for pyspark in the above command if needed.
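
For example, a sketch that also increases the driver memory (the value is illustrative):

${ANALYTICS_ZOO_HOME}/bin/pyspark-shell-with-zoo.sh --master local[*] --driver-memory 10g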

Try running the example code at the end of this page for verification.


Run with spark-submit

An Analytics Zoo Python program runs as a standard pyspark program, which requires all Python dependencies (e.g., numpy) used by the program to be installed on each node in the Spark cluster. You can try running the Analytics Zoo Object Detection Python example as follows:

${ANALYTICS_ZOO_HOME}/bin/spark-submit-python-with-zoo.sh --master local[*] predict.py model_path image_path output_path

Run with Jupyter Notebook

With the full Python API support in Analytics Zoo, users can use our package together with powerful notebooks (such as Jupyter Notebook) in a distributed fashion across the cluster, combining Python libraries, Spark SQL/DataFrames and MLlib, deep learning models in Analytics Zoo, as well as interactive visualization tools.

Prerequisites: Install all the necessary libraries on the local node where you will run Jupyter, e.g.,

sudo apt install python3
sudo apt install python3-pip
sudo pip3 install numpy scipy pandas scikit-learn matplotlib seaborn wordcloud

Launch the Jupyter Notebook as follows:

${ANALYTICS_ZOO_HOME}/bin/jupyter-with-zoo.sh --master local[*]

You can also specify other options available for pyspark in the above command if needed.

After successfully launching Jupyter, you will be able to navigate to the notebook dashboard using your browser. You can find the exact URL in the console output when you started Jupyter; by default, the dashboard URL is http://your_node:8888/

Try running the example code at the end of this page for verification.


Run with conda environment on Yarn

If you have already created the Analytics Zoo dependency conda environment package according to the Yarn cluster guide here, you can run Analytics Zoo Python programs with the following commands.

Here we use the Analytics Zoo Object Detection Python example for illustration.

For yarn-cluster mode:

export SPARK_HOME=the root directory of Spark
export ANALYTICS_ZOO_HOME=the folder where you extracted the downloaded Analytics Zoo zip package
export ENV_HOME=the parent directory of your conda environment package

PYSPARK_PYTHON=./environment/bin/python ${ANALYTICS_ZOO_HOME}/bin/spark-submit-python-with-zoo.sh \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
    --master yarn \
    --deploy-mode cluster \
    --executor-memory 10g \
    --driver-memory 10g \
    --executor-cores 8 \
    --num-executors 2 \
    --archives ${ENV_HOME}/environment.tar.gz#environment \
    predict.py model_path image_path output_path

For yarn-client mode:

export SPARK_HOME=the root directory of Spark
export ANALYTICS_ZOO_HOME=the folder where you extracted the downloaded Analytics Zoo zip package
export ENV_HOME=the parent directory of your conda environment package

mkdir ${ENV_HOME}/environment
tar -xzf ${ENV_HOME}/environment.tar.gz -C ${ENV_HOME}/environment

PYSPARK_DRIVER_PYTHON=${ENV_HOME}/environment/bin/python PYSPARK_PYTHON=./environment/bin/python ${ANALYTICS_ZOO_HOME}/bin/spark-submit-python-with-zoo.sh \
    --master yarn \
    --deploy-mode client \
    --executor-memory 10g \
    --driver-memory 10g \
    --executor-cores 16 \
    --num-executors 2 \
    --archives ${ENV_HOME}/environment.tar.gz#environment \
    predict.py model_path image_path output_path

Example code

To verify that Analytics Zoo runs successfully, run the following simple code:

import zoo
from zoo.common.nncontext import *
from zoo.pipeline.api.keras.models import *
from zoo.pipeline.api.keras.layers import *

# Print the current Analytics Zoo version.
print(zoo.__version__)
# Create a SparkContext and initialize the BigDL engine.
sc = init_nncontext()
# Create a Sequential model containing a Dense layer.
model = Sequential()
model.add(Dense(8, input_shape=(10, )))
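
If you installed without pip, one way to run this snippet non-interactively is through the bundled spark-submit script (a sketch; verify.py is a hypothetical file containing the code above):

# Save the snippet above as verify.py (hypothetical name) and submit it locally
${ANALYTICS_ZOO_HOME}/bin/spark-submit-python-with-zoo.sh --master local[*] verify.py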