Run
You need to first install analytics-zoo, either from pip or without pip.
NOTE: We have tested on Python 3.6 and Python 3.7. Support for Python 2.7 has been removed due to its end of life.
Run after pip install
Important:
- Installing analytics-zoo from pip will automatically install pyspark. To avoid possible conflicts, it is highly recommended that you unset SPARK_HOME if it exists in your environment.
- Always call init_nncontext() at the very beginning of your code after pip install. This will create a SparkContext with optimized performance configuration and initialize the BigDL engine.
from zoo.common.nncontext import *
sc = init_nncontext()
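In recent versions, init_nncontext() can also take an application name, which then shows up in the Spark UI. A minimal sketch, where the name is just an illustrative value:
from zoo.common.nncontext import *
# Create a SparkContext with an illustrative application name and initialize the BigDL engine.
sc = init_nncontext("Analytics Zoo Example")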
Use an Interactive Shell
- Type python in the command line to start a REPL.
- Try to run the example code to verify the installation (a quick check is also sketched below).
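For a quick check in the REPL, before running the full example at the end of this page, you can print the installed version and create the context:
import zoo
from zoo.common.nncontext import *
print(zoo.__version__)   # print the installed Analytics Zoo version
sc = init_nncontext()    # should create a SparkContext without errors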
Use Jupyter Notebook
- Start jupyter notebook as you normally do, e.g.
jupyter notebook --notebook-dir=./ --ip=* --no-browser
- Try to run the example code to verify the installation.
Configurations
- Increase memory
export SPARK_DRIVER_MEMORY=20g
- Add extra jars or Python packages
Set the environment variables BIGDL_JARS and BIGDL_PACKAGES BEFORE creating the SparkContext:
export BIGDL_JARS=...
export BIGDL_PACKAGES=...
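Because these variables only need to be visible before the SparkContext is created, you can also set them from Python before importing zoo and calling init_nncontext(). A minimal sketch with placeholder paths (this assumes the variables are picked up when the context is created, as described above):
import os
# Placeholders only -- point these at your actual jar / package paths.
os.environ["BIGDL_JARS"] = "/path/to/extra.jar"
os.environ["BIGDL_PACKAGES"] = "/path/to/extra_python_packages"
from zoo.common.nncontext import *
sc = init_nncontext()   # the variables above must be set before this call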
Run on Yarn after pip install
You should use init_spark_on_yarn rather than init_nncontext() here to create a SparkContext on Yarn.
Start python and then execute the following code:
from zoo import init_spark_on_yarn
sc = init_spark_on_yarn(
hadoop_conf="path to the yarn configuration folder",
conda_name="zoo", # The name of the created conda-env
num_executors=2,
executor_cores=4,
executor_memory="8g",
driver_memory="2g",
driver_cores=4,
extra_executor_memory_for_ray="10g")
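Once init_spark_on_yarn returns, you can run a quick sanity check to confirm that jobs are actually dispatched to the Yarn executors; a minimal sketch:
# Simple sanity check: sum the numbers 0..99 on the executors (expected result: 4950).
print(sc.parallelize(range(100), numSlices=4).reduce(lambda a, b: a + b))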
Run without pip install
- Note that Python 3.6 is only compatible with Spark 1.6.4, 2.0.3, 2.1.1 and >=2.2.0. See this issue for more discussion.
Set SPARK_HOME and ANALYTICS_ZOO_HOME
- If you download Analytics Zoo from the Release Page:
export SPARK_HOME=the root directory of Spark
export ANALYTICS_ZOO_HOME=the path where you extract the analytics-zoo package
- If you build Analytics Zoo by yourself:
export SPARK_HOME=the root directory of Spark
export ANALYTICS_ZOO_HOME=the dist directory of Analytics Zoo
Update spark-analytics-zoo.conf (Optional)
If you have customized properties that you would otherwise pass through the --properties-file option of spark-submit/pyspark, you can add them to ${ANALYTICS_ZOO_HOME}/conf/spark-analytics-zoo.conf instead.
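The file follows the standard Spark properties-file format, with one property name and value per line; for example (illustrative values only):
spark.driver.memory 8g
spark.executor.memory 8g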
Run with pyspark
${ANALYTICS_ZOO_HOME}/bin/pyspark-shell-with-zoo.sh --master local[*]
- --master set the master URL to connect to.
- --jars if there are extra jars needed.
- --py-files if there are extra Python packages needed.
You can also specify other options available for pyspark in the above command if needed.
Try to run the example code for verification.
Run with spark-submit
An Analytics Zoo Python program runs as a standard pyspark program, which requires all Python dependencies (e.g., numpy) used by the program to be installed on each node in the Spark cluster. You can try running the Analytics Zoo Object Detection Python example as follows:
${ANALYTICS_ZOO_HOME}/bin/spark-submit-python-with-zoo.sh --master local[*] predict.py model_path image_path output_path
Run with Jupyter Notebook
With the full Python API support in Analytics Zoo, users can use our package together with powerful notebooks (such as Jupyter Notebook) in a distributed fashion across the cluster, combining Python libraries, Spark SQL/DataFrames and MLlib, Analytics Zoo deep learning models, and interactive visualization tools.
Prerequisites: Install all the necessary libraries on the local node where you will run Jupyter, e.g.,
sudo apt install python
sudo apt install python-pip
sudo pip install numpy scipy pandas scikit-learn matplotlib seaborn wordcloud
Launch the Jupyter Notebook as follows:
${ANALYTICS_ZOO_HOME}/bin/jupyter-with-zoo.sh --master local[*]
- --master set the master URL to connect to.
- --jars if there are extra jars needed.
- --py-files if there are extra Python packages needed.
You can also specify other options available for pyspark in the above command if needed.
After successfully launching Jupyter, you will be able to navigate to the notebook dashboard using your browser. You can find the exact URL in the console output when you started Jupyter; by default, the dashboard URL is http://your_node:8888/
Try to run the example code for verification.
Run with conda environment on Yarn
If you have already created an Analytics Zoo dependency conda environment package according to the Yarn cluster guide here, you can run Analytics Zoo Python programs with the following commands.
Here we use the Analytics Zoo Object Detection Python example for illustration.
- Yarn cluster mode (with conda package name "environment.tar.gz" for example)
export SPARK_HOME=the root directory of Spark
export ANALYTICS_ZOO_HOME=the folder where you extract the downloaded Analytics Zoo zip package
export ENV_HOME=the parent directory of your conda environment package
PYSPARK_PYTHON=./environment/bin/python ${ANALYTICS_ZOO_HOME}/bin/spark-submit-python-with-zoo.sh \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--master yarn-cluster \
--executor-memory 10g \
--driver-memory 10g \
--executor-cores 8 \
--num-executors 2 \
--archives ${ENV_HOME}/environment.tar.gz#environment \
predict.py model_path image_path output_path
- Yarn client mode (with conda package name "environment.tar.gz" for example)
export SPARK_HOME=the root directory of Spark
export ANALYTICS_ZOO_HOME=the folder where you extract the downloaded Analytics Zoo zip package
export ENV_HOME=the parent directory of your conda environment package
mkdir ${ENV_HOME}/environment
tar -xzf ${ENV_HOME}/environment.tar.gz -C ${ENV_HOME}/environment
PYSPARK_DRIVER_PYTHON=${ENV_HOME}/environment/bin/python PYSPARK_PYTHON=./environment/bin/python ${ANALYTICS_ZOO_HOME}/bin/spark-submit-python-with-zoo.sh \
--master yarn \
--deploy-mode client \
--executor-memory 10g \
--driver-memory 10g \
--executor-cores 16 \
--num-executors 2 \
--archives ${ENV_HOME}/environment.tar.gz#environment \
predict.py model_path image_path output_path
Example code
To verify that Analytics Zoo can run successfully, run the following simple code:
import zoo
from zoo.common.nncontext import *
from zoo.pipeline.api.keras.models import *
from zoo.pipeline.api.keras.layers import *
# Get the current Analytics Zoo version
zoo.__version__
# Create a SparkContext and initialize the BigDL engine.
sc = init_nncontext()
# Create a Sequential model containing a Dense layer.
model = Sequential()
model.add(Dense(8, input_shape=(10, )))
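To further check that the model can actually compute, you can run a forward pass on random input. This is a minimal sketch assuming the BigDL Layer.forward API, which the Keras-style models in Analytics Zoo inherit; the input values are random and only used as a shape check:
import numpy as np
# Feed a random batch of two 10-dimensional samples through the model;
# the output should have shape (2, 8).
output = model.forward(np.random.rand(2, 10))
print(output.shape)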