Time Series Forecasting

Automated Time Series Prediction

Training a model using TimeSequencePredictor

TimeSequencePredictor can be used to train a model on historical time sequence data and to predict future sequences. Note that:
* The input time series must be uniformly sampled in time. Missing data points will lead to errors or unreliable prediction results.
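
One way to check the uniform-sampling requirement above before training is to compare the gaps between consecutive timestamps. The following is a minimal sketch in plain Python. The helper name find_sampling_gaps is hypothetical and is not part of Analytics Zoo, and the sketch assumes the first two timestamps define the expected sampling interval.

```python
from datetime import datetime, timedelta


def find_sampling_gaps(timestamps):
    """Return (previous, current) pairs whose spacing differs from the first interval.

    Hypothetical helper, not part of Analytics Zoo. It assumes the first two
    timestamps define the expected sampling interval.
    """
    ordered = sorted(timestamps)
    if len(ordered) < 3:
        return []  # fewer than two intervals: nothing to compare
    expected = ordered[1] - ordered[0]
    return [
        (prev, curr)
        for prev, curr in zip(ordered[1:], ordered[2:])
        if curr - prev != expected
    ]


# Daily series in which 2019-06-09 is missing.
series = [datetime(2019, 6, 6) + timedelta(days=d) for d in (0, 1, 2, 4, 5)]
gaps = find_sampling_gaps(series)
print(gaps)
```

An empty result means every interval matches the first one. For a pandas dataframe whose datetime column has already been parsed to datetime values, the column can be passed as a list, for example find_sampling_gaps(df["datetime"].tolist()).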

0. Prepare environment

We recommend using Anaconda to prepare the environment, especially if you want to run automated training on a yarn cluster (yarn-client mode only).

conda create -n zoo python=3.7 # "zoo" is the conda environment name; you can choose another name.
conda activate zoo
pip install analytics-zoo[automl]==0.9.0.dev0 # or above

1. Before training, initialize RayOnSpark.

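Run Ray on a local machine:

from zoo import init_spark_on_local
from zoo.ray import RayContext
sc = init_spark_on_local(cores=4)
ray_ctx = RayContext(sc=sc)
ray_ctx.init()  # start Ray on top of the Spark context

Run Ray on a yarn cluster:

from zoo import init_spark_on_yarn
from zoo.ray import RayContext
slave_num = 2
sc = init_spark_on_yarn(
        # other cluster-specific arguments are not shown here
        executor_memory="8g")
ray_ctx = RayContext(sc=sc, object_store_memory="5g")
ray_ctx.init()  # start Ray on top of the Spark context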

2. Create a TimeSequencePredictor

from zoo.chronos.regression.time_sequence_predictor import TimeSequencePredictor

tsp = TimeSequencePredictor(dt_col="datetime", target_col="value", extra_features_col=None)

3. Train on historical time sequence.

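train_df is a dataframe that contains at least the datetime column (dt_col) and the target column (target_col), for example:

datetime value
2019-06-06 1.2
2019-06-07 2.3
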
pipeline = tsp.fit(train_df, metric="mean_squared_error", recipe=RandomRecipe(num_samples=1))
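
The fit call above selects mean squared error as the training metric, and the evaluation step later on this page also mentions R2. As a reminder of what these two numbers measure, here is a minimal sketch of their textbook definitions in plain Python. The function names are illustrative only, and this is not the library's own implementation.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values, not results from any real model.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(mse, r2)
```

A lower MSE is better, while R2 is better the closer it is to 1. An R2 of 0 corresponds to a model that always predicts the mean of the targets.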

4. After training has finished, stop RayOnSpark by calling ray_ctx.stop().


Saving and Loading a TimeSequencePipeline

from zoo.chronos.pipeline.time_sequence import load_ts_pipeline

pipeline = load_ts_pipeline("/tmp/saved_pipeline/my.ppl")

Prediction and Evaluation using TimeSequencePipeline

A TimeSequencePipeline contains a chain of feature transformers and models that performs end-to-end time sequence prediction on input data. A TimeSequencePipeline can be saved and loaded for future deployment.

The output dataframe looks like the following (assuming n values are predicted forward). The datetime column holds the starting timestamp.

datetime value_0 value_1 ... value_{n-1}
2019-06-06 1.2 2.8 ... 4.4
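
To read a row of this wide layout as a sequence of dated predictions, each value_i can be paired with a timestamp. The sketch below is a plain-Python illustration under one assumption that the text above does not confirm: value_i is taken to correspond to the starting timestamp plus i sampling intervals. The helper name unroll_forecast is hypothetical, and the values come from the example row with n = 3.

```python
from datetime import datetime, timedelta


def unroll_forecast(start, values, step):
    """Pair each predicted value with a timestamp.

    Assumption (not confirmed by the source): value_i corresponds to
    start + i * step, where step is the sampling interval of the input data.
    """
    return [(start + i * step, v) for i, v in enumerate(values)]


row_start = datetime(2019, 6, 6)
row_values = [1.2, 2.8, 4.4]  # value_0, value_1, value_2
forecast = unroll_forecast(row_start, row_values, timedelta(days=1))
for ts, v in forecast:
    print(ts.date(), v)
```

If the library instead aligns value_0 with the first step after the starting timestamp, the same helper can be called with start shifted by one interval.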

result_df = pipeline.predict(test_df)
# evaluate with MSE and R2 metrics
mse, rs = pipeline.evaluate(test_df, metrics=["mse", "rs"])
# fit with new data and train for 5 epochs