TFDataset
TFDataset represents a distributed collection of elements to be fed into a TensorFlow graph. A TFDataset can be created from an RDD, where each record is a list of numpy.ndarray representing the tensors to be fed into the TensorFlow graph on each iteration. A TFDataset must be used with the TFOptimizer or TFPredictor.
Remarks:
- You need to install tensorflow==1.15.0 on your driver node.
- Your operating system (OS) must be one of the following 64-bit systems: Ubuntu 16.04 or later, or macOS 10.12.6 or later.
- To run on other systems, you need to manually compile the TensorFlow source code. Instructions can be found here.
Methods
from_rdd
Create a TFDataset from an RDD.
For training and evaluation, both the features and labels arguments should be specified. Each element of the RDD should be a tuple of two, (features, labels), where each part has the same structure of numpy.ndarrays as the corresponding features or labels argument.
E.g. if features is [(tf.float32, [10]), (tf.float32, [20])] and labels is {"label1": (tf.float32, [10]), "label2": (tf.float32, [20])}, then a valid element of the RDD could be
(
[np.zeros(dtype=float, shape=(10,)), np.zeros(dtype=float, shape=(20,))],
{"label1": np.zeros(dtype=float, shape=(10,)),
"label2": np.zeros(dtype=float, shape=(20,))}
)
If labels is not specified, then the above element should be changed to
[np.zeros(dtype=float, shape=(10,)), np.zeros(dtype=float, shape=(20,))]
For inference, labels can be left unspecified. Each element of the RDD should then consist of ndarrays with the same structure as the features argument.
A note on the legacy API: if you are using the names, shapes and types arguments, each element of the RDD should be a list of numpy.ndarray.
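The element structure described above can be sketched as follows. This is an illustration only: the helper names `make_element` and `build_dataset`, the constants, and the import path `zoo.tfpark` are assumptions not stated in this page, and `build_dataset` needs a live SparkContext plus tensorflow==1.15.0 to run.

```python
import numpy as np

# Per-element shapes taken from the features/labels spec in the example
# above (the batch dimension is not part of an element).
FEATURE_SHAPES = [(10,), (20,)]
LABEL_SHAPES = {"label1": (10,), "label2": (20,)}


def make_element(with_labels=True):
    """Build one RDD element whose structure mirrors the features/labels spec."""
    features = [np.zeros(shape, dtype=np.float32) for shape in FEATURE_SHAPES]
    if not with_labels:
        # Inference: the element is just the features structure.
        return features
    labels = {name: np.zeros(shape, dtype=np.float32)
              for name, shape in LABEL_SHAPES.items()}
    return (features, labels)


def build_dataset(sc, num_elements=64, batch_size=32):
    """Sketch only: `sc` is a SparkContext; batch_size is assumed to be a
    multiple of the total core number of the cluster."""
    import tensorflow as tf
    from zoo.tfpark import TFDataset  # assumed import path

    rdd = sc.parallelize([make_element() for _ in range(num_elements)])
    return TFDataset.from_rdd(
        rdd,
        features=[(tf.float32, [10]), (tf.float32, [20])],
        labels={"label1": (tf.float32, [10]), "label2": (tf.float32, [20])},
        batch_size=batch_size,
    )
```

Only `make_element` runs without Spark and TensorFlow; it is useful for checking locally that an element has the expected nesting and shapes before parallelizing it.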
Python
from_rdd(rdd, features, labels=None, batch_size=-1, batch_per_thread=-1, hard_code_batch_size=False, val_rdd=None)
Arguments
- rdd: an RDD containing the numpy.ndarrays to be used for training/evaluation/inference
- features: the structure of input features, should be one of the following:
    - a tuple (dtype, shape), e.g. (tf.float32, [28, 28, 1])
    - a list of such tuples [(dtype1, shape1), (dtype2, shape2)], e.g. [(tf.float32, [10]), (tf.float32, [20])]
    - a dict mapping string names to such tuples {"name": (dtype, shape)}, e.g. {"input1": (tf.float32, [10]), "input2": (tf.float32, [20])}
- labels: the structure of input labels, format is the same as features
- batch_size: the batch size, used for training, should be a multiple of total core num
- batch_per_thread: the batch size for each thread, used for inference or evaluation
- hard_code_batch_size: whether to hard code the batch_size into tensorflow graph, if True, the static size of the first dimension of the resulting tensors is batch_size/total_core_num (training) or batch_per_thread for inference; if False, it is None.
- val_rdd: validation data with the same structure as rdd
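The relationship between batch_size, the total core number and hard_code_batch_size can be worked through with a small helper. This is a plain-Python illustration of the rule as documented in the argument list, not library code; the function names are hypothetical.

```python
def per_core_batch_size(batch_size, total_core_num):
    """Number of records each core handles per training iteration."""
    if batch_size <= 0 or total_core_num <= 0:
        raise ValueError("batch_size and total_core_num must be positive")
    if batch_size % total_core_num != 0:
        raise ValueError("batch_size should be a multiple of total_core_num")
    return batch_size // total_core_num


def static_first_dim(hard_code_batch_size, batch_size=-1,
                     batch_per_thread=-1, total_core_num=1):
    """Static size of the first tensor dimension under the documented rule."""
    if not hard_code_batch_size:
        return None  # the first dimension stays dynamic
    if batch_size > 0:  # training
        return per_core_batch_size(batch_size, total_core_num)
    if batch_per_thread > 0:  # inference or evaluation
        return batch_per_thread
    raise ValueError("either batch_size or batch_per_thread must be set")
```

For instance, with batch_size=64 on 8 cores and hard_code_batch_size=True, the first dimension of each tensor would be 8; with hard_code_batch_size=False it is None.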
from_string_rdd
Create a TFDataset from an RDD of strings. Each element in the RDD should be a single string. The feature_tensors of the returned TFDataset contain only one Tensor, whose type is tf.string and whose shape is (None,). The returned TFDataset does not have label_tensors. If the dataset is used for training, the label should be encoded in the string.
Python
from_string_rdd(string_rdd, batch_size=-1, batch_per_thread=-1, hard_code_batch_size=False, validation_string_rdd=None)
Arguments
- string_rdd: the RDD of strings
- batch_size: the batch size, used for training, should be a multiple of total core num
- batch_per_thread: the batch size for each thread, used for inference or evaluation
- hard_code_batch_size: whether to hard code the batch_size into tensorflow graph, if True, the static size of the first dimension of the resulting tensors is batch_size/total_core_num (training) or batch_per_thread for inference; if False, it is None.
- validation_string_rdd: the RDD of strings to be used in validation
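Because the returned TFDataset has no label_tensors, a training label has to travel inside each string. One possible encoding is sketched below; the tab-separated layout and the helper names are assumptions chosen for illustration, and the library does not prescribe any particular format.

```python
def encode_record(label, text, sep="\t"):
    """Prefix the text with its integer label, separated by `sep`."""
    return f"{int(label)}{sep}{text}"


def decode_record(record, sep="\t"):
    """Split a record back into (label, text); only the first `sep` counts."""
    label, text = record.split(sep, 1)
    return int(label), text


# Each encoded string would become one element of string_rdd.
records = [encode_record(1, "good movie"), encode_record(0, "bad movie")]
```

In a real model the decoding step has to happen inside the TensorFlow graph using string operations; the Python `decode_record` above only documents the layout.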
from_bytes_rdd
Create a TFDataset from an RDD of bytes. Each element in the RDD should be a bytes object. The feature_tensors of the returned TFDataset contain only one Tensor, whose type is tf.string and whose shape is (None,). The returned TFDataset does not have label_tensors. If the dataset is used for training, the label should be encoded in the bytes.
Python
from_bytes_rdd(bytes_rdd, batch_size=-1, batch_per_thread=-1, hard_code_batch_size=False, validation_bytes_rdd=None)
Arguments
- bytes_rdd: the RDD of bytes
- batch_size: the batch size, used for training, should be a multiple of total core num
- batch_per_thread: the batch size for each thread, used for inference or evaluation
- hard_code_batch_size: whether to hard code the batch_size into tensorflow graph, if True, the static size of the first dimension of the resulting tensors is batch_size/total_core_num (training) or batch_per_thread for inference; if False, it is None.
- validation_bytes_rdd: the RDD of bytes to be used in validation
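As with strings, a training label has to be carried inside each bytes object. A minimal sketch of one layout follows, assuming a fixed 4-byte little-endian integer header in front of the payload; the layout and the helper names are hypothetical and not part of the API.

```python
import struct

# 4-byte little-endian signed integer holding the label.
_HEADER = struct.Struct("<i")


def pack_record(label, payload):
    """Return the label header followed by the raw payload bytes."""
    return _HEADER.pack(label) + payload


def unpack_record(record):
    """Recover (label, payload) from a packed record."""
    (label,) = _HEADER.unpack_from(record, 0)
    return label, record[_HEADER.size:]
```

A fixed-size header keeps the label at a known offset, which makes it straightforward to slice off and decode in the graph.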
from_ndarrays
Create a TFDataset from a nested structure of numpy ndarrays. Each element in the resulting TFDataset has the same structure as the tensors argument and is created by indexing on the first dimension of each ndarray in the tensors argument.
This method is equivalent to calling sc.parallelize on the tensors and then calling TFDataset.from_rdd.
Python
from_ndarrays(tensors, batch_size=-1, batch_per_thread=-1, hard_code_batch_size=False, val_tensors=None)
Arguments
- tensors: the numpy ndarrays
- batch_size: the batch size, used for training, should be a multiple of total core num
- batch_per_thread: the batch size for each thread, used for inference or evaluation
- hard_code_batch_size: whether to hard code the batch_size into tensorflow graph, if True, the static size of the first dimension of the resulting tensors is batch_size/total_core_num (training) or batch_per_thread for inference; if False, it is None.
- val_tensors: the numpy ndarrays used for validation during training
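The way elements are derived from the tensors argument can be illustrated in plain NumPy. The helper `element_at` is hypothetical and only mimics the documented behaviour (indexing the first dimension of every ndarray while preserving the nesting); it is not the library's implementation.

```python
import numpy as np


def element_at(tensors, i):
    """Index the first dimension of every ndarray in a nested structure."""
    if isinstance(tensors, np.ndarray):
        return tensors[i]
    if isinstance(tensors, dict):
        return {key: element_at(value, i) for key, value in tensors.items()}
    if isinstance(tensors, (list, tuple)):
        return type(tensors)(element_at(t, i) for t in tensors)
    raise TypeError(f"unsupported structure: {type(tensors)!r}")


# 4 samples: a (4, 3) feature array and a (4,) label array.
tensors = (np.arange(12, dtype=np.float32).reshape(4, 3),
           np.arange(4, dtype=np.int32))
elements = [element_at(tensors, i) for i in range(4)]
```

Each entry of `elements` is a tuple of a shape-(3,) feature vector and a scalar label, which is the per-element structure the resulting dataset would see.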
from_image_set
Create a TFDataset from an ImageSet. Each ImageFeature in the ImageSet should already have the "sample" field, i.e. be the result of the ImageSetToSample transformer.
Python
from_image_set(image_set, image, label=None, batch_size=-1, batch_per_thread=-1, hard_code_batch_size=False, validation_image_set=None)
Arguments
- image_set: the ImageSet used to create this TFDataset
- image: a tuple of two, the first element is the type of image, the second element is the shape of this element, e.g. (tf.float32, [224, 224, 3])
- label: a tuple of two, the first element is the type of label, the second element is the shape of this element, e.g. (tf.int32, [1])
- batch_size: the batch size, used for training, should be a multiple of total core num
- batch_per_thread: the batch size for each thread, used for inference or evaluation
- hard_code_batch_size: whether to hard code the batch_size into tensorflow graph, if True, the static size of the first dimension of the resulting tensors is batch_size/total_core_num (training) or batch_per_thread for inference; if False, it is None.
- validation_image_set: the ImageSet used for validation during training
from_text_set
Create a TFDataset from a TextSet. The TextSet must be transformed to Sample, i.e. the result of TextFeatureToSample transformer.
Python
from_text_set(text_set, text, label=None, batch_size=-1, batch_per_thread=-1, hard_code_batch_size=False, validation_image_set=None)
Arguments
- text_set: the TextSet used to create this TFDataset
- text: a tuple of two, the first element is the type of this input feature, the second element is the shape of this element, e.g. (tf.float32, [10, 100, 4]). text can also be a nested structure of such tuples.
- label: a tuple of two, the first element is the type of label, the second element is the shape of this element, e.g. (tf.int32, [1]). label can also be a nested structure of such tuples.
- batch_size: the batch size, used for training, should be a multiple of total core num
- batch_per_thread: the batch size for each thread, used for inference or evaluation
- hard_code_batch_size: whether to hard code the batch_size into tensorflow graph, if True, the static size of the first dimension of the resulting tensors is batch_size/total_core_num (training) or batch_per_thread for inference; if False, it is None.
- validation_image_set: The TextSet used for validation during training
from_feature_set
Create a TFDataset from a FeatureSet. Currently, each element in this FeatureSet must be an ImageFeature that has a "sample" field, i.e. the result of the ImageSetToSample transformer.
Python
from_feature_set(dataset, features, labels=None, batch_size=-1, batch_per_thread=-1, hard_code_batch_size=False, validation_dataset=None)
Arguments
- dataset: the feature set used to create this TFDataset
- features: a tuple of two, the first element is the type of this input feature, the second element is the shape of this element, e.g. (tf.float32, [224, 224, 3]). features can also be a nested structure of such tuples.
- labels: a tuple of two, the first element is the type of label, the second element is the shape of this element, e.g. (tf.int32, [1]). labels can also be a nested structure of such tuples.
- batch_size: the batch size, used for training, should be a multiple of total core num
- batch_per_thread: the batch size for each thread, used for inference or evaluation
- hard_code_batch_size: whether to hard code the batch_size into tensorflow graph, if True, the static size of the first dimension of the resulting tensors is batch_size/total_core_num (training) or batch_per_thread for inference; if False, it is None.
- validation_dataset: The FeatureSet used for validation during training
from_tf_data_dataset
Create a TFDataset from a tf.data.Dataset.
The recommended way to create the dataset is to read files from a shared file system (e.g. HDFS) that is accessible from every executor of this Spark application.
If the dataset is created by reading files in the local file system, then the files must exist on every executor at exactly the same path. The path must be absolute; relative paths are not supported.
A few kinds of datasets are not supported for now:
1. datasets created from tf.data.Dataset.from_generator
2. datasets with a Dataset.batch operation
3. datasets with a Dataset.repeat operation
4. datasets that contain tf.py_func, tf.py_function or tf.numpy_function
Python
from_tf_data_dataset(dataset, batch_size=-1, batch_per_thread=-1, hard_code_batch_size=False, validation_dataset=None)
Arguments
- dataset: the tf.data.Dataset
- batch_size: the batch size, used for training, should be a multiple of total core num
- batch_per_thread: the batch size for each thread, used for inference or evaluation
- hard_code_batch_size: whether to hard code the batch_size into tensorflow graph, if True, the static size of the first dimension of the resulting tensors is batch_size/total_core_num (training) or batch_per_thread for inference; if False, it is None.
- validation_dataset: the dataset used for validation
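The constraints above can be sketched as follows. The helper names, the file path, the batch size and the import path `zoo.tfpark` are assumptions for illustration; `build_text_line_dataset` needs tensorflow==1.15.0 and an initialized Spark environment, while `check_local_path` runs anywhere.

```python
import os


def check_local_path(path):
    """Reject relative local paths, which are not supported."""
    if not os.path.isabs(path):
        raise ValueError(f"expected an absolute path, got {path!r}")
    return path


def build_text_line_dataset(path, batch_size=32):
    """Sketch only: the file is assumed to exist at `path` on every executor,
    and batch_size is assumed to be a multiple of the total core number."""
    import tensorflow as tf
    from zoo.tfpark import TFDataset  # assumed import path

    dataset = tf.data.TextLineDataset(check_local_path(path))
    # No .batch() or .repeat() here: batching is requested through the
    # batch_size argument instead of a Dataset.batch operation.
    return TFDataset.from_tf_data_dataset(dataset, batch_size=batch_size)
```

The path check only applies to local files; a URI on a shared file system such as HDFS would be passed to the dataset reader directly.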
from_dataframe
Create a TFDataset from a pyspark.sql.DataFrame.
Python
from_dataframe(df, feature_cols, labels_cols=None, batch_size=-1, batch_per_thread=-1, hard_code_batch_size=False, validation_df=None)
Arguments
- df: the DataFrame for the dataset
- feature_cols: a list of strings, indicating which columns are used as features. Currently supported types are FloatType, DoubleType, IntegerType, LongType, ArrayType (values should be numbers), DenseVector and SparseVector. For ArrayType, DenseVector and SparseVector, the elements of the same column are assumed to have the same size.
- labels_cols: a list of strings, indicating which columns are used as labels. Currently supported types are FloatType, DoubleType, IntegerType, LongType, ArrayType (values should be numbers), DenseVector and SparseVector. For ArrayType, DenseVector and SparseVector, the elements of the same column are assumed to have the same size.
- batch_size: the batch size, used for training, should be a multiple of total core num
- batch_per_thread: the batch size for each thread, used for inference or evaluation
- hard_code_batch_size: whether to hard code the batch_size into tensorflow graph, if True, the static size of the first dimension of the resulting tensors is batch_size/total_core_num (training) or batch_per_thread for inference; if False, it is None.
- validation_df: the DataFrame used for validation
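Since array-like columns are assumed to hold values of a single size, it can help to verify that on a sample of rows before creating the dataset. The sketch below is a hypothetical pre-check plus an example call; the column names "features" and "label", the batch size and the import path `zoo.tfpark` are assumptions.

```python
def inconsistent_columns(rows, cols):
    """Return the columns whose array values differ in length across rows.

    `rows` can be any sequence of mappings indexable by column name,
    e.g. dicts or the Row objects returned by df.take(n).
    """
    bad = []
    for col in cols:
        sizes = {len(row[col]) for row in rows}
        if len(sizes) > 1:
            bad.append(col)
    return bad


def build_dataset(df, batch_size=32):
    """Sketch only: requires pyspark, tensorflow==1.15.0 and Analytics Zoo."""
    from zoo.tfpark import TFDataset  # assumed import path

    return TFDataset.from_dataframe(
        df,
        feature_cols=["features"],  # hypothetical column names
        labels_cols=["label"],
        batch_size=batch_size,
    )
```

The check only applies to array-like columns; scalar columns such as FloatType or IntegerType have no length and should not be passed to `inconsistent_columns`.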