TFDataset


TFDataset represents a distributed collection of elements to be fed into a TensorFlow graph. A TFDataset can be created from an RDD in which each record is a list of numpy.ndarray, representing the tensors to be fed into the TensorFlow graph on each iteration. A TFDataset must be used with a TFOptimizer or TFPredictor.

Remarks:

Methods

from_rdd

Create a TFDataset from an RDD.

For training and evaluation, both the features and labels arguments should be specified. Each element of the RDD should be a tuple of two, (features, labels), whose two parts mirror the structure declared by the features and labels arguments, with numpy.ndarrays in place of the (dtype, shape) specifications.

E.g., if features is [(tf.float32, [10]), (tf.float32, [20])] and labels is {"label1": (tf.float32, [10]), "label2": (tf.float32, [20])}, then a valid element of the RDD could be

    (
    [np.zeros(dtype=float, shape=(10,)), np.zeros(dtype=float, shape=(20,))],
     {"label1": np.zeros(dtype=float, shape=(10,)),
      "label2": np.zeros(dtype=float, shape=(20,))}
    )

If labels is not specified, then the above element should be changed to

    [np.zeros(dtype=float, shape=(10,)), np.zeros(dtype=float, shape=(20,))]

For inference, labels need not be specified. Each element of the RDD should then be numpy.ndarrays matching the structure of the features argument.

A note on the legacy API: if you are using the names, shapes, and types arguments, each element of the RDD should be a list of numpy.ndarray.

Python

from_rdd(rdd, features, labels=None, batch_size=-1, batch_per_thread=-1, hard_code_batch_size=False, val_rdd=None)

Arguments
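As a sketch of the expected element layout, the snippet below builds one record matching the example specification above. The sc/TFDataset calls are shown only as comments, since they require a running Spark context:

```python
import numpy as np

# One RDD element matching features=[(tf.float32, [10]), (tf.float32, [20])]
# and labels={"label1": (tf.float32, [10]), "label2": (tf.float32, [20])}.
element = (
    [np.zeros((10,), dtype=np.float32), np.zeros((20,), dtype=np.float32)],
    {"label1": np.zeros((10,), dtype=np.float32),
     "label2": np.zeros((20,), dtype=np.float32)},
)

# With a SparkContext available, the dataset could then be built roughly as:
# rdd = sc.parallelize([element] * 100)
# dataset = TFDataset.from_rdd(
#     rdd,
#     features=[(tf.float32, [10]), (tf.float32, [20])],
#     labels={"label1": (tf.float32, [10]), "label2": (tf.float32, [20])},
#     batch_size=32)
```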

from_string_rdd

Create a TFDataset from an RDD of strings. Each element in the RDD should be a single string. The returned TFDataset's feature_tensors contains exactly one Tensor, of type tf.string and shape (None,). The returned TFDataset has no label_tensors; if the dataset is used for training, the label should be encoded in the string.

Python

from_string_rdd(string_rdd, batch_size=-1, batch_per_thread=-1, hard_code_batch_size=False, validation_string_rdd=None)

Arguments
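One common way to encode the label in the string is to put it in front of the features, CSV-style. This layout is an illustrative assumption, not something from_string_rdd prescribes; decoding would happen inside the TensorFlow graph on the single tf.string feature tensor:

```python
# Hypothetical "label,feature1,feature2,..." record layout.
record = "1,0.5,0.25,0.75"
label, *feature_fields = record.split(",")
features = [float(v) for v in feature_fields]

# With a SparkContext available:
# string_rdd = sc.parallelize([record] * 100)
# dataset = TFDataset.from_string_rdd(string_rdd, batch_per_thread=4)
# In the graph, the same decoding would be expressed with ops such as
# tf.strings.split and tf.strings.to_number.
```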

from_bytes_rdd

Create a TFDataset from an RDD of bytes. Each element in the RDD should be a bytes object. The returned TFDataset's feature_tensors contains exactly one Tensor, of type tf.string and shape (None,). The returned TFDataset has no label_tensors; if the dataset is used for training, the label should be encoded in the bytes.

Python

from_bytes_rdd(bytes_rdd, batch_size=-1, batch_per_thread=-1, hard_code_batch_size=False, validation_bytes_rdd=None)

Arguments
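A sketch of one possible bytes encoding. The pickle layout here is an assumption chosen for illustration; any scheme works as long as the graph (or a preprocessing step) can decode it:

```python
import pickle

import numpy as np

# Hypothetical record: a pickled dict carrying both features and label.
record = pickle.dumps({"feature": np.arange(4, dtype=np.float32), "label": 1})

# Each RDD element is a bytes object:
# bytes_rdd = sc.parallelize([record] * 100)
# dataset = TFDataset.from_bytes_rdd(bytes_rdd, batch_per_thread=4)

decoded = pickle.loads(record)
```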

from_ndarrays

Create a TFDataset from a nested structure of numpy ndarrays. Each element in the resulting TFDataset has the same structure as the tensors argument and is created by indexing the first dimension of each ndarray in tensors.

This method is equivalent to calling sc.parallelize on the tensors and then TFDataset.from_rdd.

Python

from_ndarrays(tensors, batch_size=-1, batch_per_thread=-1, hard_code_batch_size=False, val_tensors=None)

Arguments
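The first-dimension indexing can be sketched as follows; the TFDataset call is commented out since it needs a Spark context, and the array names are illustrative:

```python
import numpy as np

# 100 samples: features of shape (10,) and scalar labels.
feature_arr = np.zeros((100, 10), dtype=np.float32)
label_arr = np.zeros((100,), dtype=np.float32)

# Element i of the resulting TFDataset corresponds to
# (feature_arr[i], label_arr[i]).
first_element = (feature_arr[0], label_arr[0])

# dataset = TFDataset.from_ndarrays((feature_arr, label_arr), batch_size=32)
```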

from_image_set

Create a TFDataset from an ImageSet. Each ImageFeature in the ImageSet should already have the "sample" field, i.e. be the result of the ImageSetToSample transformer.

Python

from_image_set(image_set, image, label=None, batch_size=-1, batch_per_thread=-1, hard_code_batch_size=False, validation_image_set=None)

Arguments

from_text_set

Create a TFDataset from a TextSet. The TextSet must already have been transformed to Samples, i.e. be the result of the TextFeatureToSample transformer.

Python

from_text_set(text_set, text, label=None, batch_size=-1, batch_per_thread=-1, hard_code_batch_size=False, validation_image_set=None)

Arguments

from_feature_set

Create a TFDataset from a FeatureSet. Currently, each element in this FeatureSet must be an ImageFeature that has a "sample" field, i.e. the result of the ImageSetToSample transformer.

Python

from_feature_set(dataset, features, labels=None, batch_size=-1, batch_per_thread=-1, hard_code_batch_size=False, validation_dataset=None)

Arguments

from_tf_data_dataset

Create a TFDataset from a tf.data.Dataset.

The recommended way to create the dataset is to read files from a shared file system (e.g. HDFS) that is accessible from every executor of this Spark application.

If the dataset is created by reading files in the local file system, then the files must exist on every executor in exactly the same path. The path must be an absolute path; relative paths are not supported.

A few kinds of dataset are not supported for now:

1. datasets created from tf.data.Dataset.from_generator
2. datasets with a Dataset.batch operation
3. datasets with a Dataset.repeat operation
4. datasets containing tf.py_func, tf.py_function or tf.numpy_function

Python

from_tf_data_dataset(dataset, batch_size=-1, batch_per_thread=-1, hard_code_batch_size=False, validation_dataset=None)

Arguments
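A minimal sketch of an acceptable input dataset, assuming TensorFlow is installed. Note that it avoids batch, repeat, and py_func-style ops, per the restrictions above, and leaves batching to TFDataset:

```python
import numpy as np
import tensorflow as tf

# Slice in-memory arrays into per-sample elements; do NOT call .batch()
# or .repeat() here -- TFDataset handles batching and iteration itself.
ds = tf.data.Dataset.from_tensor_slices(
    (np.zeros((8, 10), dtype=np.float32), np.zeros((8,), dtype=np.float32)))
ds = ds.map(lambda x, y: (x * 2.0, y))  # pure graph-level transforms are fine

# With a Spark context available:
# dataset = TFDataset.from_tf_data_dataset(ds, batch_size=4)
```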

from_dataframe

Create a TFDataset from a pyspark.sql.DataFrame.

Python

from_dataframe(df, feature_cols, labels_cols=None, batch_size=-1, batch_per_thread=-1, hard_code_batch_size=False, validation_df=None)

Arguments