Analytics Zoo provides a set of text-related APIs for building an end-to-end text processing pipeline, including text loading, preprocessing, training, and inference.


TextSet is a collection of TextFeatures, where each TextFeature holds the information for a single text record.

A TextSet can be either a DistributedTextSet consisting of a text RDD or a LocalTextSet consisting of a text array.

Read texts as TextSet

Read texts from a directory

Read texts with labels from a directory.

The specified directory is expected to contain several subdirectories, one per category, each holding the text files that belong to that category. Each category is given a label (starting from 0) according to its position when all subdirectories are sorted in ascending order. Each text is given the label of the directory in which it is located.


textSet = TextSet.read(path, sc = null, minPartitions = 1)


text_set = TextSet.read(path, sc=None, min_partitions=1)
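The labelling rule described above can be illustrated with a small standalone sketch. It does not call Analytics Zoo: `assign_category_labels` and `read_labelled_texts` are hypothetical helpers, and the sketch assumes that "ascending order" means plain string ordering of the subdirectory names and that the files are UTF-8 text.

```python
import os


def assign_category_labels(subdirectories):
    """Map each category name to a label given by its position in sorted order."""
    # Labels start from 0 and follow the ascending order of the names.
    return {name: label for label, name in enumerate(sorted(subdirectories))}


def read_labelled_texts(path):
    """Return (text, label) pairs for every file under the category subdirectories."""
    categories = [
        name for name in os.listdir(path)
        if os.path.isdir(os.path.join(path, name))
    ]
    label_of = assign_category_labels(categories)

    records = []
    for name in sorted(categories):
        category_dir = os.path.join(path, name)
        for file_name in sorted(os.listdir(category_dir)):
            file_path = os.path.join(category_dir, file_name)
            if os.path.isfile(file_path):
                with open(file_path, encoding="utf-8") as f:
                    # Every text inherits the label of its directory.
                    records.append((f.read(), label_of[name]))
    return records
```

For example, subdirectories named `sports`, `arts` and `tech` would receive labels 1, 0 and 2 respectively under this assumption.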

Read texts from csv file

Read texts with ids from a CSV file.

Each record is expected to contain an id (String) and a text (String), in that order.

Note that the CSV file must not have a header.


textSet = TextSet.readCSV(path, sc = null, minPartitions = 1)


text_set = TextSet.read_csv(path, sc=None, min_partitions=1)
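To make the expected file layout concrete, here is a minimal sketch that parses a header-less CSV stream into (id, text) pairs using Python's standard `csv` module. It is only an illustration of the record format: `read_id_text_rows` is a hypothetical helper, and whether the library's own reader handles quoted commas the same way is an assumption.

```python
import csv
import io


def read_id_text_rows(csv_file):
    """Parse a header-less CSV stream into (id, text) pairs, both kept as strings."""
    records = []
    for row in csv.reader(csv_file):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 columns (id, text), got {len(row)}: {row!r}")
        text_id, text = row
        records.append((text_id, text))
    return records


# The first line is already data, since the file has no header row.
sample = io.StringIO('q1,how are labels assigned\nq2,"what is a text set, exactly?"\n')
records = read_id_text_rows(sample)
```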

Read texts from parquet file

Read texts with ids from a Parquet file whose schema is id (String) and text (String). Returns a DistributedTextSet.


textSet = TextSet.readParquet(path, sqlContext)


text_set = TextSet.read_parquet(path, sc)

TextSet Transformations

Analytics Zoo provides many transformation methods that can be chained on a TextSet to form a text preprocessing pipeline. Each method returns the transformed TextSet, which can be used directly for training and inference:


Tokenization

Tokenize the original text.


transformedTextSet = textSet.tokenize()


transformed_text_set = text_set.tokenize()


Normalization

Remove all dirty (non-English-alphabet) characters from tokens and convert words to lower case. The TextSet must be tokenized first.


transformedTextSet = textSet.normalize()


transformed_text_set = text_set.normalize()
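The effect of tokenizing and then normalizing can be sketched in plain Python. This is not the library's implementation: the whitespace splitting rule is an assumption, and so is the decision to keep tokens that become empty after cleaning (the text above does not say whether they are dropped).

```python
import re


def tokenize(text):
    """Split text into tokens on whitespace (an assumed tokenization rule)."""
    return text.split()


def normalize(tokens):
    """Lower-case each token and strip every character outside a-z."""
    return [re.sub(r"[^a-z]", "", token.lower()) for token in tokens]


tokens = normalize(tokenize("Hello, World! It's 2024"))
print(tokens)  # ['hello', 'world', 'its', '']
```

Note that normalization operates on tokens, which is why tokenization has to come first.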

Word To Index

Map word tokens to indices.

Important: Take care that this method behaves a bit differently for training and inference.


During training, you need to generate a new word index correspondence from the texts you are dealing with. This method therefore first generates the vocabulary and then converts words to indices based on it.

The arguments listed below constrain how the vocabulary is generated. In the resulting vocabulary, indices start from 1 and are assigned in descending order of word frequency, so the most frequent word gets index 1.

By convention, index 0 is reserved for unknown words. The TextSet must be tokenized first.

After word2idx, you can get the generated word index vocabulary by calling getWordIndex (Scala) or get_word_index() (Python) of the transformed TextSet.

You can also call saveWordIndex(path) (Scala) or save_word_index(path) (Python) to save this word index vocabulary for use in future training.


During inference, you should use exactly the same word index correspondence as in the training stage instead of generating a new one. The TextSet must be tokenized first.

You therefore do not need to specify any of the arguments below.

Call loadWordIndex(path) (Scala) or load_word_index(path) (Python) beforehand to load the word index.


transformedTextSet = textSet.word2idx(removeTopN = 0, maxWordsNum = -1, minFreq = 1, existingMap = null)


transformed_text_set = text_set.word2idx(remove_topN=0, max_words_num=-1, min_freq=1, existing_map=None)
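The training and inference behaviour described above can be sketched without the library. The helpers `build_word_index` and `words_to_indices` are hypothetical; the argument names mirror the Python signature, but the order in which the constraints are applied (`min_freq`, then `remove_topN`, then `max_words_num`) and the alphabetical tie-breaking between equally frequent words are assumptions.

```python
from collections import Counter


def build_word_index(token_lists, remove_topN=0, max_words_num=-1, min_freq=1):
    """Training side: build a word -> index map, with indices starting from 1."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Keep only words that occur at least min_freq times.
    kept = [(word, n) for word, n in counts.items() if n >= min_freq]
    # Most frequent first; ties broken alphabetically (an assumption).
    kept.sort(key=lambda item: (-item[1], item[0]))
    # Drop the remove_topN most frequent words.
    kept = kept[remove_topN:]
    # Cap the vocabulary size; -1 means no limit.
    if max_words_num > 0:
        kept = kept[:max_words_num]
    return {word: index for index, (word, _) in enumerate(kept, start=1)}


def words_to_indices(tokens, word_index):
    """Map tokens to indices, using 0 for words missing from the vocabulary."""
    return [word_index.get(token, 0) for token in tokens]


train = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
word_index = build_word_index(train)

# Inference side: reuse the training vocabulary instead of building a new one.
print(words_to_indices(["the", "bird", "sat"], word_index))  # [1, 0, 2]
```

Reusing `word_index` at inference time is what keeps indices consistent with the ones the model saw during training; rebuilding the vocabulary from inference texts would silently remap words.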

Sequence Shaping

Shape the sequence of indices to a fixed length. word2idx must be applied first.


transformedTextSet = textSet.shapeSequence(len, truncMode = TruncMode.pre, padElement = 0)


transformed_text_set = text_set.shape_sequence(len, trunc_mode="pre", pad_element=0)
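A minimal sketch of what shaping to a fixed length involves is given below. It is a standalone illustration rather than the library's code, and it rests on two assumptions not stated above: "pre" truncation drops elements from the beginning of the sequence ("post" from the end), and padding is appended at the end.

```python
def shape_sequence(indices, length, trunc_mode="pre", pad_element=0):
    """Return a copy of indices that has exactly `length` elements."""
    if trunc_mode not in ("pre", "post"):
        raise ValueError("trunc_mode must be 'pre' or 'post'")
    if len(indices) > length:
        if trunc_mode == "pre":
            # Drop elements from the beginning, keeping the last `length`.
            return indices[len(indices) - length:]
        # Drop elements from the end, keeping the first `length`.
        return indices[:length]
    # Pad at the end when the sequence is too short (assumed padding side).
    return indices + [pad_element] * (length - len(indices))


print(shape_sequence([5, 8, 2, 9, 4], 3))                     # [2, 9, 4]
print(shape_sequence([5, 8, 2, 9, 4], 3, trunc_mode="post"))  # [5, 8, 2]
print(shape_sequence([5, 8], 4))                              # [5, 8, 0, 0]
```

Using 0 as the default pad element lines up with the convention that index 0 is reserved and never assigned to a vocabulary word.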

BigDL Sample Generation

Transform the indices and the label (if any) into a BigDL Sample. word2idx must be applied first.


transformedTextSet = textSet.generateSample()


transformed_text_set = text_set.generate_sample()


WordEmbedding

WordEmbedding is a special Embedding layer that directly loads pre-trained word vectors as its weights and turns non-negative integers (indices) into dense vectors of fixed size.

Currently, only GloVe embeddings are supported for this layer.

The input of this layer should be 2D.


embedding = WordEmbedding(embeddingFile, wordIndex = null, trainable = false, inputLength = -1)


embedding = WordEmbedding(embedding_file, word_index=None, trainable=False, input_length=None)
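Conceptually, such a layer is a lookup table whose rows come from the pre-trained file. The sketch below shows the idea in plain Python, using the GloVe text format (one word per line followed by its space-separated vector components). All three helpers are hypothetical, and filling the rows for index 0 and for words absent from the file with zeros is an assumption, not documented behaviour of the layer.

```python
def load_glove_vectors(lines):
    """Parse GloVe-format lines ('word v1 v2 ...') into a word -> vector map."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_table(vectors, word_index):
    """Build a table whose row i holds the vector of the word with index i."""
    if not vectors or not word_index:
        raise ValueError("vectors and word_index must be non-empty")
    dim = len(next(iter(vectors.values())))
    # Row 0 (unknown words) and words missing from the file stay all-zero (assumed).
    table = [[0.0] * dim for _ in range(max(word_index.values()) + 1)]
    for word, index in word_index.items():
        if word in vectors:
            table[index] = vectors[word]
    return table


def embed(batch, table):
    """Turn a 2D batch of indices into a 3D batch of dense vectors."""
    return [[table[i] for i in sequence] for sequence in batch]


glove_lines = ["cat 0.1 0.2", "dog 0.3 0.4"]
table = build_embedding_table(load_glove_vectors(glove_lines), {"cat": 1, "dog": 2, "bird": 3})
embedded = embed([[1, 2], [3, 0]], table)
```

The 2D input here is a batch of index sequences, and the output adds one dimension for the embedding vector of each index.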