Working with Texts
Analytics Zoo provides a series of text related APIs for end-to-end text processing pipeline, including text loading, pre-processing, training and inference, etc.
TextSet
TextSet
is a collection of TextFeatures where each TextFeature
keeps information of a single text record.
TextSet
can either be a DistributedTextSet
consisting of text RDD or a LocalTextSet
consisting of text array.
Read texts as TextSet
Read texts from a directory
Read texts with labels from a directory.
Under this specified directory path, there are supposed to be several subdirectories, each of which contains a number of text files belonging to this category. Each category will be a given a label (starting from 0) according to its position in the ascending order sorted among all subdirectories. Each text will be a given a label according to the directory where it is located.
Scala
textSet = TextSet.read(path, sc = null, minPartitions = 1)
path
: String. Folder path to texts. Local or distributed file system (such as HDFS) are supported. If you want to read from HDFS, sc needs to be specified.sc
: An instance of SparkContext. If specified, texts will be read as a DistributedTextSet. Default is null and in this case texts will be read as a LocalTextSet.minPartitions
: Integer. A suggestion value of the minimal partition number for input texts. Only need to specify this when sc is not null. Default is 1.
Python
text_set = TextSet.read(path, sc=None, min_partitions=1)
path
: String. Folder path to texts. Local or distributed file system (such as HDFS) are supported. If you want to read from HDFS, sc needs to be defined.sc
: An instance of SparkContext. If specified, texts will be read as a DistributedTextSet. Default is None and in this case texts will be read as a LocalTextSet.min_partitions
: Int. A suggestion value of the minimal partition number for input texts. Only need to specify this when sc is not None. Default is 1.
Read texts from csv file
Read texts with id from csv file.
Each record is supposed to contain id(String) and text(String) in order.
Note that the csv file should be without header.
Scala
textSet = TextSet.readCSV(path, sc = null, minPartitions = 1)
path
: String. The path to the csv file. Local or distributed file system (such as HDFS) are supported. If you want to read from HDFS, sc needs to be specified.sc
: An instance of SparkContext. If specified, texts will be read as a DistributedTextSet. Default is null and in this case texts will be read as a LocalTextSet.minPartitions
: Integer. A suggestion value of the minimal partition number for input texts. Only need to specify this when sc is not null. Default is 1.
Python
text_set = TextSet.read_csv(path, sc=None, min_partitions=1)
path
: String. The path to the csv file. Local or distributed file system (such as HDFS) are supported. If you want to read from HDFS, sc needs to be defined.sc
: An instance of SparkContext. If specified, texts will be read as a DistributedTextSet. Default is None and in this case texts will be read as a LocalTextSet.min_partitions
: Int. A suggestion value of the minimal partition number for input texts. Only need to specify this when sc is not None. Default is 1.
Read texts from parquet file
Read texts with id from parquet file with schema id(String) and text(String). Return a DistributedTextSet.
Scala
textSet = TextSet.readParquet(path, sqlContext)
path
: The path to the parquet file.sqlContext
: An instance of SQLContext.
Python
text_set = TextSet.read_parquet(path, sc)
path
: The path to the parquet file.sc
: An instance of SparkContext.
Build Text Transformation Pipeline
You can easily call transformation methods of a TextSet one by one to build the text transformation pipeline. Please refer to here for more details.
Scala Example
transformedTextSet = textSet.tokenize().normalize().word2idx().shapeSequence(len).generateSample()
Python Example
transformed_text_set = text_set.tokenize().normalize().word2idx().shape_sequence(len).generate_sample()
Text Training
After doing text transformation, you can directly feed the transformed TextSet into the model for training.
Scala
model.fit(transformedTextSet, batchSize, nbEpoch)
Python
model.fit(transformed_text_set, batch_size, nb_epoch)
Word Index Save and Load
Save word index
After training the model, you can save the word index correspondence to text file, which can be used for future inference. Each separate line will be "word id".
For LocalTextSet, save txt to a local file system.
For DistributedTextSet, save txt to a local or distributed file system (such as HDFS).
Scala
transformedTextSet.saveWordIndex(path)
Python
transformed_text_set.save_word_index(path)
Load word index
During text prediction, you can load the saved word index back, so that the prediction TextSet uses exactly the same word index as the training process. Each separate line should be "word id".
For LocalTextSet, load txt to a local file system.
For DistributedTextSet, load txt to a local or distributed file system (such as HDFS).
Scala
textSet.loadWordIndex(path)
Python
text_set.load_word_index(path)
Text Prediction
Given a raw TextSet to do prediction, you need to first load the saved word index back as instructed above and go through the same transformation
process as what you did in your training. Note that here you do not need to specify any argument when calling word2idx
in the preprocessing pipeline as now you are using exactly
the loaded word index.
Then you can directly feed the transformed TextSet into the model for prediction and the prediction result will be stored in each TextFeature.
Scala
predictionTextSet = model.predict(transformedTextSet)
Python
prediction_text_set = model.predict(transformed_text_set)
Examples
You can refer to our TextClassification example for TextSet transformation, training and inference.
See here for the Scala example.
See here for the Python example.