TensorFlow recommends a dataset format called TFRecord. Once you master TFRecord, you can train efficiently on large-scale data. In this article, I will explain how to write and read TFRecord files, and then actually implement an input pipeline using QueueRunner.
Reasons to use TFRecord
The content of a TFRecord file is a binary format based on Protocol Buffers. Creating a TFRecord once can reduce the cost of generating and preprocessing data on every run. The TFRecord format can also be used as the input data format for Cloud ML Engine.
When doing machine learning with TensorFlow, there are the following ways to read the training dataset:
(1) Load all the data into memory beforehand
(2) Read the data little by little in Python code and feed it to the graph with feed_dict
(3) Read from TFRecord using Threading and Queues on the graph
(4) Use the Dataset API
(1) is effective when the dataset is small. If you load the file into memory once, you can feed it to the graph quickly. However, as the data grows and memory becomes tight, processing speed may drop or a memory allocation error may occur.
(2) is also a good choice for a simple prototype, since it saves you the trouble of creating TFRecord files. However, when running in a single thread, data reading and training happen synchronously, so the overall training time can become long. Also, when changing the machine learning model or tuning it, you may need to repeat the same preprocessing many times. If you run similar data processing every time, consider creating a TFRecord.
When using TFRecord, you feed TensorFlow's computation graph with method (3) or (4). Because multi-threaded queues are used on the computation graph, training and the reading and preprocessing of the dataset can run asynchronously.
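As a concrete illustration of method (2), here is a minimal sketch of feeding a batch through a placeholder (the placeholder `x` and the toy batch are made up for illustration; written against `tf.compat.v1` so it also runs on TensorFlow 2.x):

```python
import numpy as np
import tensorflow.compat.v1 as tf  # on TensorFlow 1.x: `import tensorflow as tf`

tf.disable_eager_execution()  # needed on 2.x only; 1.x runs in graph mode by default

# Method (2): a Python-side loop reads a mini-batch and feeds it
# into the graph through a placeholder on every step.
x = tf.placeholder(tf.float32, shape=[None, 784])
mean = tf.reduce_mean(x)

with tf.Session() as sess:
    batch = np.ones((32, 784), dtype=np.float32)  # stand-in for data read in Python
    result = sess.run(mean, feed_dict={x: batch})
print(result)  # 1.0
```

Because the feed happens on the Python side, reading and training are synchronous here, which is exactly the limitation described above.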
How to create a TFRecord
Let's create a TFRecord right away. This time we will learn how to make a TFRecord using Fashion MNIST as an example. Fashion MNIST is a dataset for classifying 28×28 images of clothing into 10 categories.
Since download links are provided on the Fashion MNIST page, create a data/fashion directory and save the files there.

```shell
$ mkdir -p data/fashion
$ cd data/fashion
$ wget http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
$ wget http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
$ wget http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
$ wget http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
$ cd ../..
```

By doing this, you can read the data from TensorFlow in the same way as MNIST.
```python
from tensorflow.examples.tutorials.mnist import input_data

fashion_mnist = input_data.read_data_sets('data/fashion')
```
Example record and SequenceExample record
A TFRecord file is written with tf.train.Example or tf.train.SequenceExample as the unit of one record. tf.train.Example handles fixed-length lists such as numbers and images. Each value of a record is a tf.train.Feature, and the following data types are available: BytesList, FloatList, and Int64List. In each case, the value is specified as a list, like [value].
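As a sketch of the three list types (the feature names label, scores, and image_raw are just examples), each value is wrapped in a tf.train.Feature and then collected into a tf.train.Example:

```python
import tensorflow as tf

# Each value is wrapped in one of the three tf.train.Feature list types.
# Note that the value is always given as a list, e.g. [value].
int_feature = tf.train.Feature(int64_list=tf.train.Int64List(value=[7]))
float_feature = tf.train.Feature(float_list=tf.train.FloatList(value=[0.5, 1.5]))
bytes_feature = tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"abc"]))

example = tf.train.Example(features=tf.train.Features(feature={
    "label": int_feature,
    "scores": float_feature,
    "image_raw": bytes_feature,
}))
print(example.features.feature["label"].int64_list.value)  # [7]
```

SerializeToString() on the resulting proto gives the byte string that is actually written to the TFRecord file.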
tf.train.SequenceExample is a data format with a fixed-length context and variable-length feature_lists. Use tf.train.SequenceExample when training on sequential data such as text or time series.
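A minimal sketch of such a record (the length/tokens feature names and the word IDs are hypothetical):

```python
import tensorflow as tf

# context holds fixed-length metadata shared by the whole record.
context = tf.train.Features(feature={
    "length": tf.train.Feature(int64_list=tf.train.Int64List(value=[3])),
})
# feature_lists holds a variable-length sequence, one Feature per step;
# the word IDs below are made up for illustration.
tokens = tf.train.FeatureList(feature=[
    tf.train.Feature(int64_list=tf.train.Int64List(value=[tid]))
    for tid in [4, 8, 15]
])
seq_example = tf.train.SequenceExample(
    context=context,
    feature_lists=tf.train.FeatureLists(feature_list={"tokens": tokens}),
)
print(len(seq_example.feature_lists.feature_list["tokens"].feature))  # 3
```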
Creating a TFRecord from Fashion MNIST
Let's save Fashion MNIST in TFRecord format. For NumPy arrays like these, you can convert the array to the Bytes format by using the tobytes() method. Running the conversion produces fashion_mnist_test.tfrecord in the current directory.
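A minimal sketch of the conversion, with small synthetic arrays standing in for fashion_mnist.test.images and fashion_mnist.test.labels (the feature names image and label are my choice; on older TensorFlow 1.x the writer class is tf.python_io.TFRecordWriter instead of tf.io.TFRecordWriter):

```python
import numpy as np
import tensorflow as tf

# Stand-ins for fashion_mnist.test.images / fashion_mnist.test.labels.
images = np.random.rand(4, 28 * 28).astype(np.float32)
labels = np.array([0, 1, 2, 3], dtype=np.int64)

with tf.io.TFRecordWriter("fashion_mnist_test.tfrecord") as writer:
    for image, label in zip(images, labels):
        # tobytes() flattens the NumPy array into a raw byte string.
        example = tf.train.Example(features=tf.train.Features(feature={
            "image": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[image.tobytes()])),
            "label": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[int(label)])),
        }))
        writer.write(example.SerializeToString())
```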
How to check the contents of a TFRecord
tf.train.Example.FromString is convenient if you want to check the structure of a TFRecord you exported in the past. Looking at the parsed record, you will find, for example, which features such as width it contains.
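A sketch of inspecting one serialized record this way (the record is built in place with made-up image, width, and label features rather than read from a file):

```python
import numpy as np
import tensorflow as tf

# Build a small serialized record like the ones written above.
image = np.zeros((28, 28), dtype=np.float32)
record = tf.train.Example(features=tf.train.Features(feature={
    "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image.tobytes()])),
    "width": tf.train.Feature(int64_list=tf.train.Int64List(value=[28])),
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[9])),
})).SerializeToString()

# FromString recovers the proto structure from the raw bytes,
# so you can see which features the record contains.
parsed = tf.train.Example.FromString(record)
print(sorted(parsed.features.feature.keys()))  # ['image', 'label', 'width']
```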
How to read a TFRecord
Records can be loaded using tf.parse_single_example. Please be aware that features written as a BytesList are parsed back as strings, so you need to restore the original data type with tf.decode_raw.
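A sketch of parsing one record back, written against the tf.io namespace with eager execution (on TensorFlow 1.x the same ops are tf.parse_single_example and tf.decode_raw, run inside a session; the serialized record here is built in place from made-up data):

```python
import numpy as np
import tensorflow as tf

# A serialized record like the ones written above (synthetic image).
image = np.arange(28 * 28, dtype=np.float32)
serialized = tf.train.Example(features=tf.train.Features(feature={
    "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image.tobytes()])),
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[3])),
})).SerializeToString()

# Declare the expected schema, then parse one record.
parsed = tf.io.parse_single_example(serialized, features={
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
})
# The BytesList field comes back as a string tensor;
# decode_raw restores the original float32 values.
decoded = tf.io.decode_raw(parsed["image"], tf.float32)
print(decoded.shape)  # (784,)
```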
Actually implementing it
Let's actually run training both without and with TFRecord. In the case of Fashion MNIST, the data volume is not that large, so the whole dataset fits in memory, but the input to the computation graph should still be asynchronous.
In this example there is little benefit, because no preprocessing or database I/O occurs and everything can be expanded in memory.
Consider using TFRecord when data I/O occurs in real time on a huge dataset, when doing distributed training across multiple machines, and so on.
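A minimal end-to-end sketch of method (3) with QueueRunner, assuming TensorFlow 1.x-style graph APIs (written against tf.compat.v1 so it also runs on 2.x; the file name and feature names are my own, and a tiny synthetic TFRecord is written first so the example is self-contained):

```python
import numpy as np
import tensorflow.compat.v1 as tf  # on TensorFlow 1.x: `import tensorflow as tf`

tf.disable_eager_execution()  # needed on 2.x only

# Write a tiny TFRecord file of synthetic records first.
with tf.python_io.TFRecordWriter("queue_demo.tfrecord") as writer:
    for label in range(10):
        image = np.random.rand(784).astype(np.float32)
        example = tf.train.Example(features=tf.train.Features(feature={
            "image": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[image.tobytes()])),
            "label": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[label])),
        }))
        writer.write(example.SerializeToString())

# A queue of input file names; a reader pops serialized records from it.
filename_queue = tf.train.string_input_producer(["queue_demo.tfrecord"])
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)
parsed = tf.parse_single_example(serialized, features={
    "image": tf.FixedLenFeature([], tf.string),
    "label": tf.FixedLenFeature([], tf.int64),
})
image = tf.decode_raw(parsed["image"], tf.float32)
image.set_shape([784])  # shuffle_batch needs fully defined shapes

# shuffle_batch starts background threads that keep the queue filled
# while the training loop consumes batches asynchronously.
images, labels = tf.train.shuffle_batch(
    [image, parsed["label"]], batch_size=4,
    capacity=100, min_after_dequeue=10)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    batch_images, batch_labels = sess.run([images, labels])
    print(batch_images.shape)  # (4, 784)
    coord.request_stop()
    coord.join(threads)
```

In a real training loop, the sess.run call would also evaluate the train op, so reading from the TFRecord and optimizing the model overlap in time.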
The same pipeline can also be written concisely using the Dataset API, which I have introduced before; please refer to that article as well.