TensorFlow data formats | How to write and read TFRecord

TensorFlow recommends a dataset format called TFRecord.

Once you master TFRecord, you can train on large-scale data efficiently.

In this article, I will explain how to write and read TFRecord files so that you can master TFRecord, and actually implement an input pipeline using QueueRunner.

Reasons to use TFRecord

A TFRecord file stores its contents in a binary format based on Protocol Buffers. Creating a TFRecord once can reduce the cost of generating and preprocessing data on every run. You can also use the TFRecord format as the input data format for Cloud ML Engine.
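To get a feel for how simple the container is, here is a sketch of the on-disk framing of a TFRecord file using only the standard library. Note this is an illustration only: the real format stores masked CRC32C checksums where we write zeros, so TensorFlow's own reader would reject files produced this way.

```python
import struct

def write_records(path, records):
    """Write byte strings with TFRecord-like framing (CRC fields zeroed).

    Real TFRecord files store masked CRC32C checksums of the length and of
    the data; zeros are written here purely to illustrate the layout.
    """
    with open(path, 'wb') as f:
        for data in records:
            f.write(struct.pack('<Q', len(data)))  # 8-byte little-endian length
            f.write(struct.pack('<I', 0))          # masked CRC of length (zeroed)
            f.write(data)                          # the serialized record itself
            f.write(struct.pack('<I', 0))          # masked CRC of data (zeroed)

def read_records(path):
    """Read the records back, skipping the CRC fields without validating them."""
    records = []
    with open(path, 'rb') as f:
        while True:
            header = f.read(8)
            if not header:
                break
            (length,) = struct.unpack('<Q', header)
            f.read(4)                       # skip the length CRC
            records.append(f.read(length))
            f.read(4)                       # skip the data CRC
    return records
```

Each record is just a length-delimited blob, which is why a TFRecord file can hold any serialized Protocol Buffer message.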

When doing machine learning with TensorFlow, there are the following ways to read the training dataset:

(1) Load all the data into memory beforehand
(2) Read it little by little in Python code and feed it to the graph with feed_dict
(3) Read from TFRecord on the graph using Threading and Queues
(4) Use the Dataset API

(1) is effective when the dataset is small. If you load the file into memory once, you can feed it to the graph quickly. However, as the data grows and memory comes under pressure, processing may slow down or memory allocation errors may occur.

(2) is also a good choice when you want a simple prototype, because it saves you the trouble of regenerating TFRecord files. However, when running in a single thread, data reading and training happen synchronously, so overall training time can become long. Also, when changing the model or tuning, you may need to repeat the same preprocessing many times. If you run the same data processing on every run, consider creating a TFRecord.

When using TFRecord, you feed TensorFlow's computation graph by method (3) or (4). Because multi-threaded queues are used on the computation graph, training and the reading/preprocessing of the dataset can run asynchronously.
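What (3) buys you can be sketched with a plain Python producer/consumer queue. This is only an analogy for what QueueRunner does on the graph, not TensorFlow code, and the names `producer` and `consume_all` are made up for illustration:

```python
import queue
import threading

def producer(q, n):
    # Plays the role of the reader thread: decode records and enqueue them.
    for i in range(n):
        q.put(i * i)        # stand-in for "read and decode one record"
    q.put(None)             # sentinel meaning "no more data"

def consume_all(n):
    # Plays the role of the training loop: dequeue examples as they arrive.
    q = queue.Queue(maxsize=4)   # bounded, like a capacity-limited TF queue
    t = threading.Thread(target=producer, args=(q, n))
    t.start()
    results = []
    while True:
        item = q.get()
        if item is None:
            break
        results.append(item)
    t.join()
    return results
```

The producer keeps the bounded queue filled in the background, so the consumer never has to wait for the whole dataset to be read before it can start working.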

How to create a TFRecord

Let's create a TFRecord right away. This time we will learn how to make a TFRecord using Fashion MNIST as an example. Fashion MNIST is a dataset for classifying 28 × 28 clothing images into 10 categories.

Download the files linked from the Fashion MNIST page, creating a data/fashion directory to save them in.

$ mkdir -p data/fashion
$ cd data/fashion
$ wget http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
$ wget http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
$ wget http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
$ wget http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
$ cd ../..

By doing this, you can load the data with TensorFlow's reader just like MNIST:
from tensorflow.examples.tutorials.mnist import input_data
fashion_mnist = input_data.read_data_sets('data/fashion')

Example record and SequenceExample record

A TFRecord file is written one record at a time, where each record is a tf.train.Example or a tf.train.SequenceExample. tf.train.Example handles fixed-length lists such as numbers and images. The value of each record is specified with tf.train.Feature. The following data types are available for tf.train.Feature:

  • tf.train.Int64List
  • tf.train.FloatList
  • tf.train.BytesList

Specify each value as a list, like [value], as follows:

tf.train.Example(features=tf.train.Features(feature={
    'height': tf.train.Feature(int64_list=tf.train.Int64List(value=[height])),
    'width' : tf.train.Feature(int64_list=tf.train.Int64List(value=[width])),
    'depth' : tf.train.Feature(int64_list=tf.train.Int64List(value=[depth])),
    'image' : tf.train.Feature(bytes_list=tf.train.BytesList(value=[image]))
}))

tf.train.SequenceExample is a data format with a fixed-length context and variable-length feature_lists. Use tf.train.SequenceExample when training on sequential data such as text or time series.

example = tf.train.SequenceExample()
# Fixed-length values go through context
example.context.feature["length"].int64_list.value.append(len(data))

# Variable-length data is specified through feature_lists
words_list = example.feature_lists.feature_list["words"]
for word in words:
    words_list.feature.add().int64_list.value.append(word_id(word))

Making a Fashion MNIST TFRecord

Let's save Fashion MNIST in TFRecord format. For NumPy arrays, you can convert the array to bytes with the tobytes() method.

import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

def make_example(image, label):
    return tf.train.Example(features=tf.train.Features(feature={
        'image' : tf.train.Feature(bytes_list=tf.train.BytesList(value=[image])),
        'label' : tf.train.Feature(bytes_list=tf.train.BytesList(value=[label]))
    }))

def write_tfrecord(images, labels, filename):
    writer = tf.python_io.TFRecordWriter(filename)
    for image, label in zip(images, labels):
        ex = make_example(image.tobytes(), label.tobytes())
        writer.write(ex.SerializeToString())
    writer.close()

def main():
    fashion_mnist = input_data.read_data_sets('data/fashion', one_hot=True)
    train_images  = fashion_mnist.train.images
    train_labels  = fashion_mnist.train.labels
    test_images   = fashion_mnist.test.images
    test_labels   = fashion_mnist.test.labels
    write_tfrecord(train_images, train_labels, 'fashion_mnist_train.tfrecord')
    write_tfrecord(test_images, test_labels, 'fashion_mnist_test.tfrecord')

if __name__ == '__main__':
    main()

When this code is executed, fashion_mnist_train.tfrecord and fashion_mnist_test.tfrecord will be saved in the current directory.
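Note that tobytes() discards the dtype and shape: the raw bytes alone do not say how to interpret themselves, so whoever reads them back must know both. This is exactly why the dtype passed to tf.decode_raw when reading has to match the dtype the array was written with. A NumPy-only sketch of the roundtrip:

```python
import numpy as np

# Serialize a float32 "image" the same way write_tfrecord does.
image = np.arange(4, dtype=np.float32).reshape(2, 2)
raw = image.tobytes()               # 4 values * 4 bytes = 16 raw bytes

# Decoding with the matching dtype recovers the original values...
restored = np.frombuffer(raw, dtype=np.float32).reshape(2, 2)
assert np.array_equal(restored, image)

# ...but decoding with the wrong dtype silently misinterprets the bytes.
wrong = np.frombuffer(raw, dtype=np.float64)
assert wrong.shape == (2,)          # 16 bytes / 8 bytes per float64
```

A dtype mismatch does not raise an error; it just produces garbage values, which makes this an easy bug to miss.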

How to check the contents of TFRecord

tf.train.Example.FromString is convenient when you want to inspect the structure of a TFRecord file you exported in the past.

In [1]: import tensorflow as tf

In [2]: example = next(tf.python_io.tf_record_iterator("fashion_mnist_train.tfrecord"))

In [3]: tf.train.Example.FromString(example)
Out[3]:
features {
  feature {
    key: "image"
    value {
      bytes_list {
        value: "\000...\000"
      }
    }
  }
  feature {
    key: "label"
    value {
      bytes_list {
        value: "\000...\000"
      }
    }
  }
}

You can see which features, such as image and label (or height and width, if you stored them), are present.

How to read TFRecord

A TFRecord can be loaded using tf.parse_single_example. Note that values written as BytesList must be read as tf.string.

def read_tfrecord(filename):
    filename_queue = tf.train.string_input_producer([filename])
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)

    features = tf.parse_single_example(
        serialized_example,
        features={
            'image': tf.FixedLenFeature([], tf.string),
            'label': tf.FixedLenFeature([], tf.string)
        })

    # decode_raw must be given the same dtypes the arrays were written with
    image = tf.decode_raw(features['image'], tf.float32)
    label = tf.decode_raw(features['label'], tf.float64)

    image = tf.reshape(image, [28, 28, 1])
    label = tf.reshape(label, [10])

    image, label = tf.train.batch([image, label],
            batch_size=16,
            capacity=500)

    return image, label

Putting it into practice

Let's actually train using the TFRecord we created. In the case of Fashion MNIST, the data volume is small enough that the whole dataset would fit in memory, but the input to the computation graph is still fed asynchronously.

Use TFRecord

import tensorflow as tf
import tfrecord_io  # the module containing the read_tfrecord function defined above
from tensorflow.contrib import slim

def model(image, label):
    net = slim.conv2d(image, 48, [5,5], scope='conv1')
    net = slim.max_pool2d(net, [2,2], scope='pool1')
    net = slim.conv2d(net, 96, [5,5], scope='conv2')
    net = slim.max_pool2d(net, [2,2], scope='pool2')
    net = slim.flatten(net, scope='flatten')
    net = slim.fully_connected(net, 512, scope='fully_connected1')
    logits = slim.fully_connected(net, 10,
            activation_fn=None, scope='fully_connected2')

    prob = slim.softmax(logits)
    loss = slim.losses.softmax_cross_entropy(logits, label)

    train_op = slim.optimize_loss(loss, slim.get_global_step(),
            learning_rate=0.001,
            optimizer='Adam')

    return train_op

def main():
    train_images, train_labels = tfrecord_io.read_tfrecord('fashion_mnist_train.tfrecord')
    train_op = model(train_images, train_labels)

    step = 0
    with tf.Session() as sess:
        init_op = tf.group(
            tf.local_variables_initializer(),
            tf.global_variables_initializer())
        sess.run(init_op)
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        while step < 3000:
            sess.run([train_op])

            if step % 100 == 0:
                print('step: {}'.format(step))

            step += 1

        coord.request_stop()
        coord.join(threads)

if __name__ == '__main__':
    main()

Summary

In this example there is little benefit, because no preprocessing or database I/O occurs and everything fits in memory.

Consider using TFRecord when data I/O happens in real time against a huge dataset, when doing distributed training across multiple machines, and so on.

The input pipeline can also be written more concisely using the Dataset API, which we have introduced before, so please refer to that article as well.