tensorflow - Dataset iterator in distributed TensorFlow
The iterator appears to be a local variable: it doesn't share its index across multiple devices, and it isn't restored from a checkpoint. When using large input data, it would be useful to share the index across multiple devices and to restore the index from a checkpoint. Is there a work-around for using tf.contrib.data.Dataset in a multi-device environment, or am I missing a feature of the Dataset APIs?
Here is my test code snippet.
import time

import tensorflow as tf
from tensorflow.python import debug as tf_debug
from tensorflow.python.ops import state_ops

# device_and_target(), is_chief, debug_mode and FLAGS are not shown here.

def pipeline():
    dataset1 = tf.contrib.data.Dataset.range(100)#.shuffle(10).repeat()
    dataset2 = tf.contrib.data.Dataset.range(100)#.shuffle(10).repeat()
    dataset = tf.contrib.data.Dataset.zip((dataset1, dataset2))#.batch(10)
    # shared_name is set in the hope of sharing the iterator across devices
    iterator = dataset.make_initializable_iterator(shared_name='shuffled_idx')
    return iterator

test_ops = []
device, target = device_and_target()
with tf.device(device):
    global_step = tf.Variable(tf.constant(0), trainable=False, name='global_step')
    apply_updates = state_ops.assign_add(global_step, 1).op
    test_ops.append(apply_updates)
    iterator = pipeline()
    get_next = iterator.get_next()
    test_ops.append(get_next)

with tf.train.MonitoredTrainingSession(
        master=target,
        is_chief=is_chief,
        #hooks=hooks,
        checkpoint_dir=FLAGS.train_dir
        ) as sess:
    if debug_mode:
        sess = tf_debug.LocalCLIDebugWrapperSession(sess._sess._sess._sess._sess)
        sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)
    sess.run(iterator.initializer)
    for i in range(5):
        # One step: increment global_step and fetch the next element pair
        _, (next1, next2) = sess.run(test_ops)
        print(next1, next2)
        print('global step : %d' % sess.run(global_step))
        time.sleep(1)
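I ran it on 2 devices, twice. The input data from the iterator is the same on both devices, since the iterator doesn't share its index, and it starts from the top of the queue again on the second execution, since the index is not saved in the checkpoint.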
Here is the output of device0 on the first execution:
(0, 0)
global step : 2
(1, 1)
global step : 4
(2, 2)
global step : 6
(3, 3)
global step : 8
(4, 4)
global step : 10
The output of device1 on the first execution:
(0, 0)
global step : 1
(1, 1)
global step : 3
(2, 2)
global step : 5
(3, 3)
global step : 7
(4, 4)
global step : 9
The output of device0 on the 2nd execution:
(0, 0)
global step : 12
(1, 1)
global step : 14
(2, 2)
global step : 16
(3, 3)
global step : 18
(4, 4)
global step : 20
The output of device1 on the 2nd execution:
(0, 0)
global step : 11
(1, 1)
global step : 13
(2, 2)
global step : 15
(3, 3)
global step : 17
(4, 4)
global step : 19
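To make the pattern in these outputs concrete, here is a small pure-Python model of the two runs. It is only a sketch and does not use TensorFlow: `SharedState` and `run_execution` are hypothetical names made up for illustration, and the model assumes that global_step lives in shared, checkpointed state while each worker's iterator position is private and recreated on every launch. The order in which the workers step (device1 first) is also an assumption, chosen to match the listings above.

```python
# Pure-Python model of the runs above; no TensorFlow required.
# Assumption: global_step is shared and survives restarts (checkpoint),
# while every worker rebuilds its own iterator from position 0 on launch.


class SharedState(object):
    """Hypothetical stand-in for variables that are shared and checkpointed."""

    def __init__(self):
        self.global_step = 0


def run_execution(shared, num_workers=2, steps=5, dataset_size=100):
    """Simulate one launch of all workers.

    Returns {worker_index: [(element_pair, global_step_seen), ...]}.
    """
    # Each worker gets a fresh, private iterator over the zipped ranges.
    iterators = {
        w: iter(zip(range(dataset_size), range(dataset_size)))
        for w in range(num_workers)
    }
    logs = {w: [] for w in range(num_workers)}
    for _ in range(steps):
        # Assumed ordering: the highest-numbered worker steps first,
        # which reproduces the interleaving seen in the listings.
        for w in reversed(range(num_workers)):
            shared.global_step += 1          # shared across workers
            element = next(iterators[w])     # private to this worker
            logs[w].append((element, shared.global_step))
    return logs


shared = SharedState()
first = run_execution(shared)    # first execution
second = run_execution(shared)   # second execution, shared state kept

for name, logs in (("first", first), ("second", second)):
    for w in sorted(logs):
        print("%s execution, device%d: %s" % (name, w, logs[w]))
```

In this model both workers read (0, 0) through (4, 4) in every execution, while global_step keeps counting from 1 to 20 across the two executions, which is the behaviour shown in the listings. If the only goal is to keep workers from reading the same elements, `tf.data.Dataset.shard(num_shards, index)` gives each worker a disjoint subset, but that alone does not persist the iterator position across restarts.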