tensorflow - Dataset iterator in distributed TensorFlow


The iterator is a local variable, so it does not share its index across multiple devices, and it is not restored from a checkpoint. With large input data it would be useful to share the index across multiple devices and to restore the index from a checkpoint. Is there a work-around for using tf.contrib.data.Dataset in a multi-device environment, or am I missing a feature of the Dataset APIs?

Here is a test code snippet.

import time

import tensorflow as tf
from tensorflow.python import debug as tf_debug
from tensorflow.python.ops import state_ops

# device_and_target(), is_chief, debug_mode and FLAGS are defined elsewhere
# in the script and are not shown in this snippet.

def pipeline():
    dataset1 = tf.contrib.data.Dataset.range(100)  # .shuffle(10).repeat()
    dataset2 = tf.contrib.data.Dataset.range(100)  # .shuffle(10).repeat()
    dataset = tf.contrib.data.Dataset.zip((dataset1, dataset2))  # .batch(10)
    iterator = dataset.make_initializable_iterator(shared_name='shuffled_idx')
    return iterator

test_ops = []
device, target = device_and_target()
with tf.device(device):
    # global_step is shared between workers and saved in the checkpoint.
    global_step = tf.Variable(tf.constant(0), trainable=False, name='global_step')
    apply_updates = state_ops.assign_add(global_step, 1).op
    test_ops.append(apply_updates)

    iterator = pipeline()
    get_next = iterator.get_next()
    test_ops.append(get_next)

with tf.train.MonitoredTrainingSession(
        master=target,
        is_chief=is_chief,
        # hooks=hooks,
        checkpoint_dir=FLAGS.train_dir
        ) as sess:

    if debug_mode:
        sess = tf_debug.LocalCLIDebugWrapperSession(sess._sess._sess._sess._sess)
        sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)

    sess.run(iterator.initializer)

    for i in range(5):
        _, (next1, next2) = sess.run(test_ops)
        print(next1, next2)
        print('global step : %d' % sess.run(global_step))
        time.sleep(1)

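I ran it on 2 devices, twice. The input data from the iterator is the same on both devices, since the iterator doesn't share its index, and every run starts again from the top of the queue, since the index is not saved in the checkpoint.

Here is the output of device0 for the first execution: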

(0, 0)
global step : 2
(1, 1)
global step : 4
(2, 2)
global step : 6
(3, 3)
global step : 8
(4, 4)
global step : 10

The output of device1 for the first execution:

(0, 0)
global step : 1
(1, 1)
global step : 3
(2, 2)
global step : 5
(3, 3)
global step : 7
(4, 4)
global step : 9

The output of device0 for the second execution:

(0, 0)
global step : 12
(1, 1)
global step : 14
(2, 2)
global step : 16
(3, 3)
global step : 18
(4, 4)
global step : 20

The output of device1 for the second execution:

(0, 0)
global step : 11
(1, 1)
global step : 13
(2, 2)
global step : 15
(3, 3)
global step : 17
(4, 4)
global step : 19
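
To make the difference between the observed and the desired behaviour concrete, here is a toy model in plain Python. It does not use TensorFlow and is not how tf.contrib.data works internally. The function `simulate` and its parameters are hypothetical names made up for this illustration, and the strict alternation between workers is an assumption. With `shared_iterator=False` each worker holds a private iterator while `global_step` is shared, which matches the pattern in the outputs above. With `shared_iterator=True` the workers pull from a single iterator, which is the behaviour the question asks for.

```python
def simulate(num_workers=2, steps_per_worker=5, shared_iterator=False, start_step=0):
    """Toy model of workers that alternate training steps.

    global_step is always shared (and would be checkpointed in the real job);
    the data iterator is either private to each worker or shared by all.
    Returns {worker_index: [(element, global_step_after_update), ...]}.
    """
    global_step = start_step
    if shared_iterator:
        # One iterator object, so the position is common to all workers.
        shared = iter(range(100))
        iterators = [shared] * num_workers
    else:
        # One iterator per worker, so every worker starts from element 0.
        iterators = [iter(range(100)) for _ in range(num_workers)]

    seen = {w: [] for w in range(num_workers)}
    for _ in range(steps_per_worker):
        for w in range(num_workers):  # assumed strict alternation
            global_step += 1
            seen[w].append((next(iterators[w]), global_step))
    return seen


if __name__ == "__main__":
    # Observed: both workers read the same elements, steps interleave.
    print(simulate(shared_iterator=False))
    # A "second execution" resumes global_step but restarts the data.
    print(simulate(shared_iterator=False, start_step=10))
    # Desired: workers read disjoint elements.
    print(simulate(shared_iterator=True))
```

In the private-iterator case, passing `start_step=10` mimics the second execution: the step counter continues from the checkpoint, but the data starts again from element 0 because the iterator position was never saved.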

