airflow - Can a workflow with many, many subdags be performant?
I have a workflow that involves many instances of SubDagOperator, with tasks generated in a loop. The pattern is illustrated by the following toy DAG file:
    from datetime import datetime, timedelta

    from airflow.models import DAG
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.operators.subdag_operator import SubDagOperator

    dag = DAG(
        'subdaggy-2',
        schedule_interval=None,
        start_date=datetime(2017, 1, 1)
    )

    def make_sub_dag(parent_dag, n):
        # Each subdag gets its own DAG object, named '<parent_id>.<task_id>'
        # as SubDagOperator requires.
        sub_dag = DAG(
            '%s.task_%d' % (parent_dag.dag_id, n),
            schedule_interval=parent_dag.schedule_interval,
            start_date=parent_dag.start_date
        )
        DummyOperator(task_id='task1', dag=sub_dag) >> DummyOperator(task_id='task2', dag=sub_dag)
        return sub_dag

    downstream_task = DummyOperator(task_id='downstream', dag=dag)

    for n in range(20):
        SubDagOperator(
            dag=dag,
            task_id='task_%d' % n,
            subdag=make_sub_dag(dag, n)
        ) >> downstream_task
I find this a convenient way to organize tasks, particularly since it helps keep the top-level DAG uncluttered, and more so if each subdag contains many tasks (i.e. tens, not a couple).
The problem is, this approach doesn't scale as the number of subdags (20 in the example) increases. I find that when the total number of DAG objects created in the overall workflow surpasses about 200 (which can happen in a production workflow if this pattern occurs several times), things grind to a halt.
So the question: is there a way to organize tasks this way (many similar subdags) that scales to hundreds or thousands of subdags? Profiling suggests the process spends a lot of time in the DAG object constructor. Perhaps there is a way to avoid instantiating a new DAG object for each of the SubDagOperators?
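For reference, the profiling observation can be reproduced without Airflow at all. The sketch below is a hypothetical stand-in (`fake_dag_ctor` is not a real Airflow function): it mimics a DAG constructor that calls inspect.stack() to record where the DAG was defined, the way older Airflow versions did, and runs it under cProfile to show where the time goes when a DAG file instantiates a couple hundred such objects.

    import cProfile
    import inspect
    import io
    import pstats

    def fake_dag_ctor():
        # Hypothetical stand-in for a DAG constructor that records the
        # caller's file by materializing the entire call stack.
        return inspect.stack()[1].filename

    def parse_workflow(n_dags):
        # Mimic a DAG file that instantiates one DAG object per subdag.
        for _ in range(n_dags):
            fake_dag_ctor()

    profiler = cProfile.Profile()
    profiler.enable()
    parse_workflow(200)
    profiler.disable()

    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats('cumulative').print_stats('stack')
    report = out.getvalue()
    print(report)

Running this, the inspect.stack machinery dominates the cumulative time, which matches what profiling a real DAG file with many subdags shows.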
Well, it seems at least a significant part of the problem is indeed due to the cost of the DAG constructor, which in turn is in large part due to the cost of inspect.stack(). I put together a simple patch to replace it with a cheaper method, and there does indeed seem to be an improvement: a flow with several thousand subdags that previously failed to load now loads for me. We'll see if it goes anywhere.
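To illustrate the kind of replacement involved (this is a sketch of the general technique, not the actual patch): inspect.stack() builds frame records for the whole call stack, including source-file lookups for every frame, while sys._getframe() just returns one frame object. Both can answer "which file called me?", but at very different cost:

    import inspect
    import sys
    import timeit

    def caller_file_expensive():
        # Materializes the entire call stack, touching the filesystem
        # to resolve source lines for every frame.
        return inspect.stack()[1].filename

    def caller_file_cheap():
        # Grabs only the immediate caller's frame object.
        return sys._getframe(1).f_code.co_filename

    # Both approaches give the same answer...
    assert caller_file_expensive() == caller_file_cheap()

    # ...but one is drastically cheaper per call.
    t_expensive = timeit.timeit(caller_file_expensive, number=200)
    t_cheap = timeit.timeit(caller_file_cheap, number=200)
    print('inspect.stack(): %.4fs  sys._getframe(): %.4fs' % (t_expensive, t_cheap))

With a constructor called once per DAG object, shaving this per-call cost is exactly what lets a workflow with hundreds or thousands of subdags parse in reasonable time. (sys._getframe is a CPython implementation detail, which is presumably why it wasn't used in the first place.)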