How to add a column to an exploded struct in Spark?


Say I have the following data:

{"id":1, "payload":[{"foo":1, "lol":2},{"foo":2, "lol":2}]} 

I explode payload and add a column to it, like this:

df = df.select('id', f.explode('payload').alias('data'))
df = df.withColumn('data.bar', f.col('data.foo') * 2)

However, this results in a DataFrame with three columns:

  • id
  • data
  • data.bar

I expected data.bar to be part of the data struct...

How can I add a column to the exploded struct, instead of adding a top-level column?

You can rebuild the struct, listing every field you want to keep:

df = df.withColumn('data', f.struct(
    df['data']['foo'].alias('foo'),
    (df['data']['foo'] * 2).alias('bar')
))

This results in:

root
 |-- id: long (nullable = true)
 |-- data: struct (nullable = false)
 |    |-- foo: long (nullable = true)
 |    |-- bar: long (nullable = true)
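Note that rebuilding the struct keeps only the fields you list explicitly, so the original lol field is dropped above; to keep it, add it to the f.struct(...) call as well. The underlying idea, that a struct behaves like an immutable record which must be copied wholesale to "add" a field, can be sketched with plain dicts (the add_bar helper below is a hypothetical illustration, not part of the Spark API):

```python
# Hypothetical illustration: "adding" a field to an immutable record
# means building a new record that copies every existing field and
# appends the derived one.
def add_bar(record):
    rebuilt = dict(record)               # copy all existing fields ('foo', 'lol')
    rebuilt['bar'] = record['foo'] * 2   # derive the new field from 'foo'
    return rebuilt

print(add_bar({'foo': 2, 'lol': 2}))  # {'foo': 2, 'lol': 2, 'bar': 4}
```

Any field you forget to copy is simply gone from the rebuilt record, which is exactly what happens to lol in the f.struct version above.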

Update:

from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType

def func(x):
    tmp = x.asDict()
    tmp['foo'] = tmp.get('foo', 0) * 100
    keys, vals = zip(*tmp.items())  # zip returns an iterator in Python 3, so unpack directly
    return Row(*keys)(*vals)

df = df.withColumn('data', f.UserDefinedFunction(func, StructType([
    StructField('foo', StringType()),
    StructField('lol', StringType())
]))(df['data']))
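The core of the UDF is a mutate-and-rebuild round trip: convert the Row to a dict, update the field, and build a new record from the result. A minimal sketch of that logic, using a plain dict in place of pyspark.sql.Row so it runs without a Spark session (an assumption for illustration only):

```python
# Sketch of the UDF body: copy the record (like Row.asDict()),
# update one field with a default for missing keys, and return
# the rebuilt record.
def func(x):
    tmp = dict(x)                         # stand-in for x.asDict()
    tmp['foo'] = tmp.get('foo', 0) * 100  # scale 'foo', defaulting to 0
    return tmp

print(func({'foo': 1, 'lol': 2}))  # {'foo': 100, 'lol': 2}
```

Because the dict copies every key before the update, fields the UDF does not touch (here lol) survive the rebuild, unlike the f.struct approach where each kept field must be listed by hand.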

P.S.

Spark does not support in-place operations, so whenever you want to modify a column "in place", you actually have to replace it with a rebuilt one.

