apache spark - How to get this using Scala
**df1**

| ticket_id | date |
|-----------|------|
| 120       | d    |
| 120       | e    |
| 125       | f    |

**df2**

| date |
|------|
| a    |
| b    |
| c    |
| d    |
| e    |
| f    |
| g    |
| h    |

**output_df**

| ticket_id | date | date |
|-----------|------|------|
| 120       | null | a    |
| 120       | null | b    |
| 120       | null | c    |
| 120       | d    | d    |
| 120       | e    | e    |
| 120       | null | f    |
| 120       | null | g    |
| 120       | null | h    |
| 125       | null | a    |
| 125       | null | b    |
| 125       | null | c    |
| 125       | null | d    |
| 125       | null | e    |
| 125       | f    | f    |
| 125       | null | g    |
| 125       | null | h    |
From dataframes 1 and 2 I need the final output dataframe shown above, in spark-shell. The values a, b, c, d, e, f stand for dates in yyyy-MM-dd format, and 120, 125 are values of a ticket_id column; there are thousands of ticket_id's, and only a small sample is extracted here.
Build a full join of the distinct ids with all possible dates, then left-join the original dataframe onto it:
```scala
import org.apache.spark.sql.functions.col
import hiveContext.implicits._

val df1Data = List((120, "d"), (120, "e"), (125, "f"))
val df2Data = List("a", "b", "c", "d", "e", "f", "g", "h")

val df1 = sparkContext.parallelize(df1Data).toDF("id", "date")
val df2 = sparkContext.parallelize(df2Data).toDF("date")

// distinct ids: 120, 125
val uniqueIdDF = df1.select(col("id")).distinct()

// cartesian product: every id paired with every date from df2
val fullJoin = uniqueIdDF.join(df2)

// left-join df1 back to recover the dates that actually match
val result = fullJoin.as("full").join(
  df1.as("df1"),
  col("full.id") === col("df1.id") && col("full.date") === col("df1.date"),
  "left_outer")

val sorted = result
  .select(col("full.id"), col("df1.date"), col("full.date"))
  .sort(col("full.id"), col("full.date"))

sorted.show(false)
```
Output:

```
+---+----+----+
|id |date|date|
+---+----+----+
|120|null|a   |
|120|null|b   |
|120|null|c   |
|120|d   |d   |
|120|e   |e   |
|120|null|f   |
|120|null|g   |
|120|null|h   |
|125|null|a   |
|125|null|b   |
|125|null|c   |
|125|null|d   |
|125|null|e   |
|125|f   |f   |
|125|null|g   |
|125|null|h   |
+---+----+----+
```
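Note that both columns of the result are named `date`, which can be awkward to reference downstream. An optional extra select can give them distinct names; `matched_date` and `all_dates` below are illustrative names of my own, not part of the answer:

```scala
// Optional: disambiguate the two "date" columns.
// "matched_date" / "all_dates" are illustrative, hypothetical names.
val renamed = result.select(
  col("full.id"),
  col("df1.date").as("matched_date"),
  col("full.date").as("all_dates"))
renamed.show(false)
```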
The final sort only puts the rows in the same order as the expected output; it can be skipped.
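For reference, on Spark 2.x+ the same approach would use a `SparkSession` instead of the deprecated `HiveContext`, and an explicit `crossJoin` (available since Spark 2.1) for the cartesian step. This is a minimal sketch under those assumptions, not a drop-in replacement tested against the original data:

```scala
// Sketch for Spark 2.x+: SparkSession replaces HiveContext/SQLContext.
// In spark-shell a session named `spark` already exists; the builder
// below just returns it (the appName is illustrative).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("ticket-dates").getOrCreate()
import spark.implicits._

val df1 = List((120, "d"), (120, "e"), (125, "f")).toDF("id", "date")
val df2 = List("a", "b", "c", "d", "e", "f", "g", "h").toDF("date")

// explicit crossJoin replaces the bare join used above
val fullJoin = df1.select(col("id")).distinct().crossJoin(df2)

val result = fullJoin.as("full").join(
  df1.as("df1"),
  col("full.id") === col("df1.id") && col("full.date") === col("df1.date"),
  "left_outer")

result
  .select(col("full.id"), col("df1.date"), col("full.date"))
  .sort(col("full.id"), col("full.date"))
  .show(false)
```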