python - PySpark reduce with the key being a tuple and the values nested lists

My problem is the following: I am parsing user interactions, and each time an interaction is detected I emit ((user1,user2),((date1,0),(0,1))). The zeros here indicate the direction of the interaction.

I cannot figure out why I cannot reduce this output with the following reduce function:

def myfunc2(x1, x2):
    return (min(x1[0][0], x2[0][0]), max(x1[0][0], x2[0][0]),
            min(x1[0][1], x2[0][1]), max(x1[0][1], x2[0][1]),
            x1[1][0] + x2[1][0], x1[1][1] + x2[1][1])

The output of the mapper (flatMap(myfunc)) is correct:

((7401899, 5678002), ((1403185440.0, 0), (1, 0)))
((82628194, 22251869), ((0, 1403185452.0), (0, 1)))
((2162276, 98056200), ((1403185451.0, 0), (1, 0)))
((0509420, 4827510), ((1403185449.0, 0), (1, 0)))
((7974923, 9235930), ((1403185450.0, 0), (1, 0)))
((250259, 6876774), ((0, 1403185450.0), (0, 1)))
((642369, 6876774), ((0, 1403185450.0), (0, 1)))
((82628194, 22251869), ((0, 1403185452.0), (0, 1)))
((2162276, 98056200), ((1403185451.0, 0), (1, 0)))

But running

lines.flatMap(myfunc) \
     .map(lambda x: (x[0], x[1])) \
     .reduceByKey(myfunc2)

gives me the error

return (min(x1[0][0],x2[0][0]),max(x1[0][0],x2[0][0]),min(x1[0][1],x2[0][1]),max(x1[0][1],x2[0][1]),x1[1][0]+x2[1][0],x1[1][1]+x2[1][1])

TypeError: 'int' object has no attribute '__getitem__'

I guess I am messing something up in the keys, but I don't know why (I tried to recast the key as a tuple, as suggested here, but got the same error).

Any ideas? Thanks a lot.

Okay, I think the problem here is that you're indexing too deep into items that don't go as deep as you think.

Let's examine myfunc2.

def myfunc2(x1, x2):
    return (min(x1[0][0], x2[0][0]), max(x1[0][0], x2[0][0]),
            min(x1[0][1], x2[0][1]), max(x1[0][1], x2[0][1]),
            x1[1][0] + x2[1][0], x1[1][1] + x2[1][1])

Given your question above, the input data looks like this:

((467401899, 485678002), ((1403185440.0, 0), (1, 0)))

Let's go ahead and assign that data row to a variable.

x = ((467401899, 485678002), ((1403185440.0, 0), (1, 0)))

What happens when we run x[0]? We get (467401899, 485678002). And when we run x[1]? We get ((1403185440.0, 0), (1, 0)). That's what your map statement is doing, I believe.

Okay. That's clear.
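The indexing above can be checked in a plain Python session, no Spark needed. This is only a small sketch using the sample row from the question; the names key, value and error_name are introduced here for illustration.

```python
# One record as emitted by the mapper: (key, value)
x = ((467401899, 485678002), ((1403185440.0, 0), (1, 0)))

key, value = x[0], x[1]

print(key)          # (467401899, 485678002)
print(value)        # ((1403185440.0, 0), (1, 0))
print(value[0][0])  # 1403185440.0, two levels of indexing work on the value

# key[0] is a plain int, so indexing it a second time raises TypeError
try:
    key[0][0]
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

print(error_name)   # TypeError
```

Python 2 words this error as "'int' object has no attribute '__getitem__'", while Python 3 says "'int' object is not subscriptable". It is the same failure in both cases.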

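In the function myfunc2 we have 2 parameters, x1 and x2. One thing to keep in mind: reduceByKey never passes the key to the function. Both parameters are values that share the same key, so on the first call for a key each of them looks like x[1] above: x1 = ((1403185440.0, 0), (1, 0)), and x2 has the same shape.

Now let's examine the first part of the return statement in the function.

min(x1[0][0], x2[0][0])

On that first call, x1[0] is (1403185440.0, 0) and x1[0][0] is 1403185440.0. Cool. But look at what myfunc2 returns: a flat tuple of six numbers, not another ((date1, date2), (count1, count2)) pair. Spark feeds that result back in as x1 or x2 the next time it merges values for the same key, for example when a key occurs a third time or when partial results from two partitions are combined. So now, what's x1[0]? Well, it's a plain number. Wait! What's x1[0][0]? You're trying to get the zeroth index of the item at x1[0], but the item at x1[0] isn't a list or tuple, it's an int (or a float). And objects of <type 'int'> don't have a method called __getitem__.

To summarize: you're digging too deep into objects that are not nested that deeply. Think about what is being passed to myfunc2, and how deep those objects are.

I think myfunc2 should return a value with the same shape as the values it receives, so that its own output can be passed back in. One way is to have the mapper emit the flat six-field tuple up front, so that the return statement only indexes one level deep:

return (min(x1[0], x2[0]), max(x1[1], x2[1]), min(x1[2], x2[2]), max(x1[3], x2[3]), x1[4] + x2[4], x1[5] + x2[5])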
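To see what the reduce function receives when a key has more than two values, the merging can be imitated without a cluster. The sketch below assumes that functools.reduce is a fair stand-in for how reduceByKey folds the values of one key pairwise (Spark may merge in a different order, but it always passes two values, never the key). The third record is hypothetical and not part of the question's data.

```python
from functools import reduce

def myfunc2(x1, x2):
    # Reduce function from the question: it expects nested
    # ((date1, date2), (count1, count2)) values but returns a flat 6-tuple.
    return (min(x1[0][0], x2[0][0]), max(x1[0][0], x2[0][0]),
            min(x1[0][1], x2[0][1]), max(x1[0][1], x2[0][1]),
            x1[1][0] + x2[1][0], x1[1][1] + x2[1][1])

# The two values recorded for key (82628194, 22251869) in the question
two_values = [((0, 1403185452.0), (0, 1)), ((0, 1403185452.0), (0, 1))]

# Hypothetical third value for the same key, added for illustration
three_values = two_values + [((0, 1403185460.0), (0, 1))]

# Two values: a single call, both arguments are nested, so it works
print(reduce(myfunc2, two_values))
# (0, 0, 1403185452.0, 1403185452.0, 0, 2)

# Three values: the second call gets the flat result as x1 and fails
try:
    reduce(myfunc2, three_values)
    failed = False
except TypeError:
    failed = True

print(failed)  # True
```

With two values the function is called once and the nested indexing is valid. With three, the flat result of the first call comes back as x1, x1[0] is the int 0, and x1[0][0] raises the TypeError.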

When I run the following, it works fine:

a = sc.parallelize([
    ((7401899, 5678002), ((1403185440.0, 0), (1, 0))),
    ((82628194, 22251869), ((0, 1403185452.0), (0, 1))),
    ((2162276, 98056200), ((1403185451.0, 0), (1, 0))),
    ((1509420, 4827510), ((1403185449.0, 0), (1, 0))),
    ((7974923, 9235930), ((1403185450.0, 0), (1, 0))),
    ((250259, 6876774), ((0, 1403185450.0), (0, 1))),
    ((642369, 6876774), ((0, 1403185450.0), (0, 1))),
    ((82628194, 22251869), ((0, 1403185452.0), (0, 1))),
    ((2162276, 98056200), ((1403185451.0, 0), (1, 0)))
])

b = a.map(lambda x: (x[0], x[1])).reduceByKey(myfunc2)

b.collect()

[((1509420, 4827510), ((1403185449.0, 0), (1, 0))),
 ((2162276, 98056200), (1403185451.0, 1403185451.0, 0, 0, 2, 0)),
 ((7974923, 9235930), ((1403185450.0, 0), (1, 0))),
 ((7401899, 5678002), ((1403185440.0, 0), (1, 0))),
 ((642369, 6876774), ((0, 1403185450.0), (0, 1))),
 ((82628194, 22251869), (0, 0, 1403185452.0, 1403185452.0, 0, 2)),
 ((250259, 6876774), ((0, 1403185450.0), (0, 1)))]
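In the output above, the keys that occur twice come back as flat six-element tuples, while the keys that occur once keep the nested shape. One possible way to get a single shape for every key is to convert each value to the flat layout first and reduce with a function whose output has the same shape as its inputs. This is a minimal sketch, not the original poster's code: to_summary and merge_summaries are hypothetical names, the third record is made up, and functools.reduce again stands in for the per-key merging.

```python
from functools import reduce

def to_summary(value):
    # Hypothetical helper: reshape ((date1, date2), (count1, count2))
    # into the flat (min1, max1, min2, max2, count1, count2) layout.
    (date1, date2), (count1, count2) = value
    return (date1, date1, date2, date2, count1, count2)

def merge_summaries(s1, s2):
    # Input and output share one shape, so a result can be merged again
    return (min(s1[0], s2[0]), max(s1[1], s2[1]),
            min(s1[2], s2[2]), max(s1[3], s2[3]),
            s1[4] + s2[4], s1[5] + s2[5])

# Two values from the question plus one hypothetical third value
values = [((0, 1403185452.0), (0, 1)),
          ((0, 1403185452.0), (0, 1)),
          ((0, 1403185460.0), (0, 1))]

result = reduce(merge_summaries, map(to_summary, values))
print(result)
# (0, 0, 1403185452.0, 1403185460.0, 0, 3)
```

In PySpark this would presumably become lines.flatMap(myfunc).mapValues(to_summary).reduceByKey(merge_summaries), with mapValues and reduceByKey being the standard RDD methods. That pipeline is untested here.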
