list - Python: select lines based on the maximum value of a column

- July 15, 2010

I'm not familiar with Python, but there's something I need to do. I have an ASCII file (space-separated) of several columns. In the first column, some values are duplicates. For these duplicate values, I need to select the lines that have the larger value in, for example, the 3rd column, and return an array back. I'd like something like this:

#col1    col2    col3    col4    col5
1        1       2       3       4
1        2       1       5       3
2        2       5       2       1

This would return lines 1 and 3. Here's what I have so far. I defined an auxiliary function to detect the indexes of duplicates (all second entries):

def list_duplicates(seq):
    seen = set()
    seen_add = seen.add
    # seen_add() returns None, so an index is kept only when item was already seen
    return [idx for idx, item in enumerate(seq) if item in seen or seen_add(item)]

And then I try to use it on the list I read (which I loaded from the file with np.genfromtxt, naming each column):

def select_high(ndarray, dup_col, sel_col):
    # dup_col is the column where the duplicates are;
    # sel_col is the column used to select the larger value
    result = []
    dup = list_duplicates(ndarray[dup_col])
    dupdup = [x-1 for x in dup]
    for i in range(len(ndarray[sel_col])):
        if i in dup:
            mid = []
            maxi = max(ndarray[sel_col][i], ndarray[sel_col][i-1])
            maxi_index = np.where(ndarray[sel_col] == maxi)[0][0]
            for name in ndarray.dtype.names:
                mid.append(ndarray[name][maxi_index])
            result.append(mid)
        else:
            mid = []
            if not i in dupdup:
                for name in ndarray.dtype.names:
                    mid.append(ndarray[name][i])
            result.append(mid)

    return np.asarray(result)

But what's happening is that whenever there are duplicates, I have to remove the else part or it gives me an error, and whenever there are no duplicates, I have to put it back. Any help is appreciated. Sorry for the long post, and I hope I made myself clear.

I think you got lost in the details (and me too). Here is a version that does what you want and is simpler:

m = [[1, 2, 1, 5, 3], [1, 1, 2, 3, 4], [2, 2, 5, 2, 1]]
s = sorted(m, key=lambda r: (r[0], -r[2]))
print(s)

seen = set()
print([r for r in s if r[0] not in seen and not seen.add(r[0])])

The first line defines m as a list of the rows of the file.

The second line sorts the rows on the value in the first column (r[0]), then on the value in the third column, from larger to smaller (-r[2]):

s = [[1, 1, 2, 3, 4], [1, 2, 1, 5, 3], [2, 2, 5, 2, 1]]
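As an aside, negating r[2] only works when the third column is numeric. Here is a minimal sketch of an equivalent ordering that relies on Python's sort being stable instead. The names by_value and s2 are made up for this illustration, and m is the same list as above:

```python
from operator import itemgetter

m = [[1, 2, 1, 5, 3], [1, 1, 2, 3, 4], [2, 2, 5, 2, 1]]

# Sort on the secondary key first: third column, larger to smaller.
by_value = sorted(m, key=itemgetter(2), reverse=True)

# Then sort on the primary key (first column). Because sorted() is stable,
# rows sharing the same first column keep their descending third-column order.
s2 = sorted(by_value, key=itemgetter(0))
print(s2)
```

For this data, s2 has the same order as s.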

Now you need to skip rows whose first-column value you have already seen at least once. I use a set, seen, to store the r[0] values that have been seen. If r[0] is not in seen, you should keep the row and put r[0] in seen, so that the row is discarded the next time the same r[0] comes up. That's a little tricky:

if r[0] not in seen and not seen.add(r[0])

Note that not seen.add(r[0]) is always True, because seen.add returns None. Thus:

If r[0] is not in seen, the test puts r[0] in seen and keeps the row.
If r[0] is in seen, the test returns False and discards the row.
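To see the two cases in isolation, here is a small sketch that wraps the condition in a helper. The function name keep and the sample keys are hypothetical and only serve the demonstration:

```python
seen = set()

def keep(key):
    # "and" short-circuits: seen.add(key) only runs when key is new.
    # seen.add() returns None, so "not seen.add(key)" is always True.
    return key not in seen and not seen.add(key)

results = [keep(k) for k in [1, 1, 2, 1, 2, 3]]
print(results)  # [True, False, True, False, False, True]
```

Only the first occurrence of each key gives True, which is why the rows have to be sorted beforehand so that the wanted row comes first.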

You could express it this way too:

if not (r[0] in seen or seen.add(r[0]))
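Since the question loads the file into a NumPy array with named columns, the same idea (sort, then keep the first row per key) can also be sketched directly in NumPy. This is only an illustration under some assumptions: the array below is built by hand as a stand-in for the one returned by np.genfromtxt, the field names col1 to col5 are assumed, and the selection column is assumed to hold signed numbers so that it can be negated.

```python
import numpy as np

# Stand-in for the structured array loaded from the file (field names assumed).
names = ("col1", "col2", "col3", "col4", "col5")
data = np.array(
    [(1, 1, 2, 3, 4), (1, 2, 1, 5, 3), (2, 2, 5, 2, 1)],
    dtype=[(name, int) for name in names],
)

def select_high(arr, dup_col, sel_col):
    # np.lexsort uses the LAST key as the primary one: sort by dup_col,
    # breaking ties by descending sel_col (hence the negation).
    order = np.lexsort((-arr[sel_col], arr[dup_col]))
    sorted_arr = arr[order]
    # return_index gives the first occurrence of each distinct key, which
    # after the sort is the row with the largest sel_col value for that key.
    _, first = np.unique(sorted_arr[dup_col], return_index=True)
    return sorted_arr[first]

result = select_high(data, "col1", "col3")
print(result)
```

Note that the rows come back ordered by the value in dup_col, not in their original file order.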
