Python: select lines based on the maximum value of a column
I'm not familiar with Python, but there's something I need to do. I have an ASCII file (space-separated) with several columns. In the first column, some values are duplicated. For these duplicate values, I need to select the lines that have the larger value in the 3rd column, for example, and return an array back. I'd like something like this:
#col1 col2 col3 col4 col5
1 1 2 3 4
1 2 1 5 3
2 2 5 2 1
This would return lines 1 and 3. Here's what I have so far. I defined an auxiliary function to detect the indexes of duplicates (all the second entries):
def list_duplicates(seq):
    seen = set()
    seen_add = seen.add
    # seen_add returns None, so an index is kept only when the item was already seen
    return [idx for idx, item in enumerate(seq) if item in seen or seen_add(item)]
Then I try to use it on the list that I read (loaded from the file with np.genfromtxt, naming each column):
import numpy as np

def select_high(ndarray, dup_col, sel_col):
    # dup_col is the column where the duplicates are;
    # sel_col is the column used to select the larger value
    result = []
    dup = list_duplicates(ndarray[dup_col])
    dupdup = [x - 1 for x in dup]
    for i in range(len(ndarray[sel_col])):
        if i in dup:
            mid = []
            maxi = max(ndarray[sel_col][i], ndarray[sel_col][i - 1])
            maxi_index = np.where(ndarray[sel_col] == maxi)[0][0]
            for name in ndarray.dtype.names:
                mid.append(ndarray[name][maxi_index])
            result.append(mid)
        else:
            mid = []
            if i not in dupdup:
                for name in ndarray.dtype.names:
                    mid.append(ndarray[name][i])
                result.append(mid)
    return np.asarray(result)
But what's happening is that whenever there are duplicates I have to remove the else part or it gives me an error, and whenever there are no duplicates I have to put it back. Any help is appreciated. Sorry for the long post, and I hope I made myself clear.
I think you are lost in the details (and so am I). Here is a version that does what you want, but is simpler:
m = [[1, 2, 1, 5, 3], [1, 1, 2, 3, 4], [2, 2, 5, 2, 1]]
s = sorted(m, key=lambda r: (r[0], -r[2]))
print(s)
seen = set()
print([r for r in s if r[0] not in seen and not seen.add(r[0])])
The first line defines m as the list of rows from the file.
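As a rough sketch of how m might be built from the file, assuming integer values, whitespace-separated columns, and a header line starting with #. The helper name read_rows and the io.StringIO stand-in for the real file are assumptions made for illustration; they are not part of the original answer:

```python
import io

def read_rows(handle):
    """Parse whitespace-separated integer rows, skipping blank and '#' lines."""
    rows = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip the header comment and empty lines
        rows.append([int(field) for field in line.split()])
    return rows

# io.StringIO stands in for an opened file; the contents mirror the question's example.
sample = io.StringIO("#col1 col2 col3 col4 col5\n1 1 2 3 4\n1 2 1 5 3\n2 2 5 2 1\n")
m = read_rows(sample)
print(m)
```

With a real file, the same helper could be called inside a with open(...) block. If the columns hold floats, int would need to be swapped for float.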
The second line sorts the rows on the value in the first column (r[0]), then on the value in the third column, from larger to smaller (-r[2]):
s = [[1, 1, 2, 3, 4], [1, 2, 1, 5, 3], [2, 2, 5, 2, 1]]
Now you need to skip a row when you have already seen its first-column value at least once. Use a set, seen, to store the r[0] values you have already seen. If r[0] is not in seen, you should keep the row and put r[0] in seen, so that you discard the row the next time you see r[0]. That's a little tricky:
if r[0] not in seen and not seen.add(r[0])
Note that not seen.add(r[0]) is always True, because seen.add returns None. Thus:
If r[0] is not in seen, put r[0] in seen and keep the row.
If r[0] is in seen, return False and discard the row.
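The two cases above can also be spelled out as an explicit loop, which may be easier to follow than the one-line condition. This is only a sketch of the same filtering step on the sorted list s; the variable name kept is an assumption for illustration:

```python
# Rows already sorted by first column, then by third column descending.
s = [[1, 1, 2, 3, 4], [1, 2, 1, 5, 3], [2, 2, 5, 2, 1]]

seen = set()
kept = []
for r in s:
    if r[0] in seen:
        continue        # a row with this first-column value was already kept
    seen.add(r[0])      # set.add mutates the set and returns None
    kept.append(r)      # first row for this key has the largest third column

print(kept)
```

Because the rows are sorted with the largest third-column value first within each group, keeping only the first row per key selects the wanted line.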
You can express it this way too:
if not (r[0] in seen or seen.add(r[0]))
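For reuse, the whole approach could be wrapped in a small function. The following is a minimal sketch, assuming the rows are lists of numbers (the sort key negates the selection column); the name keep_largest and its parameters dup_col and sel_col are hypothetical and chosen only for this example:

```python
def keep_largest(rows, dup_col=0, sel_col=2):
    """For each distinct value in dup_col, keep the row with the largest sel_col."""
    # Sort by the duplicate column, then by the selection column, largest first.
    ordered = sorted(rows, key=lambda r: (r[dup_col], -r[sel_col]))
    seen = set()
    # seen.add returns None, so the condition is True only the first time a key appears.
    return [r for r in ordered if not (r[dup_col] in seen or seen.add(r[dup_col]))]

m = [[1, 2, 1, 5, 3], [1, 1, 2, 3, 4], [2, 2, 5, 2, 1]]
result = keep_largest(m)
print(result)
```

Note that the result comes back in sorted order rather than in the file's original order, and that ties in the selection column are resolved by whichever row the stable sort places first.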