What is the best way to take np.percentile along an axis ignoring nans?

Is there a reasonably fast way to do np.percentile(ndarr, axis=0) on data containing NaN values?

For np.median, there is the corresponding bottleneck.nanmedian (https://pypi.python.org/pypi/Bottleneck) that is pretty good.

The best I've come up with for percentile, which is incomplete and presently incorrect, is:

from bottleneck import nanrankdata, nanmax, nanargmin
def nanpercentile(x, q, axis):
    ranks = nanrankdata(x, axis=axis)
    peak = nanmax(ranks, axis=axis)
    pct = ranks/peak / 100. # to make a percentile
    wh = nanargmin(abs(pct-q),axis=axis)
    return x[wh]

This doesn't work; really what is needed is some way to take the n'th element along the axis, but I haven't found the numpy slicing trick to do that.

"Reasonably fast" means better than looping over indices, e.g.:

import numpy as np

q = 40
x = np.array([[[1,2,3],[6,np.nan,4]],[[0.5,2,1],[9,3,np.nan]]])
out = np.empty(x.shape[:-1])
# loop over every (i, j) position and take the percentile of the finite values
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        d = x[i,j,:]
        out[i,j] = np.percentile(d[np.isfinite(d)], q)

print(out)

#array([[ 1.8,  4.8],
#       [ 0.9,  5.4]])

which works but can be exceedingly slow.

np.ma appears not to work as expected; it treats the nan value as if it were inf:

xm = np.ma.masked_where(np.isnan(x), x)  # masked array with the NaNs masked
print(np.percentile(xm, 40, axis=2))

# array([[ 1.8,  5.6],
#        [ 0.9,  7.8]])

Comments

np.nanpercentile is included in NumPy as of version 1.9.0:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.nanpercentile.html
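
As a minimal sketch, assuming NumPy 1.9.0 or later is installed, the example array from the question can be passed straight to np.nanpercentile, which should reproduce the result of the explicit loop:

```python
import numpy as np

x = np.array([[[1, 2, 3], [6, np.nan, 4]],
              [[0.5, 2, 1], [9, 3, np.nan]]])

# 40th percentile along the last axis; NaN entries are skipped
out = np.nanpercentile(x, 40, axis=2)
print(out)
```

The axis argument works the same way as in np.percentile, so axis=0 from the original question is handled too.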

You can manipulate the strides of the array to iterate over it faster, using as_strided() which is found in numpy.lib.stride_tricks.

Your computations can be viewed as operating on (1,1,3) windows of your array. I like to use a generalized function, sliding_window(), that creates n-dimensional windows using as_strided(). I found it in the blog post "Efficient Overlapping Windows with Numpy"; credit for the function apparently goes to johnvinyard. That blog page gives a pretty good description of what is happening.

Make some 1x1x3 windows:

import numpy as np
x = np.array([[[1,2,3],[6,np.nan,4]],[[0.5,2,1],[9,3,np.nan]]])
for thing in sliding_window(x, (1,1,3)):
    print(thing)

# [ 1.  2.  3.]
# [  6.  nan   4.]
# [ 0.5  2.   1. ]
# [  9.   3.  nan]

Apply np.percentile(), disregarding the NaNs:

for thing in sliding_window(x, (1,1,3)):
    print(np.percentile(thing[np.isfinite(thing)], 40))

# 1.8
# 4.8
# 0.9
# 5.4

Make an array of the result:

per_s = [np.percentile(thing[np.isfinite(thing)], 40)
         for thing in sliding_window(x, (1,1,3))]

print(per_s)
# [1.8, 4.8000000000000007, 0.90000000000000002, 5.4000000000000004]

per_s = np.array(per_s)
print(per_s)
# array([ 1.8,  4.8,  0.9,  5.4])

Get it back to the shape you expect:

print(per_s.reshape((2,2)))
# array([[ 1.8,  4.8],
#        [ 0.9,  5.4]])

print(per_s.reshape(x.shape[:-1]))
# array([[ 1.8,  4.8],
#        [ 0.9,  5.4]])

This should be faster, though I'm curious whether it actually will be; I don't have any real-world problems to test it on.

A google search of numpy as_strided turns up some good results: I have this one bookmarked, http://scipy-lectures.github.io/advanced/advanced_numpy/
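
For comparison, newer NumPy releases ship a built-in windowing helper, so a hand-written as_strided() wrapper may not be needed. The sketch below assumes NumPy 1.20 or later (where numpy.lib.stride_tricks.sliding_window_view was added) and uses np.nanpercentile for the reduction instead of the Python loop; it is an illustration, not the approach from the blog post:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # assumes NumPy >= 1.20

x = np.array([[[1, 2, 3], [6, np.nan, 4]],
              [[0.5, 2, 1], [9, 3, np.nan]]])

# Length-3 windows along the last axis; the window axis is appended at the end,
# so the result is a read-only view of shape (2, 2, 1, 3).
windows = sliding_window_view(x, 3, axis=-1)

# Reduce over the window axis while skipping NaNs, then drop the length-1 axis.
out = np.nanpercentile(windows, 40, axis=-1).reshape(x.shape[:-1])
print(out)
```

Because the window covers the whole last axis here, the view adds nothing over calling np.nanpercentile on x directly; it only becomes useful when the windows are smaller than the axis.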

sliding_window() from Efficient Overlapping Windows with Numpy:

import numpy as np
from numpy.lib.stride_tricks import as_strided as ast
from itertools import product

def norm_shape(shape):
    '''
    Normalize numpy array shapes so they're always expressed as a tuple,
    even for one-dimensional shapes.

    Parameters
        shape - an int, or a tuple of ints

    Returns
        a shape tuple
    '''
    try:
        i = int(shape)
        return (i,)
    except TypeError:
        # shape was not a number
        pass

    try:
        t = tuple(shape)
        return t
    except TypeError:
        # shape was not iterable
        pass

    raise TypeError('shape must be an int, or a tuple of ints')

def sliding_window(a,ws,ss = None,flatten = True):
    '''
    Return a sliding window over a in any number of dimensions

    Parameters:
        a  - an n-dimensional numpy array
        ws - an int (a is 1D) or tuple (a is 2D or greater) representing the size
             of each dimension of the window
        ss - an int (a is 1D) or tuple (a is 2D or greater) representing the
             amount to slide the window in each dimension. If not specified, it
             defaults to ws.
        flatten - if True, all slices are flattened, otherwise, there is an
                  extra dimension for each dimension of the input.

    Returns
        an array containing each n-dimensional window from a
    '''

    if None is ss:
        # ss was not provided. the windows will not overlap in any direction.
        ss = ws
    ws = norm_shape(ws)
    ss = norm_shape(ss)

    # convert ws, ss, and a.shape to numpy arrays so that we can do math in every
    # dimension at once.
    ws = np.array(ws)
    ss = np.array(ss)
    shape = np.array(a.shape)

    # ensure that ws, ss, and a.shape all have the same number of dimensions
    ls = [len(shape),len(ws),len(ss)]
    if 1 != len(set(ls)):
        raise ValueError(
            'a.shape, ws and ss must all have the same length. They were %s' % str(ls))

    # ensure that ws is smaller than a in every dimension
    if np.any(ws > shape):
        raise ValueError('ws cannot be larger than a in any dimension. a.shape was %s and ws was %s' % (str(a.shape),str(ws)))

    # how many slices will there be in each dimension?
    newshape = norm_shape(((shape - ws) // ss) + 1)
    # the shape of the strided array will be the number of slices in each dimension
    # plus the shape of the window (tuple addition)
    newshape += norm_shape(ws)
    # the strides tuple will be the array's strides multiplied by step size, plus
    # the array's strides (tuple addition)
    newstrides = norm_shape(np.array(a.strides) * ss) + a.strides
    strided = ast(a,shape = newshape,strides = newstrides)
    if not flatten:
        return strided

    # Collapse strided so that it has one more dimension than the window.  I.e.,
    # the new array is a flat list of slices.
    meat = len(ws) if ws.shape else 0
    # np.prod replaces np.product, which was removed in NumPy 2.0
    firstdim = (np.prod(newshape[:-meat]),) if ws.shape else ()
    dim = firstdim + (newshape[-meat:])
    # remove any dimensions with size 1
    #dim = filter(lambda i : i != 1,dim)
    dim = tuple(thing for thing in dim if thing != 1)
    return strided.reshape(dim)

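If you don't need a super fast solution, you could first convert your array to a pandas DataFrame, call quantile() on it, and then convert the result back to a NumPy array.

import numpy as np
import pandas as pd
df = pd.DataFrame(array.T).quantile()  # `array` is your 2-D input; q defaults to 0.5
arr = np.array(df)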

You can use partition() in NumPy 1.8 to take the n'th element along the axis. Here is the code to get the second-smallest elements along the last axis:

import numpy as np
x = np.array([[[1,2,3],[6,np.nan,4]],[[0.5,2,1],[9,3,np.nan]]])
np.partition(x, 1)[..., 1]

the output:

array([[ 2.,  6.],
[ 1.,  9.]])
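
The same "pick the n'th element along the axis" idea can be extended to a NaN-aware percentile. The sketch below is only an illustration: nanpercentile_sorted is a hypothetical helper name, it uses a full np.sort rather than partition() for simplicity, and it assumes NumPy 1.15 or later for np.take_along_axis. Sorting pushes NaNs to the end of each slice, so only the first n values (the non-NaN ones) need to be indexed:

```python
import numpy as np

def nanpercentile_sorted(x, q, axis=-1):
    """Hypothetical helper: q-th percentile along `axis`, ignoring NaNs.

    Interpolates linearly between the two nearest ranks.
    """
    x = np.asarray(x, dtype=float)
    # np.sort places NaNs at the end of each slice along the axis.
    s = np.sort(x, axis=axis)
    # Count of non-NaN values per slice, kept broadcastable against s.
    n = np.sum(~np.isnan(x), axis=axis, keepdims=True)
    # Fractional rank of the percentile among the valid values.
    rank = (q / 100.0) * (n - 1)
    lo = np.floor(rank).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    # All-NaN slices (n == 0) would give negative indices; clamp them to 0,
    # which points at a NaN and therefore yields NaN for that slice.
    lo = np.clip(lo, 0, None)
    hi = np.clip(hi, 0, None)
    # Pick a different index for every slice along the axis.
    lo_val = np.take_along_axis(s, lo, axis=axis)
    hi_val = np.take_along_axis(s, hi, axis=axis)
    out = lo_val + (rank - lo) * (hi_val - lo_val)
    return np.squeeze(out, axis=axis)

x = np.array([[[1, 2, 3], [6, np.nan, 4]],
              [[0.5, 2, 1], [9, 3, np.nan]]])
result = nanpercentile_sorted(x, 40, axis=2)
print(result)  # approximately [[1.8, 4.8], [0.9, 5.4]]
```

On NumPy 1.9 and later, np.nanpercentile is the simpler choice; this version mainly shows how per-slice indexing along an axis can be done without a Python loop.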