# What the quantile really is?

**Publish date:**Aug 23, 2019

## Intro

For few years when I was practicing data science I've used quantiles a lot. I knew what it is and when one should use it. I knew mathematical definition and intuition around quantiles but if someone would ask me to implement it from scratch I wouldn't know how to do it right away.

So what is a quantile? It's a one of most commonly used descriptive statistic. It gives us aggregated information about sample's distribution. For example let's say we have data sample representing salary of employees of some company. Quantile can gives us answer on question "How much should I be paid to earn more then 90% of employees?". Let's assume we have a sample

\[ S = \lbrace 140, 80, 70, 200, 100 \rbrace \]

How would you answer the question for this sample? Probably by sorting salaries

\[ 70, 80, 100, 140, 200 \]

and the answer would be somewhere between \(140\) and \(200\). But where?

## A bit of maths

Formally if \(X\) is random variable with distribution \(P\) and \(p \in \lbrack
0, 1 \rbrack\), then **p-quantile** is value \(x_p\) of X which satisfy the
following two inequalities

\[ P(X \le x_p) \ge p \\ P(X \ge x_p) \ge 1 - p \]

But in real-world (statistics) we often don't know the real form of CDF of given sample. More interesting question is "how quantile is calculated?". At first it might not be obvious how one can implement quantile calculation just by looking at those formulas.

## How it's calculated?

Statistics gives us an *estimator* for calculating quantile. This
estimator also use order statistics
of a sample. General form of p-quantile estimator \(\hat{Q}(p)\) which is used
in statistical software is as follows

\[ \hat{Q}(p) = (1 - \gamma(p))X_{(j)} + \gamma(p) X_{(j+1)} \]

where \(X_{(j)}\) is j-th order statistics of sample \(x\), \(0 \le \gamma(p) \le 1\) and

\[ \frac{j - m}{n} \le p \le \frac{j - m - 1}{n} \]

where \(n\) is sample size.

**But where exactly?** As it turns out there is not a good answer. Obviously we
need some kind of interpolation with property \(Q(p) \ge Q(q)\) if \(p \ge q\).
This interpolation is mainly needed for small samples. If in our case sample had
size of \(100\) then \(0.90\) quantile would be 90-th sorted element of the
sample but in our five elements sample we cannot give specific answer without
interpolation.
In paper *Sample Quantiles in Statistical
Packages*
by Rob T. Hyndman and Yanan Fan there is summary of nine methods of interpolation
defined in various statistical software.

Here we focus only on one specific method - default method from **R**
programming language, from function `quantile`

- `type = 7`

. In this case
\(j = \lfloor p(n - 1) + 1 \rfloor\) and function \(\gamma(p)\) is of the form

\[ \gamma(p) = p(n - 1) + 1 - \lfloor p(n - 1) + 1 \rfloor \]

Now we can calculate 0.90-quantile of our five-element salaries sample - \(n = 5\) and \(p = 0.9\). As we thought the answer lay somewhere between \(140\) and \(200\) now it is confirmed by above formulas - \(j = \lfloor 0.9 \cdot 4 + 1 \rfloor = \lfloor 4.6 \rfloor = 4\). Furthermore \( \gamma(0.9) = 4.6 - 4 = 0.6\). Finally we can compute

\[ \hat{Q}(0.9) = (1 - \gamma(0.9))X_{(4)} + \gamma(0.9)X_{(5)} = \\ (1-0.6) \cdot 140 + 0.6 \cdot 200 = 176.0 \]

So using default R method for calculating quantile we should expected final result as \(176\).

`quantile(x = c(70, 80, 100, 140, 200), probs = 0.90)`

Figure `1`

presents how function \(\gamma(p)\) interpolates between sample points.
One can observe that interpolation is linear between nodes. It seems like
reasonable method of interpolation.

## Go implementation

Origin of this post came from little task I gave to myself - implementing in go
bootstrap test for mean equality from paper *An Introduction To The Bootstrap*
by Bradley Efron and Robert Tibshirani - revolutionary paper in modern
statistics. For this implementation I've needed quantile. There is
`gonum`

library in go for mathematical and
statistical functionalities where one can find implemented quantile function.
At that point I asked myself a question "what the quantile really is?". I'm
aware it's better to use already existing implementation but it was nice
exercise, especially for someone with statistics or data science background.

So I've recreated quantile calculation in go with the same type of interpolation that is used as default type in R. It can be found here.

## Summary

In my opinion quantile is one of most frequently used statistic from set of descriptive statistics. A sequence of quantiles of a sample gives us quality information about its distribution. I hope now, at the end of this post, you have an idea how this statistic is calculated in details. In particular it is worth to remember that calculated quantile value may doesn't exists in the sample. Also in sample with outlier (e.g. \(\lbrace 0.5, 1.25, 5.32, 3.54, 1321.50 \rbrace\)) p-quantiles for p close to \(1.0\) would be artificial value. It's calculated as interpolation between one sample point and the outlier. In case when value of calculated quantile is "weird" now you know why.

If your software uses some external library to calculates quantiles make sure to know which type of interpolation this implementation is using. In some corner cases it might be helpful.