Data science trick #1: Don’t roll your own basic statistics functions — use these libraries instead!

With data science and machine learning being hot topics these days (and possibly a path to the hottest job at the moment), you may be experimenting with statistics, and in doing so, you may be rolling your own statistics functions. Don’t!

You wouldn’t chop down trees for lumber for a home renovation project; you’d go to Home Depot or a lumber store and get standard cuts of wood. In the same vein, you should make use of ready-made statistics libraries, which are proven, road-tested, and let you focus on what your application actually does.

These are the ones I use:

JavaScript: jStat

If you’re doing stats in JavaScript, you want jStat. It provides not just the basic statistical functions but also all manner of distributions, including Weibull (β), Cauchy, Poisson, hypergeometric, and beta. The distributions come with probability density functions (pdf), cumulative distribution functions (cdf), inverse, mean, mode, variance, and a sample function, allowing for more complex calculations.

jStat is contained in a single file: jstat.js; there’s also the minified version, jstat.min.js.

You can also get the most up-to-date version from jsDelivr’s content delivery network at http://cdn.jsdelivr.net/npm/jstat@latest/dist/jstat.min.js

To install it via npm, just do this on the command line…

npm install --save jstat

…and if you’re loading it while in Node, reference the child object. Here’s a session in Node:

var jStat = require('jstat').jStat

// Now we can use jStat!
let data = [20, 30, 30, 40, 40, 40]
jStat.mean(data)   // 33.333333333333336
jStat.median(data) // 35
jStat.mode(data)   // 40
jStat.stdev(data)  // 7.453559924999299

Python: Python’s statistics library

For more in-depth statistics functions, you’ll want to go with SciPy. For the basics, namely measures of central location (mean, median, mode, and so on) and measures of spread (variance and standard deviation), you might just want to use Python’s statistics library, which was introduced in Python 3.4.

To use it, import it first, and then you’re good to go! Here’s a session in the Python REPL:

import statistics

# Now we can use statistics!
data = [20, 30, 30, 40, 40, 40]
statistics.mean(data)   # 33.333333333333336
statistics.median(data) # 35.0
statistics.mode(data)   # 40
statistics.stdev(data)  # 8.16496580927726

# Wait, why’s this different from the JavaScript result?
# That’s because in Python’s statistics library,
# stdev() is the *sample* standard deviation,
# while jStat’s stdev() is the *population* standard deviation.
# To get the population standard deviation in Python’s statistics,
# use pstdev().
statistics.pstdev(data) # 7.453559924999299
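
If the sample-versus-population distinction feels abstract, here’s a minimal sketch that computes both values by hand and checks them against the library. This is only to show what stdev() and pstdev() are doing under the hood; it isn’t a suggestion to ship your own version. The variable names (squared_deviations, population_sd, sample_sd) are mine, not part of any library.

```python
import math
import statistics

data = [20, 30, 30, 40, 40, 40]

mean = sum(data) / len(data)

# Sum of squared deviations from the mean; both formulas start here.
squared_deviations = sum((x - mean) ** 2 for x in data)

# Population standard deviation divides by n.
population_sd = math.sqrt(squared_deviations / len(data))

# Sample standard deviation divides by n - 1 (Bessel's correction).
sample_sd = math.sqrt(squared_deviations / (len(data) - 1))

# The hand-rolled values agree with the library, up to floating-point rounding.
assert math.isclose(population_sd, statistics.pstdev(data))
assert math.isclose(sample_sd, statistics.stdev(data))

# The same split applies to variance: pvariance() versus variance().
assert math.isclose(statistics.pvariance(data), squared_deviations / len(data))
assert math.isclose(statistics.variance(data), squared_deviations / (len(data) - 1))
```

The only difference between the two is the divisor, which is why the sample figure is always a little larger than the population figure for the same data.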

Swift: SigmaSwiftStatistics

iOS, macOS, watchOS, tvOS, and server-side Swift developers can add statistical goodness to their projects with SigmaSwiftStatistics.

You can add it to your project in a number of ways:

  1. Adding the SigmaDistrib.swift file to your project.
  2. Using Carthage.
  3. Using CocoaPods.
  4. Using Swift Package Manager.

Here it is in action, in a Swift playground:

let data: [Double] = [20, 30, 30, 40, 40, 40]
Sigma.average(data) // 33.333333333333336
Sigma.median(data)  // 35

// Oddly enough, there’s no mode function.

Sigma.standardDeviationPopulation(data) // 7.453559924999299
Sigma.standardDeviationSample(data)     // 8.164965809277259

Kotlin: Kotlin Statistics

If Kotlin’s your jam and you want to do stats, you want Kotlin Statistics.

I use Kotlin primarily in Android Studio, so I use Gradle to include it:

dependencies {
    implementation 'org.nield:kotlin-statistics:1.0.0'
}

Here it is, inside an Android app written in Kotlin:

val data = sequenceOf(20, 30, 30, 40, 40, 40)
val average = data.average() // 33.333333333333336
val median = data.median()   // 35.0
val mode = data.mode()       // 40
val standardDeviation = data.standardDeviation() // 7.453559924999299