Data Analysis in Ruby

At my job, I started work on a client project that was rather number intensive. We were going to have to perform repetitive calculations over a dataset. A colleague introduced me to the concept of data frames and the Ruby gem Daru.

### What is Daru?

It’s a data analysis tool that lets us build tabular data sets, which we can then manipulate and run calculations over. It’s also a data visualization tool. It’s worth noting that Daru is quite similar in functionality to the Python library pandas.

You can do a lot with Daru, but I’ll just be looking at a very small piece of it here. See the Resources section below for a link to the docs.

### Vectors and DataFrames

#### Vector

A Vector is a one-dimensional set of data, like an array. When I was working with Daru, I liked to think of a Vector as a single column from a spreadsheet. As an example, a Vector could hold a mocked 7-day price history of some imaginary product. Note that a Vector can be named. Also, note that it gets a numeric index by default.

#### DataFrame

A DataFrame, on the other hand, is two-dimensional, like a spreadsheet. Expanding on the example above, we can include a date column. We instantiate the DataFrame with a hash of arrays. Each key of the hash is a column name and the array contains the data. This is one of several ways to compose a DataFrame. Another way is to use Vectors instead of arrays, in which case the Vector indices are lined up with each other.

### So I have a DataFrame…Now what?

We can perform analysis on the DataFrame, like finding the mean, counts, min, and max. We can also find the covariance and correlation between Vectors. It’s also possible to perform SQL-like queries against the data. There is also filtering and sorting…the list goes on and on. See the documentation for the details. Here, I’ll show a couple of examples that I found useful.

#### Add a rolling mean column

We can add a column to the DataFrame that holds a rolling mean of the price column with a lookback of 7.

#### Do some arithmetic

What if we want to calculate the price difference with a lookback of 7? One way we could solve it is with the `lag` method.

#### Join DataFrames

Let’s say we want to compare the prices of two products. We could join two DataFrames together à la SQL.

### Conclusion

Daru is a powerful data analysis tool, and I have barely scratched the surface of it. Some other things of note:

• Creating DataFrames from CSVs or Excel files
• Grouping and aggregating data
• Graceful handling of missing data (nils)
• Pivot tables
• Data visualization

There is much more to this library than what I have shown, so I encourage the reader to explore more. I have provided some links below to get started.