At my job, I started work on a client project that was rather number-intensive. We were going to have to perform repetitive calculations over a dataset. A colleague introduced me to the concept of data frames and the Ruby gem Daru.
What is Daru?
It’s a data analysis tool that lets us build tabular data sets, which we can then manipulate and apply linear calculations to. It’s also a data visualization tool. It’s worth noting that Daru is quite similar in functionality to the pandas Python library.
You can do a lot with Daru, but I’ll just be looking at a very small piece of it here. See the Resources section below for a link to the docs.
Vectors and DataFrames
Vector
A Vector is a one-dimensional set of data, like an array. When I was working with Daru, I liked to think of a Vector as a single column from a spreadsheet. In this example, I am mocking a 7-day price history of some imaginary product. Note that the Vector can be named. Also, note the numeric index.
DataFrame
A DataFrame, on the other hand, is two-dimensional, like a spreadsheet. Expanding
on the example above, we can include a date column. We instantiate the
DataFrame with a hash of arrays. Each key of the hash is a column name and
the array contains that column’s data. This is one of several ways to compose a
DataFrame. Another way is to use Vectors instead of arrays, in which case the
Vectors’ indices are lined up with each other.
So I have a DataFrame…Now what?
We can perform analysis on the DataFrame, like finding the mean, counts, min,
and max. We can also find covariance and correlation between Vectors. It’s
also possible to perform SQL-like queries against the data. There is also
filtering and sorting…the list goes on and on. See the documentation for the
details. Here, I’ll show a couple of examples that I found useful.
Add a rolling mean column
Here is an example of adding a column to the DataFrame that is a rolling mean
calculation on the price column with a lookback of 7.
Do some arithmetic
What if we want to calculate the price difference with a lookback of 7? Here is
one way we could solve it with the lag method.
Join DataFrames
Let’s say we want to compare the prices of two products. We could join two
DataFrames together à la SQL.
Conclusion
Daru is a powerful data analysis tool, and I have barely scratched its surface.
Some other things of note:
Creating DataFrames from CSVs or Excel files
Grouping and aggregating data
Graceful handling of missing data (nils)
Pivot tables
Data visualization
There is much more to this library than what I have shown, so I encourage the
reader to explore more. I have provided some links below to get started.