An Advanced Data Analytics Library for the Java Virtual Machine

Fully open source and released under the Apache Software License, Morpheus enables the development of high performance analytical applications involving large datasets for both offline and real-time analysis on the JVM.

Introduction

The Morpheus library is designed to facilitate the development of high performance analytical software involving large datasets for both offline and real-time analysis on the Java Virtual Machine (JVM). The library is written in Java 8 with extensive use of lambdas, but is accessible to all JVM languages.

Motivation

At its core, Morpheus provides a versatile two-dimensional memory efficient tabular data structure called a DataFrame, similar to that first popularized in R. Whiled dynamically typed scientific computing languages like R, Python & Matlab are great for doing research, they are not well suited for large scale production systems as they become extremely difficult to maintain, and dangerous to refactor. The Morpheus library attempts to retain the power and versatility of the DataFrame concept, while providing a much more type safe and self describing collection of interfaces, which should make developing, maintaining & scaling code complexity much easier.

Another advantage of the Morpheus library is that it is extremely good at scaling on multi-core processor architectures given the powerful threading capabilities of the JavaVirtual Machine. Many operations on a Morpheus DataFrame can seamlessly be run in parallel by simply calling parallel() on the entity you wish to operate on, much like with Java 8 Streams.Internally, these parallel implementations are based on the Fork & Join framework, and near linear improvements in performance are observed for certain types of operations as CPU cores are added.

Capabilities

A Morpheus DataFrame is a column store structure where each column is represented by a Morpheus Array of which there are many implementations, including dense, sparse and memory mapped versions. Morpheus arrays are optimized and wherever possible are backed by primitive native Java arrays (even for types such as LocalDate, LocalDateTime etc...)as these are far more efficient from a storage, access and garbage collection perspective. Memory mapped Morpheus Arrays, while still experimental, allow very large DataFrames to be created using off-heap storage that are backed by files.

While the complete feature set of the Morpheus DataFrame is still evolving, there are already many powerful APIs to affect complex transformations and analytical operations with ease. There are standard functions to compute summary statistics, perform various types of Linear Regressions, apply Principal Component Analysis(PCA) to mention just a few. The DataFrame is indexed in both the row and column dimension, allowing data to be efficiently sorted, sliced, grouped, and aggregated along either axis.

Morpheus at a Glance

A Simple Example

Consider a dataset of motor vehicle characteristics accessible here.The code below loads this CSV data into a Morpheus DataFrame, filters the rows to only include those vehicles that have a power to weight ratio > 0.1 (where weight is converted into kilograms), then adds a column to record the relative efficiency between highway and city mileage (MPG), sorts the rows by this newly added column in descending order, and finally records this transformed result to a CSV file.


var url = "https://d3x.s3.amazonaws.com/data/public/morpheus/cars93.csv";
DataFrame.read(url).csv(Integer.class, options -> {
    options.setExcludeColumnIndexes(0);
    options.setHeader(true);
}).rows().select(row -> {
    var weightKG = row.getDouble("Weight") * 0.453592d;
    var horsepower = row.getDouble("Horsepower");
    return horsepower / weightKG > 0.1d;
}).cols().add("MPG(Highway/City)", Double.class, v -> {
    var cityMpg = v.row().getDouble("MPG.city");
    var highwayMpg = v.row().getDouble("MPG.highway");
    return highwayMpg / cityMpg;
}).rows().sort(false, "MPG(Highway/City)").write().csv(options -> {
    options.setFile("/Users/witdxav/cars93m.csv");
    options.setTitle("DataFrame");
});

This example demonstrates the functional nature of the Morpheus API, where many method return types are in fact a DataFrame and therefore allow this form of method chaining. In this example, the methods csv(), select(), add(), and sort() all return a frame. In some cases the same frame that the method operates on, or in other cases a filter or shallow copy of the frame being operated on. The first 10 rows of the transformed dataset in this example looks as follows, with the newly added column appearing on the far right of the data frame.

 Index  |  Manufacturer  |     Model      |   Type    |  Min.Price  |   Price   |  Max.Price  |  MPG.city  |  MPG.highway  |       AirBags        |  DriveTrain  |  Cylinders  |  EngineSize  |  Horsepower  |  RPM   |  Rev.per.mile  |  Man.trans.avail  |  Fuel.tank.capacity  |  Passengers  |  Length  |  Wheelbase  |  Width  |  Turn.circle  |  Rear.seat.room  |  Luggage.room  |  Weight  |  Origin   |           Make            |  MPG(Highway/City)  |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     9  |      Cadillac  |       DeVille  |    Large  |    33.0000  |  34.7000  |    36.3000  |        16  |           25  |         Driver only  |       Front  |          8  |      4.9000  |         200  |  4100  |          1510  |               No  |             18.0000  |           6  |     206  |        114  |     73  |           43  |              35  |            18  |    3620  |      USA  |         Cadillac DeVille  |             1.5625  |
    10  |      Cadillac  |       Seville  |  Midsize  |    37.5000  |  40.1000  |    42.7000  |        16  |           25  |  Driver & Passenger  |       Front  |          8  |      4.6000  |         295  |  6000  |          1985  |               No  |             20.0000  |           5  |     204  |        111  |     74  |           44  |              31  |            14  |    3935  |      USA  |         Cadillac Seville  |             1.5625  |
    70  |    Oldsmobile  |  Eighty-Eight  |    Large  |    19.5000  |  20.7000  |    21.9000  |        19  |           28  |         Driver only  |       Front  |          6  |      3.8000  |         170  |  4800  |          1570  |               No  |             18.0000  |           6  |     201  |        111  |     74  |           42  |            31.5  |            17  |    3470  |      USA  |  Oldsmobile Eighty-Eight  |         1.47368421  |
    74  |       Pontiac  |      Firebird  |   Sporty  |    14.0000  |  17.7000  |    21.4000  |        19  |           28  |  Driver & Passenger  |        Rear  |          6  |      3.4000  |         160  |  4600  |          1805  |              Yes  |             15.5000  |           4  |     196  |        101  |     75  |           43  |              25  |            13  |    3240  |      USA  |         Pontiac Firebird  |         1.47368421  |
     6  |         Buick  |       LeSabre  |    Large  |    19.9000  |  20.8000  |    21.7000  |        19  |           28  |         Driver only  |       Front  |          6  |      3.8000  |         170  |  4800  |          1570  |               No  |             18.0000  |           6  |     200  |        111  |     74  |           42  |            30.5  |            17  |    3470  |      USA  |            Buick LeSabre  |         1.47368421  |
    13  |     Chevrolet  |        Camaro  |   Sporty  |    13.4000  |  15.1000  |    16.8000  |        19  |           28  |  Driver & Passenger  |        Rear  |          6  |      3.4000  |         160  |  4600  |          1805  |              Yes  |             15.5000  |           4  |     193  |        101  |     74  |           43  |              25  |            13  |    3240  |      USA  |         Chevrolet Camaro  |         1.47368421  |
    76  |       Pontiac  |    Bonneville  |    Large  |    19.4000  |  24.4000  |    29.4000  |        19  |           28  |  Driver & Passenger  |       Front  |          6  |      3.8000  |         170  |  4800  |          1565  |               No  |             18.0000  |           6  |     177  |        111  |     74  |           43  |            30.5  |            18  |    3495  |      USA  |       Pontiac Bonneville  |         1.47368421  |
    56  |         Mazda  |          RX-7  |   Sporty  |    32.5000  |  32.5000  |    32.5000  |        17  |           25  |         Driver only  |        Rear  |     rotary  |      1.3000  |         255  |  6500  |          2325  |              Yes  |             20.0000  |           2  |     169  |         96  |     69  |           37  |              NA  |            NA  |    2895  |  non-USA  |               Mazda RX-7  |         1.47058824  |
    18  |     Chevrolet  |      Corvette  |   Sporty  |    34.6000  |  38.0000  |    41.5000  |        17  |           25  |         Driver only  |        Rear  |          8  |      5.7000  |         300  |  5000  |          1450  |              Yes  |             20.0000  |           2  |     179  |         96  |     74  |           43  |              NA  |            NA  |    3380  |      USA  |       Chevrolet Corvette  |         1.47058824  |
    51  |       Lincoln  |      Town_Car  |    Large  |    34.4000  |  36.1000  |    37.8000  |        18  |           26  |  Driver & Passenger  |        Rear  |          8  |      4.6000  |         210  |  4600  |          1840  |               No  |             20.0000  |           6  |     219  |        117  |     77  |           45  |            31.5  |            22  |    4055  |      USA  |         Lincoln Town_Car  |         1.44444444  |

A Regression Example

The Morpheus API includes a regression interface in order to fit data to a linear model using either OLS, WLS or GLS. The code below uses the same car dataset introduced in the previous example, and regresses Horsepower on EngineSize. The code example prints the model results to standard out, which is shown below, and then creates a scatter chart with the regression line clearly displayed.


//Load the data
var url = "https://d3x.s3.amazonaws.com/data/public/morpheus/cars93.csv";
DataFrame.read(url).csv(Integer.class, options -> {
    options.setExcludeColumnIndexes(0);
    options.setHeader(true);
});

//Run OLS regression and plot
var regressand = "Horsepower";
var regressor = "EngineSize";
data.regress().ols(regressand, regressor, true, model -> {
    IO.println(model);
    var xy = data.cols().select(regressand, regressor);
    Chart.create().withScatterPlot(xy, false, regressor, chart -> {
        chart.title().withText(regressand + " regressed on " + regressor);
        chart.subtitle().withText("Single Variable Linear Regression");
        chart.plot().style(regressand).withColor(Color.RED).withPointsVisible(true);
        chart.plot().trend(regressand).withColor(Color.BLACK);
        chart.plot().axes().domain().label().withText(regressor);
        chart.plot().axes().domain().format().withPattern("0.00;-0.00");
        chart.plot().axes().range(0).label().withText(regressand);
        chart.plot().axes().range(0).format().withPattern("0;-0");
        chart.show();
    });
    return Optional.empty();
});

==============================================================================================
                                   Linear Regression Results                                                            
==============================================================================================
Model:                                   OLS    R-Squared:                            0.5360
Observations:                             93    R-Squared(adjusted):                  0.5309
DF Model:                                  1    F-Statistic:                        105.1204
DF Residuals:                             91    F-Statistic(Prob):                  1.11E-16
Standard Error:                      35.8717    Runtime(millis)                           52
Durbin-Watson:                        1.9591                                                
==============================================================================================
   Index     |  PARAMETER  |  STD_ERROR  |  T_STAT   |   P_VALUE   |  CI_LOWER  |  CI_UPPER  |
----------------------------------------------------------------------------------------------
  Intercept  |    45.2195  |    10.3119  |   4.3852  |   3.107E-5  |    24.736  |   65.7029  |
 EngineSize  |    36.9633  |     3.6052  |  10.2528  |  7.573E-17  |    29.802  |   44.1245  |
==============================================================================================

Data Visualization

Visualizing data in Morpheus DataFrames is made easy via a simple chart abstraction API with adapters supporting both JFreeChart as well as Google Charts (with others to follow by popular demand). This design makes it possible to generate interactive Java Swing charts as well as HTML5 browser based charts via the same programmatic interface.

PASSION AND A CULTURE OF ENGINEERING EXCELLENCE DRIVES US

To Serve As Your Financial Technology Partner

Thank you for your interest! We will be in touch shortly.
Oops! Something went wrong while submitting the form.