Items tagged with data-analysis

Feed

A project that I have been working on is adding some functionality for Cluster Analysis to Maple (a small part of a much bigger project to increase Maple’s toolkit for exploratory data mining and data analysis). The launch of the MapleCloud package manager gave me a way to share my code for the project as it evolves, providing others with some useful new tools and hopefully gathering feedback (and collaborators) along the way.

At this point, there aren’t a lot of commands in the ClusterAnalysis package, but I have already hit upon several interesting applications. For example, while working on a command for plotting clusters of points, one problem I encountered was how to draw the minimal volume enclosing ellipsoid around a group (or cluster) of points. After doing some research, I stumbled upon Khachiyan’s Algorithm, which related to solving linear programming problems with rational data. The math behind this is definitely interesting, but I’m not going to spend any time on it here. For further reading, you can explore the following:

Khachiyan’s Algorithm had previously been applied in some other languages, but to the best of my knowledge, did not have any Maple implementations. As such, the following code is an implementation of Khachiyan’s Algorithm in 2-D, which could be extended to N-dimensional space rather easily.

This routine accepts an Nx2 dataset and outputs either a plot of the minimum volume enclosing ellipsoid (MVEE) or a list of results as described in the details for the ‘output’ option below.

MVEE( X :: DataSet, optional arguments, additional arguments passed to the plotting command );

The optional arguments are as follows:

  • tolerance : realcons;  specifies the convergence criterion
  • maxiterations : posint; specifies the maximum number of iterations
  • output : {identical(data,plot),list(identical(data,plot))}; specifies the output. If output includes plot, then a plot of the enclosing ellipsoid is returned. If output includes data, then the return includes is a list containing the matrix A, which defines the ellipsoid, the center of the ellipse, and the eigenvalues and eigenvectors that can be used to find the semi-axis coordinates and the angle of rotation, alpha, for the ellipse.
  • filled : truefalse; specifies if the returned plot should be filled or not

Code:

#Minimum Volume Enclosing Ellipsoid
MVEE := proc(XY, 
              {tolerance::positive:= 1e-4}, #Convergence Criterion
              {maxiterations::posint := 100},
              {output::{identical(data,plot),list(identical(data,plot))} := data},
              {filled::truefalse := false} 
            )

    local alpha, evalues, evectors, i, l_error, ldata, ldataext, M, maxvalindex, n, ncols, nrows, p1, semiaxes, stepsize, U, U1, x, X, y;
    local A, center, l_output; #Output

    if hastype(output, 'list') then
        l_output := output;
    else
        l_output := [output];
    end if;

    kernelopts(opaquemodules=false):

    ldata := Statistics:-PreProcessData(XY, 2, 'copy');

    nrows, ncols := upperbound(ldata);
    ldataext := Matrix([ldata, Vector[column](nrows, ':-fill' = 1)], 'datatype = float');

    if ncols <> 2 then
        error "expected 2 columns of data, got %1", ncols;
    end if;

    l_error := 1;

    U := Vector[column](1..nrows, 'fill' = 1/nrows);

    ##Khachiyan Algorithm##
    for n to maxiterations while l_error >= tolerance do

        X := LinearAlgebra:-Transpose(ldataext) . LinearAlgebra:-DiagonalMatrix(U) . ldataext;
        M := LinearAlgebra:-Diagonal(ldataext . LinearAlgebra:-MatrixInverse(X) . LinearAlgebra:-Transpose(ldataext));
        maxvalindex := max[index](map['evalhf', 'inplace'](abs, M));
        stepsize := (M[maxvalindex] - ncols - 1)/((ncols + 1) * (M[maxvalindex] - 1));
        U1 := (1 - stepsize) * U;
        U1[maxvalindex] := U1[maxvalindex] + stepsize;
        l_error := LinearAlgebra:-Norm(LinearAlgebra:-DiagonalMatrix(U1 - U));
        U := U1;

    end do;

    A := (1/ncols) * LinearAlgebra:-MatrixInverse(LinearAlgebra:-Transpose(ldata) . LinearAlgebra:-DiagonalMatrix(U) . ldata - (LinearAlgebra:-Transpose(ldata) . U) . LinearAlgebra:-Transpose((LinearAlgebra:-Transpose(ldata) . U)));
    center := LinearAlgebra:-Transpose(ldata) . U;
    evalues, evectors := LinearAlgebra:-Eigenvectors(A);
    evectors := evectors(.., sort[index](1 /~ (sqrt~(Re~(evalues))), `>`, ':-output' = ':-permutation'));
    semiaxes := sort(1 /~ (sqrt~(Re~(evalues))), `>`);
    alpha := arctan(Re(evectors[2,1]) / Re(evectors[1,1]));

    if l_output = [':-data'] then
        return A, center, evectors, evalues;
    elif has( l_output, ':-plot' ) then
            x := t -> center[1] + semiaxes[1] * cos(t) * cos(alpha) - semiaxes[2] * sin(t) * sin(alpha);
            y := t -> center[2] + semiaxes[1] * cos(t) * sin(alpha) + semiaxes[2] * sin(t) * cos(alpha);
            if filled then
                p1 := plots:-display(subs(CURVES=POLYGONS, plot([x(t), y(t), t = 0..2*Pi], ':-transparency' = 0.95, _rest)));
            else
                p1 := plot([x(t), y(t), t = 0..2*Pi], _rest);
            end if;
        return p1, `if`( has(l_output, ':-data'), op([A, center, evectors, evalues]), NULL );
    end if;

end proc:

 

You can run this as follows:

M:=Matrix(10,2,rand(0..3)):

plots:-display([MVEE(M,output=plot,filled,transparency=.3),
                plots:-pointplot(M, symbol=solidcircle,symbolsize=15)],
size=[0.5,"golden"]);

 

 

As it stands, this is not an export from the “work in progress” ClusterAnalysis package – it’s actually just a local procedure used by the ClusterPlot command. However, it seemed like an interesting enough application that it deserved its own post (and potentially even some consideration for inclusion in some future more geometry-specific package). Here’s an example of how this routine is used from ClusterAnalysis:

with(ClusterAnalysis);

X := Import(FileTools:-JoinPath(["datasets/iris.csv"], base = datadir));

kmeans_results := KMeans(X[[`Sepal Length`, `Sepal Width`]],
    clusters = 3, epsilon = 1.*10^(-7), initializationmethod = Forgy);

ClusterPlot(kmeans_results, style = ellipse);

 

 

The source code for this is stored on GitHub, here:

https://github.com/dskoog/Maple-ClusterAnalysis/blob/master/src/MVEE.mm

Comments and suggestions are welcomed.

 

If you don’t have a copy of the ClusterAnalysis package, you can install it from the MapleCloud window, or by running:

PackageTools:-Install(5629844458045440);

 

hello i was wondering is it possible to find the standard error of the coefficents in the NonlinearFit function within the statistical package? or in any other package for that matter?

does someone have angorithme that can find it? 

i've searched around the internet and it does not seem like maple employs this function at all?

hope someone can help:)

Aggregate statistics are calculated by splitting the rows of a DataFrame by each factor in a given column into subsets and computing summary statistics for each of these subsets.

The following is a short example of how the Aggregate command is used to compute aggregate statistics for a DataFrame with housing data:

To begin, we construct a DataFrame with housing data: The first column has number of bedrooms, the second has the area in square feet, the third has price.

bedrooms := <3, 4, 2, 4, 3, 2, 2, 3, 4, 4, 2, 4, 4, 3, 3>:
area := <1130, 1123, 1049, 1527, 907, 580, 878, 1075,
1040, 1295, 1100, 995, 908, 853, 856>:
price := <114700, 125200, 81600, 127400, 88500, 59500, 96500, 113300,
104400, 136600, 80100, 128000, 115700, 94700, 89400>:
HouseSalesData := DataFrame([bedrooms, area, price], columns = [Bedrooms, Area, Price]);

Note that the Bedrooms column has three distinct levels: 2, 3, and 4.

convert(HouseSalesData[Bedrooms], set);

The following returns the mean of all other columns for each distinct level in the column, Bedrooms:

Aggregate(HouseSalesData, Bedrooms);

Adding the columns option controls which columns are returned.

Aggregate(HouseSalesData, Bedrooms, columns = [Price])

Additionally, the tally option returns a tally for each of the levels.

Aggregate(HouseSalesData, Bedrooms, tally)

The function option allows for the specification of any command that can be applied to a DataSeries. For example, the Statistics:-Median command computes the median for each of the levels of Bedrooms.

Aggregate(HouseSalesData, Bedrooms, function = Statistics:-Median);

By default, Aggregate uses the SplitByColumn command to creates a separate sub-DataFrame for every discrete level in the column given by bycolumn.

with(Statistics);
ByRooms := SplitByColumn(HouseSalesData, Bedrooms);

We can create box plots of the price for subgroups of sales defined by number of bedrooms.

BoxPlot( map( (m)->m[Price], ByRooms), 
deciles=false,
datasetlabels=["2 bdrms", "3 bdrms", "4 bdrms"],
color=["Red", "Purple", "Blue"]);

 

I have recorded a short video that walks through this example here: https://youtu.be/e0pqCMyO3ks

The worksheet for this example can be downloaded here: Aggregate.mw

This is the second of three blog posts about working with data sets in Maple.

In my previous post, I discussed how to use Maple to access a large number of data sets from Quandl, an online data aggregator. In this post, I’ll focus on exploring built-in data sets in Maple.

Data is being generated at an ever increasing rate. New data is generated every minute, adding to an expanding network of online information. Navigating through this information can be daunting. Simply preparing a tabular data set that collects information from several sources is often a difficult and time consuming effort. For example, even though the example in my previous post only required a couple of lines of Maple code to merge 540 different data sets from various sources, the effort to manually search for and select sources for data took significantly more time.

In an attempt to make the process of finding data easier, Maple’s built-in country data set collects information on country-specific variables including financial and economic data, as well as information on country codes, population, area, and more.

The built-in database for Country data can be accessed programmatically by creating a new DataSets Reference:

CountryData := DataSets:-Reference( "builtin", "country" );

This returns a Reference object, which can be further interrogated. There are several commands that are applicable to a DataSets Reference, including the following exports for the Reference object:

exports( CountryData, static );

The list of available countries in this data set is given using the following:

GetElementNames( CountryData );

The available data for each of these countries can be found using:

GetHeaders( CountryData );

There are many different data sets available for country data, 126 different variables to be exact. Similar to Maple’s DataFrame, the columns of information in the built-in data set can be accessed used the labelled name.

For example, the three-letter country codes for each country can be returned using:

CountryData[.., "3 Letter Country Code"];

The three-letter country code for Denmark is:

CountryData["Denmark", "3 Letter Country Code"];

Built-in data can also be queried in a similar manner to DataFrames. For example, to return the countries with a population density less than 3%:

pop_density := CountryData[ .., "Population Density" ]:
pop_density[ `Population Density` < 3 ];

At this time, Maple’s built-in country data collection contains 126 data sets for 185 countries. When I built the example from my first post, I knew exactly the data sets that I wanted to use and I built a script to collect these into a larger data container. Attempting a similar task using Maple’s built-in data left me with the difficult decision of choosing which data sets to use in my next example.

So rather than choose between these available options, I built a user interface that lets you quickly browse through all of Maple’s collection of built-in data.

Using a couple of tricks that I found in the pages for Programmatic Content Generation, I built the interface pictured above. (I’ll give more details on the method that I used to construct the interface in my next post.)

This interface allows you to select from a list of countries, and visualize up to three variables of the country data with a BubblePlot. Using the preassigned defaults, you can select several countries and then visualize how their overall number of internet users has changed along with their gross domestic product. The BubblePlot visualization also adds a third dimension of information by adjusting the bubble size according to the relative population compared with the other selected countries.

Now you may notice that the list of available data sets is longer than the list of available options in each of the selection boxes. In order to be able to generate BubblePlot animations, I made an arbitrary choice to filter out any of the built-in data sets that were not of type TimeSeries. This is something that could easily be changed in the code. The choice of a BubblePlot could also be updated to be any other type of Statistical visualization with some additional modifications.

You can download a copy of this application here: VisualizingCountryDataSets.mw

You can also interact with it via the MapleCloud: http://maplecloud.maplesoft.com/application.jsp?appId=5743882790764544

I’ll be following up this post with an in-depth post on how I authored the country selector interface using programmatic content generation.

 

how to call matlab to run k means on the data and

how to show which circle do the row of data belong to?

Page 1 of 1