Problem:

Suppose you have a bunch of 2D data points which:

  1. May include points with the same x-value but different y-values; and
  2. May be unevenly-spaced with respect to the x-values.

How do you clean up the data so that, for instance, you are free to construct a connected data plot, or perform a Discrete Fourier Transform? Please note that Curve Fitting and the Lomb–Scargle Method, respectively, are more effective techniques for these particular applications. Let's start with a simple example for illustration. Consider this Matrix:

A := < 2, 5; 5, 8; 2, 1; 7, 8; 10, 10; 5, 7 >;

Consolidate:

First, sort the rows of the Matrix by the first column, and extract the sorted columns separately:

P := sort( A[..,1], output=permutation ); # permutation to sort rows by the values in the first column
U := A[P,1]; # sorted column 1
V := A[P,2]; # sorted column 2

We can regard the sorted bunches of distinct values in U as a step in a stair case, and the goal is replace each step with the average of the y-values in V located on each step.

Second, determine the indices for the first occurrences of values in U, by selecting the indices which give a jump in x-value:

m := numelems( U );
K := [ 1, op( select( i -> ( U[i] > U[i-1] ), [ seq( j, j=2..m ) ] ) ), m+1 ];
n := numelems( K );

The element m+1 is appended for later convenience. Here, we can quickly define the first column of the consolidated Matrix:

X1 := U[K[1..-2]];

Finally, to define the second column of the consolidated Matrix, we take the average of the values in each step, using the indices in K to tell us the ranges of values to consider:

Y1 := Vector[column]( n-1, i -> add( V[ K[i]..K[i+1]-1 ] ) / ( K[i+1] - K[i] ) );

Thus, the consolidated Matrix is given by:

B := < X1 | Y1 >;

Spread Evenly:

To spread-out the x-values, we can use a sequence with fixed step size:

X2 := evalf( Vector[column]( [ seq( X1[1]..X1[-1], (X1[-1]-X1[1])/(m-1) ) ] ) );

For the y-values, we will interpolate:

Y2 := CurveFitting:-ArrayInterpolation( X1, Y1, X2, method=linear );

This gives us a new Matrix, which has both evenly-spaced x-values and consolidated y-values:

C := < X2 | Y2 >;

Plot:

plots:-display( Array( [
        plots:-pointplot( A, view=[0..10,0..10], color=green, symbol=solidcircle, symbolsize=15, title="Original Data", font=[Verdana,15] ),
        plots:-pointplot( B, view=[0..10,0..10], color=red, symbol=solidcircle, symbolsize=15, title="Consolidated Data", font=[Verdana,15] ),
        plots:-pointplot( C, view=[0..10,0..10], color=blue, symbol=solidcircle, symbolsize=15, title="Spread-Out Data", font=[Verdana,15] )
] ) );

Sample Data with Noise:

For another example, let’s take data points from a logistic curve, and add some noise:

# Noise generators
f := 0.5 * rand( -1..1 ):
g := ( 100 - rand( -15..15 ) ) / 100:

# Actual x-values
X := [ seq( i/2, i=1..20 ) ];

# Actual y-values
Y := evalf( map( x -> 4 / ( 1 + 3 * exp(-x) ), X ) );

# Matrix of points with noise
A := Matrix( zip( (x,y) -> [x,y], map( x -> x + f(), X ), map( y -> g() * y, Y ) ) );

Using the method outlined above, and the general procedures defined below, define:

B := ConsolidatedMatrix( A );
C := EquallySpaced( B, 21, method=linear );

Visually:

plots:-display( Array( [
    plots:-pointplot( A, view=[0..10,0..5], symbol=solidcircle, symbolsize=15, color=green, title="Original Data", font=[Verdana,15] ),
    plots:-pointplot( B, view=[0..10,0..5], symbol=solidcircle, symbolsize=15, color=red, title="Consolidated Data", font=[Verdana,15]  ),
    plots:-pointplot( C, view=[0..10,0..5], symbol=solidcircle, symbolsize=15, color=blue, title="Spread-Out Data", font=[Verdana,15] )
] ) );

  

Generalization:

Below are more generalized custom procedures, which are used in the above example. These also account for special cases.

# Takes a matrix with two columns, and returns a new matrix where the new x-values are unique and sorted,
# and each new y-value is the average of the old y-values corresponding to the x-value.
ConsolidatedMatrix := proc( A :: 'Matrix'(..,2), $ )

        local i, j, K, m, n, P, r, U, V, X, Y:
  
        # The number of rows in the original matrix.
        r := LinearAlgebra:-RowDimension( A ):

        # Return the original matrix should it only have one row.
        if r = 1 then
               return A:
        end if:

        # Permutation to sort first column of A.
        P := sort( A[..,1], ':-output'=permutation ):       

        # Sorted first column of A.
        U := A[P,1]:

        # Corresponding new second column of A.
        V := A[P,2]:

        # Return the sorted matrix should all the x-values be distinct.
        if numelems( convert( U, ':-set' ) ) = r then
               return < U | V >:
        end if:

        # Indices of first occurrences for values in U. The element m+1 is appended for convenience.
        m := numelems( U ):
        K := [ 1, op( select( i -> ( U[i] > U[i-1] ), [ seq( j, j=2..m ) ] ) ), m+1 ]:
        n := numelems( K ):

        # Consolidated first column.
        X := U[K[1..-2]]:

        # Determine the consolidated second column, using the average y-value.
        Y := Vector[':-column']( n-1, i -> add( V[ K[i]..K[i+1]-1 ] ) / ( K[i+1] - K[i] ) ):

        return < X | Y >:

end proc:

# Procedure which takes a matrix with two columns, and returns a new matrix of specified number of rows
# with equally-spaced x-values, and interpolated y-values.
# It accepts options that can be passed to ArrayInterpolation().
EquallySpaced := proc( M :: 'Matrix'(..,2), m :: posint )

        local A, i, r, U, V, X, Y:

        # Consolidated matrix, the corresponding number of rows, and the columns.
        A := ConsolidatedMatrix( M ):
        r := LinearAlgebra:-RowDimension( A ):
        U, V := evalf( A[..,1] ), evalf( A[..,2] ):

        # If the consolidated matrix has only one row, return it.
        if r = 1 then
               return A:
        end if:

        # If m = 1, i.e. only one equally-spaced point is requested, then return a matrix of the averages.
        if m = 1 then
               return 1/r * Matrix( [ [ add( U ), add( V ) ] ] ):
        end if:

        # Equally-spaced x-values.
        X := Vector[':-column']( [ seq( U[1]..U[-1], (U[-1]-U[1])/(m-1), i=1..m ) ] ):

        # Interpolated y-values.
        Y := CurveFitting:-ArrayInterpolation( U, V, X, _rest ):    

        return < X | Y >:

end proc:

Worth Checking Out:

 

Please Wait...