DataFrames: An example from the 2020 U.S. Presidential election

(Or why DataFrames are more powerful and readable than spreadsheets.)


In this example of working with DataFrames, the goal is to use a spreadsheet from a website, which contains polling data, to estimate the probability each of the two candidates from the major parties will win the US Presidential election in November.  I first tried doing the calculations with a spreadsheet, but I discovered DataFrames was far more powerful. Warning: This worksheet uses live data. Hence the outcome at the end of the worksheet is likely to change daily. A more extensive example with even more common DataFrame operations should be available soon.


How the US Presidential election works - highly simplified version: In the US there are only two parties for which their candidate could win the election:  the Democratic party and Republican party. The Republican party is often referred to as the "Grand Old Party", or GOP. Each state executes its own election. The candidate who receives the most votes wins the states "electoral votes" (EV). The number of the electoral votes for each state is essentially proportional to the population of the state. A candidate who receives a total of 270 or more EVs out of 538, is declared the president of the US for the next term, which starts January 20 of 2021.


Creating DataFrame from web based data:

First I download the data from the website. It is a CSV spreadsheet.


restart; interface(displayprecision = 3); interface(rtablesize = [6, 8]); web_data := Import("")


Each row contains information about a poll conducted in one of the states.  The first poll starts on row 2, hence the number of polls are:

Npolls := upperbound(web_data, 1)-1


Now I want to create a new DataFrame containing only the most useful information. In web_data, many are the columns are not important. However I do want to keep the column label names from those columns I wish to retain.


web_data_cols := [1, 3, 4, 5, 6]; column_labels := convert(web_data[1, web_data_cols], list)

["Day", "State", "EV", "Dem", "GOP"]


Because  the first poll in web_data is labeled 2, I would like to relabel all the polls starting from 1

row_labels := [seq(1 .. Npolls)]


Creating a DataFrame from a Matrix or another DataFrame:  (with row labels and column labels)


Now I can build the DataFrame that I will be working with:


poll_data := DataFrame(web_data[2 .. (), web_data_cols], 'columns' = column_labels, 'rows' = row_labels)


What each column means

* "Day" - day of the year in 2020 when the poll within the state was halfway completed. The larger the value, the more recent the poll.

* "State" - the state in the US where the poll was conducted. The candidate that receives the most votes "wins the state".

* "EV" - the number of electoral votes given to the candidate who receives the most votes within the state.

* "Dem" - the percentage of people who said they are going to vote for the candidate from the Democratic party.

* "GOP" - the percentage of people who said they are going to vote for the candidate from the Republican party.


By using the sort function, using the `>` operator, I can see which polls are the more recent. (If you run the worksheet yourself, the outcome will change as more polls are added to the website spreadsheet.)

poll_data := sort(poll_data, "Day", `>`)



Selecting Unique entries - by column values:

For the my simple analysis, I will use only the most recent poll, one from each state. Hence, using AreUnique, I can pull the first row that matches a state name. This new DataFrame called states.


states := poll_data[AreUnique(poll_data["State"])]


(Note, one of the "states" is the District of Columbia, D.C., which is why there are 51 rows.)


Removing a column: (and relabeling rows)

This next example isn't necessary, but shows some of the cool features of DataFrames.


Since there is only 1 entry per state, I'm going to remove the "State" column and relabel all the rows with the state names

state_names := convert(states["State"], list); states := DataFrame(Remove(states, "State"), 'rows' = state_names)




Indexing by row labels:

This allow me to to display information by individual states. What is the data for California, Maine and Alaska?

states[["California", "Maine", "Alaska"], () .. ()]



Mathematics with multiple-columns:


My preference is to work with fractions, rather than percentages. Hence I want all the values in the "Dem" and "GOP" to be divided by 100 (or multiplied by 1/100).  Treating each column like a vector, the multiplication is performed individually on each cell. This is what the tilda, "~", symbol performs.

states[["Dem", "GOP"]] := `~`[`*`](states[["Dem", "GOP"]], 1/100.); states



Mathematics: using a function to calculate a column


For the next action, I want to use the power of the Statistics package to create a "probability of winning the state" function.


For simplicity, I will assume the outcome of the voting in a state is purely random, but is conditional to popularity of each candidate as measured by the polls. I'll assume the likelihood of an outcome follows a normal (Gaussian) distribution with the peak being at point where the difference of the polling of the two candidates is zero. (Note, other than 2016, where there was an unusually larger percentage of undecided voters on election day, this simple model is reasonable accurate. For example, in 2012, of the states which appeared to be the "closest", the winner over-performed his polling in half of them, and under-performed in the other half with a mean difference of nearly zero.)  From previous elections, the standard deviation of differences between polling values and the actual outcome is at most 0.05, however, it does increase with the fraction of undecided voters.


To mathematically model this situation, I have chosen to use the "Cumulative Density Function" CDF in the Statistics package. It will calculate the probability that a candidate polling with fraction f1 wins the election if the other candidate is polling with fraction f2.  The variable u is the fraction of undecided voters. It is included in the calculation to increase the spread of the possible outcomes.


win_prob := Statistics:-CDF(Statistics:-RandomVariable(Normal(0., 0.5e-1+(1/4)*u)), f1-f2)



Converting this expression into a function using the worst named function in Maple, unapply:

win_prob_f := unapply(evalf(win_prob), [f1, f2, u])

proc (f1, f2, u) options operator, arrow; .5000000000+.5000000000*erf(.7071067810*(f1-1.*f2)/(0.5e-1+.2500000000*u)) end proc


Now I can calculate a DataFrames column of the "win probability", in this case, for the candidate from the Democratic platy. By apply the function, individually, using the columns "Dem" and "GOP", I produce:

dem_win_prob := `~`[win_prob_f](states["Dem"], states["GOP"], `~`[`-`](1, `~`[`+`](states["Dem"], states["GOP"])))


Appending a column:


I can add this column to the end of the states with the label "DemWinProb":


states := Append(states, dem_win_prob, label = "DemWinProb")



Mathematics of adding the entries of a column:


How many electoral votes are available? add them up.

Total_EV := add(states["EV"])



While the number of EV a candidate wins is discrete, I can use the "win probability" from each state to estimate the total number of EV each of the candidates might win. This means adding up number of EV in each state times, individually, the probability of winning that state:

Dem_EV := round(add(`~`[`*`](states["EV"], states["DemWinProb"])))


Currently, the candidate from the Democratic party is likely to win more then 300 electoral vtes.


What about for the candidate from the Republican / "GOP" party?

gop_win_prob := `~`[win_prob_f](states["GOP"], states["Dem"], `~`[`-`](1, `~`[`+`](states["Dem"], states["GOP"]))); GOP_EV := round(add(`~`[`*`](states["EV"], gop_win_prob)))


Summing the two EV values, we obtain the total number of electoral votes.






Please Wait...