Breaking out multipart questions as separate questions

Consider an election question in which there are three candidates who might win: Bush, Trump, and Rubio.  Suppose people forecast the outcome of this question by assigning probability X to Bush, Y to Trump, and Z to Rubio, where probabilities are between 0 and 100.  Only one person can win, so X = 100 – (Y + Z) = 100 – Y – Z.  Intuitively, looking at this expression, the correlation between X and Y ought to be -1, and similarly corr(X,Z) ought to be -1.  However, note that while Y ~ U[0,100], Z ~ U[0, 100–Y] (the simulation below draws it that way, since Y + Z cannot exceed 100).  If Y and Z were independent, their sum would follow an Irwin-Hall distribution.  But given Z ~ U[0, 100–Y], what distribution does Y+Z follow?
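I don't claim to know the answer analytically, but one quick way to look at that last question is to histogram Y + Z under both assumptions.  This is just a sketch using the same discrete setup as the experiment below; the filename sumhist.png is arbitrary:

import numpy as np
import matplotlib.pyplot as plt

N = 100000
Y = np.random.randint(0, 101, N)

# Independent case: Z drawn uniformly on 0..100, ignoring Y.
# The sum of two independent uniforms has a triangular (Irwin-Hall, n=2) shape.
Z_indep = np.random.randint(0, 101, N)

# Constrained case: Z drawn uniformly on 0..(100 - Y), as in the election question.
Z_constr = np.array([np.random.randint(0, 101 - y) for y in Y])

plt.hist(Y + Z_indep, bins=50, alpha=0.5, label="Y + Z, Z independent of Y")
plt.hist(Y + Z_constr, bins=50, alpha=0.5, label="Y + Z, Z <= 100 - Y")
plt.legend()
plt.savefig("sumhist.png")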

I ran a simulation experiment and found the correlation between X and Y to be about -65% and between X and Z to be about -14% for 10,000 paths, and consistently so as the number of paths increases.  Here is the Python code for the experiment:


import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

N = 10000
# Y is uniform on the integers 0..100; Z is uniform on 0..(100 - Y).
Y = np.random.randint(0, 101, N)
Z = np.array([np.random.randint(0, 101 - y) for y in Y])
X = 100 - (Y + Z)

print("rho(x,y)", pearsonr(X, Y))
print("rho(x,z)", pearsonr(X, Z))

plt.clf()
plt.scatter(X, Y)
plt.title("X vs Y, X=100-(Y+Z)")
plt.xlabel("X")
plt.ylabel("Y")
plt.savefig("corrxy.png")

plt.clf()
plt.scatter(X, Z)
plt.title("X vs Z, X=100-(Y+Z)")
plt.xlabel("X")
plt.ylabel("Z")
plt.savefig("corrxz.png")


Here is a scatterplot of X and Y: [corrxy.png]

Here is a scatterplot of X and Z: [corrxz.png]

These plots look virtually identical for 10,000 paths, though Y is denser towards the top of the triangle and Z is denser towards the bottom.

I don’t know how to explain this difference in correlations, and I don’t know what the PDF of 100 – (Y + Z) is.  The former may have to do with the quality of the NumPy uniform random number generator, or it could be due to the Pythagorean Theorem.  The latter is math that is a little above my head.  (Even the PDF of the uncorrelated sum is a bit hairy.)
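If the concern is the random number generator, one way to rule that out is to skip random draws entirely: the discrete setup above is small enough to enumerate exactly.  A sketch (the wcorr helper is just an illustrative weighted-correlation function, not a library call):

import numpy as np

# Enumerate every (y, z) pair the simulation above can produce, with its exact probability:
# P(Y = y) = 1/101 for y in 0..100, and P(Z = z | Y = y) = 1/(101 - y) for z in 0..(100 - y).
ys, zs, ws = [], [], []
for y in range(101):
    for z in range(101 - y):
        ys.append(y)
        zs.append(z)
        ws.append((1.0 / 101) * (1.0 / (101 - y)))
ys, zs, ws = np.array(ys), np.array(zs), np.array(ws)
xs = 100 - (ys + zs)

def wcorr(a, b, w):
    # Weighted Pearson correlation over the enumerated grid.
    ma, mb = np.average(a, weights=w), np.average(b, weights=w)
    cov = np.average((a - ma) * (b - mb), weights=w)
    va = np.average((a - ma) ** 2, weights=w)
    vb = np.average((b - mb) ** 2, weights=w)
    return cov / np.sqrt(va * vb)

# If the simulated -65% and -14% are real and not an RNG artifact,
# these exact population values should come out close to them.
print("corr(X,Y)", wcorr(xs, ys, ws))
print("corr(X,Z)", wcorr(xs, zs, ws))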

The reason this is interesting is that I am forecasting questions in an open forecasting tournament called GJOpen.com.  We are assigned accuracy scores based on the final outcome.  However, it is possible for some parts of a question to become determined before all parts are determined.  For example, consider a question asking whether rainfall by December 31st will be 0–10 inches/year, 10–15 inches/year, or more than 15 inches/year.  If it rains at least 10 inches, then the first bin is closed but the other two are still live.  For such questions, it would be helpful to be scored separately on each bin.  However, the bins are correlated, as above.  The problem, then, is how to accommodate correlated bins when scoring accuracy.
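To make the rainfall example concrete, here is a small sketch of how individual bins can become determined before the whole question resolves.  The resolved_bins helper is purely illustrative and has nothing to do with GJOpen's actual scoring:

# Rainfall bins for the year: 0-10, 10-15, and more than 15 inches.
BINS = [(0, 10), (10, 15), (15, float("inf"))]

def resolved_bins(rain_so_far):
    # A bin is ruled out ("closed") once observed rainfall reaches its upper edge;
    # otherwise it is still a possible outcome ("live").
    return ["closed" if rain_so_far >= high else "live" for low, high in BINS]

# With at least 10 inches already fallen, the first bin is closed, the other two live.
print(resolved_bins(11))   # ['closed', 'live', 'live']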
