# Distributome Activity on Sample Sizes and the Accuracy of Polls

## Introduction

Surveys about public opinions on controversial social issues are becoming increasingly frequent as topics such as the legalization of marijuana, abortion policy, marriage rights for homosexuals, and immigration policy are hotly debated in the media.

For example, both the Opinion Research Corporation (polling for CNN) and the Pew Research Center for The People and The Press conducted surveys of American adults in the spring of 2011 to estimate the percentage of the public that favors the legalization of marijuana. The sample sizes were 824 for the Opinion Research Corporation poll and 1504 for the Pew poll.

## Goals

This activity illustrates the inter-distribution relationships between the Cauchy, Student’s T, and Standard Normal (Gaussian) distributions. These relationships are used to provide a better understanding of how strongly sample sizes are related to the accuracy of polls.

## Hands-on Activity

In this activity you may assume that both pollsters use similar techniques: telephone interviews, weighting of the answers given by individuals to align the respondents’ demographics with population values, and finally averaging to produce unbiased and essentially normally distributed estimates. Below are 4 related, but complementary, problems regarding this study.

Note: The problems below may be appropriate for an undergraduate course in probability. The last part (Problem 4) would be more appropriate for a master’s-level course (and should have a General Cauchy distribution tag and a tag for its relationship to the bivariate normal).

## Specific Problems and their Solutions

### Problem 1: Difference in Poll Accuracies?

The Pew poll had almost twice the sample size of the Opinion Research Corporation poll. What is the chance that it was more accurate than that poll for estimating p = the percentage of American adults that favored the legalization of marijuana in spring, 2011? Be sure to clearly define how you are interpreting “more accurate.” Also state and justify any assumptions you make in solving for this probability.
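One way to make “more accurate” precise is to ask for $$P[|\hat{p_1}-p| < |\hat{p_2}-p|]$$, where $$\hat{p_1}$$ is the Pew estimate and $$\hat{p_2}$$ the Opinion Research Corporation estimate. The sketch below is only an illustration under stated assumptions: the two errors are independent, unbiased and normal, with variances $$\sigma^2/1504$$ and $$\sigma^2/824$$ for a common $$\sigma^2$$. Under those assumptions the ratio of the two errors is a Cauchy variable with scale $$\sqrt{824/1504}$$, which gives a closed form. The function name is hypothetical.

```python
import math

def prob_larger_poll_more_accurate(n_large, n_small):
    """P(|error of larger poll| < |error of smaller poll|).

    Assumption (not from the polls themselves): independent, unbiased,
    normal errors with variances sigma^2 / n. The ratio
    error_large / error_small is then Cauchy with scale sqrt(n_small / n_large),
    and P(|ratio| < 1) = (2 / pi) * atan(sqrt(n_large / n_small)).
    """
    return (2.0 / math.pi) * math.atan(math.sqrt(n_large / n_small))

pew_n, orc_n = 1504, 824
print(round(prob_larger_poll_more_accurate(pew_n, orc_n), 3))
```

With these sample sizes the probability comes out at roughly 0.59, and it equals exactly 0.5 when the two sample sizes are the same.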

### Problem 2: Pooling Data across Polls?

Describe how you would combine the data from these two polls to form a single estimate of $$p$$.

The obvious choice is to propose a linear combination of the two estimates weighting inversely proportional to the variances to get the smallest overall variance amongst such linear combinations.
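A sketch of that weighting, assuming the two estimates are independent and unbiased with variances $$\sigma^2/n_1$$ and $$\sigma^2/n_2$$ for a common $$\sigma^2$$, where $$n_1=1504$$ (Pew) and $$n_2=824$$ (Opinion Research Corporation); the weight $$w$$ is a symbol introduced here for illustration:

$$\hat{p}=w\hat{p_1}+(1-w)\hat{p_2}, \qquad Var(\hat{p})=w^2\frac{\sigma^2}{n_1}+(1-w)^2\frac{\sigma^2}{n_2}$$

Setting the derivative with respect to $$w$$ to zero gives $$\frac{w}{n_1}=\frac{1-w}{n_2}$$, so

$$w=\frac{n_1}{n_1+n_2}=\frac{1504}{2328}, \qquad \hat{p}=\frac{1504}{2328}\hat{p_1}+\frac{824}{2328}\hat{p_2}, \qquad Var(\hat{p})=\frac{\sigma^2}{2328}$$

Under these assumptions the combined estimate is the proportion obtained by pooling all 2328 respondents.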

### Problem 3: Are these probability estimates correlated?

What is the correlation between your estimate above and the individual estimate produced by the Pew poll?

Note that $$Cov(\hat{p},\hat{p_1})=Cov(\frac{1504}{2328}\hat{p_1}+\frac{824}{2328}\hat{p_2}, \hat{p_1})=\frac{1504}{2328}\sigma_1^2=\frac{\sigma^2}{2328}$$.
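Completing the calculation under the same assumptions (independent polls, common $$\sigma^2$$), with $$Var(\hat{p})=\frac{\sigma^2}{2328}$$ and $$Var(\hat{p_1})=\frac{\sigma^2}{1504}$$:

$$Corr(\hat{p},\hat{p_1})=\frac{\sigma^2/2328}{\sqrt{\frac{\sigma^2}{2328}\cdot\frac{\sigma^2}{1504}}}=\sqrt{\frac{1504}{2328}}\approx 0.80$$

The unknown $$\sigma^2$$ cancels, so the correlation depends only on the two sample sizes.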

### Problem 4: Accuracy of probability estimates?

What is the probability that your combined estimate (from the second problem) is more accurate than the estimate based only on the Pew poll?

We want $$P[|\hat{p}-p| < |\hat{p_1}-p| ]$$.
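As a numerical check on this probability, the sketch below uses Monte Carlo simulation rather than the generalized Cauchy CDF. It assumes independent normal poll errors with variances proportional to $$1/1504$$ and $$1/824$$ and the pooled estimate $$\hat{p}=\frac{1504}{2328}\hat{p_1}+\frac{824}{2328}\hat{p_2}$$. The cross-check uses the standard fact that, for a zero-mean bivariate normal pair with correlation $$r$$, the product is positive with probability $$\frac{1}{2}+\frac{\arcsin r}{\pi}$$. All function names are hypothetical.

```python
import math
import random

def simulate_combined_vs_pew(n1=1504, n2=824, trials=200_000, seed=0):
    """Estimate P(|p_hat - p| < |p_hat_1 - p|) by simulation.

    Assumes independent normal errors e1, e2 with variances 1/n1 and 1/n2
    (the common factor sigma^2 cancels out of the comparison).
    """
    rng = random.Random(seed)
    w1 = n1 / (n1 + n2)
    w2 = n2 / (n1 + n2)
    s1 = 1.0 / math.sqrt(n1)
    s2 = 1.0 / math.sqrt(n2)
    hits = 0
    for _ in range(trials):
        e1 = rng.gauss(0.0, s1)          # error of the Pew estimate
        e2 = rng.gauss(0.0, s2)          # error of the other poll
        combined_error = w1 * e1 + w2 * e2
        if abs(combined_error) < abs(e1):
            hits += 1
    return hits / trials

def closed_form_combined_vs_pew(n1=1504, n2=824):
    """Cross-check: with U the combined error and V the Pew error,
    |U| < |V| iff (V - U)(V + U) > 0, and Corr(V - U, V + U)
    simplifies to sqrt((n - n1) / (n + 3 * n1)) with n = n1 + n2."""
    n = n1 + n2
    r = math.sqrt((n - n1) / (n + 3 * n1))
    return 0.5 + math.asin(r) / math.pi

print(round(simulate_combined_vs_pew(), 3), round(closed_form_combined_vs_pew(), 3))
```

Both numbers should land near 0.61, which is only a modest gain over a coin flip even though the combined estimate uses about 55% more respondents than the Pew poll alone.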

## Generalized Cauchy distribution CDF derivation

To derive the generalized Cauchy distribution CDF directly, we start with the bivariate normal distribution of $$X$$ and $$Y$$:

## Cauchy, Student’s T and Gaussian distribution interrelations

The Student’s T-distribution represents a one-parameter homotopy path connecting the Cauchy and Gaussian distributions:
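A small numerical illustration of this path, using only the standard density formulas: with 1 degree of freedom the Student’s T density coincides with the standard Cauchy density, and as the degrees of freedom grow it approaches the standard normal density. The function names are chosen here for illustration.

```python
import math

def student_t_pdf(x, nu):
    """Density of Student's T with nu degrees of freedom (log-gamma for stability)."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(x * x / nu))

def cauchy_pdf(x):
    """Density of the standard Cauchy distribution."""
    return 1.0 / (math.pi * (1.0 + x * x))

def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

x = 1.5
print("Cauchy:", cauchy_pdf(x), " Normal:", normal_pdf(x))
for nu in (1, 2, 5, 30, 1000):
    # The T density moves from the Cauchy value toward the normal value.
    print(nu, student_t_pdf(x, nu))
```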

## Conclusions

Increasing the sample size may help significantly in certain situations – but not as much as intuition often suggests.

# Homicides Trend Activity

Distributome Homicide Trends Activity

## Overview

A Columbus Dispatch newspaper story on Friday, January 1, 2010 discussed a drop in the number of homicides in the city the previous year. Here are the first few paragraphs from the article:

• Homicides take big drop in city: Trend also being seen nationally, but why is a mystery.
• The number of homicides in Columbus dropped 25 percent last year after spiking in 2008. As of last night, the city was expected to close out 2009 with 83 homicides, 27 fewer than in 2008, according to records kept by police and The Dispatch. In 2007, 79 people were slain in Columbus. “I don’t know that there’s one reason for homicides going up or down,” said Lt. David Watkins, supervisor of the Police Division’s homicide unit.
• Why one year do we have 130, and then the next year we have 80?
• “You just can’t explain it,” Sgt. Dana Norman said. He supervises the third-shift squad that investigated 44 of last year’s homicides, which occurred at a rate of 11.1 for every 100,000 people in Columbus, based on recent population estimates …

A table appearing with the article showed that there were 568 homicides in the previous 6 years.

## Hands-on Activity

Sergeant Norman’s statement, “You just can’t explain it,” presents an intriguing probability question: is it possible that natural random fluctuation might be a good explanation? Let’s consider probability models for the number of observed crimes and how they might fluctuate, to see whether the data mentioned in the article are unusual.

• If homicides are rare events that might be independently perpetrated by individuals in a large population – what distribution would approximately describe the number of murders in a year?

A reasonable model would be the Poisson distribution (since the mean is quite large, a normal model with equal mean and variance would be an alternative approximation).

• Suppose the expected annual number of homicides in the city is denoted by $$\lambda$$ and that the number of homicides is independent from year to year. The article notes that 2008 saw a “spike” in the number of homicides and in fact that was the highest number in the last six years. If nothing is going on except random fluctuations – we want to know if observing 27 fewer homicides in 2009 after the peak year is unusual (peak here meaning the highest in the last 6 years).

Use the Distributome Poisson simulator for the model you specified above to examine the distribution of the change in the number of homicides you would see following the peak of a six-year stretch. Does the 27-homicide drop seem unusual? Explain.

See a Hint

To get started, you will need
i) to find an estimate of $$\lambda$$ to use in your simulations, and
ii) to examine groups of 7 years of simulated homicide data and isolate those cases that satisfy the conditions of the problem.

See the First part of the Answer

There were 568 homicides in the preceding six years, so a reasonable estimate of $$\lambda$$ would be $$\lambda \equiv \frac{568}{6} \equiv \frac{284}{3} \approx 94.67$$. From a simulation of 100,000 sets of six independent Poisson variables, we find the maximum would have a distribution with a histogram that looks like this image.

See the Second part of the Answer

The difference between this maximum and another independent $$Poisson(\lambda \approx 94.67)$$ variable would have a histogram that looks like the following:

The shaded region corresponds to values of at least 27, which occur about 12% of the time, so the drop in homicides in Columbus would not be particularly unusual when nothing is happening but regular random fluctuations.
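The simulation described in this answer can be sketched in a few lines of Python. This is a stand-in for the Distributome Poisson simulator, not its actual code. It assumes independent Poisson counts with $$\lambda = 568/6$$, draws them with Knuth’s multiplication method, and uses fewer repetitions than the 100,000 mentioned above to keep the run short. The function names are hypothetical.

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method
    (adequate for a moderate lam such as the one used here)."""
    threshold = math.exp(-lam)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def drop_after_peak_probability(lam=568 / 6, drop=27, years=6, trials=5_000, seed=0):
    """Estimate P(max of `years` Poisson counts - one further independent count >= drop)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        peak = max(poisson_sample(lam, rng) for _ in range(years))
        next_year = poisson_sample(lam, rng)
        if peak - next_year >= drop:
            hits += 1
    return hits / trials

print(drop_after_peak_probability())
```

The estimate should fall in the neighborhood of the 12% quoted above, up to simulation noise.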

## Alternative Approach

This problem might also be viewed as an example of the regression effect where you should expect a regression to the mean following a very high observed value.

## Conclusions
When viewing a random process over time it is the extremes that make the headlines – so the probability models we should use to answer the question “What is unusual?” should be probability models about extremes.

# Colorblindness Activity

Distributome Colorblindness Activity

## Overview

Colorblindness – Can you see the number in this image?

This Distributome Activity illustrates an application of probability theory to the study of colorblindness, typically a genetic disorder that results from an abnormality on the X chromosome. The condition is thus rarer in women, since a woman would need to have the abnormality on both of her X chromosomes in order to be colorblind (whether a woman has the abnormality on one X chromosome is essentially independent of having it on the other).

## Goals

The goal of this activity is to demonstrate an efficient protocol for estimating the probability that a randomly chosen individual is colorblind.

## Hands-on Activity

Suppose that $$p$$ is the probability that a randomly selected *man* is colorblind.

• 100 men are selected at random. What is the distribution of $$X_m$$ = the number of these men that are colorblind?
• 100 women are selected at random. What is the distribution of $$X_f$$ = the number of these women that are colorblind?

• To estimate the probability that a randomly selected woman is colorblind, you might use the proportion of colorblind women in a sample of n women. What is the variance of this estimator?
• Alternatively, to estimate the probability that a randomly selected woman is colorblind, you might use the square of the proportion of colorblind men in a sample of n men. Explain why this estimate makes sense. What is the variance of this estimator?
• For large samples, is it better to use a sample of men or a sample of women to estimate the probability that a randomly selected woman is colorblind? Explain.

## Alternate approach

You can also use the delta method to find the approximate variance for the estimator above.
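A sketch of that calculation, assuming $$X_m$$ is Binomial with parameters $$n$$ and $$p$$, writing $$\hat{p}_m = X_m/n$$ (a symbol introduced here), and applying the delta method with $$g(x)=x^2$$:

$$Var(\hat{p}_m^2) \approx [g'(p)]^2 \, Var(\hat{p}_m) = (2p)^2\frac{p(1-p)}{n} = \frac{4p^3(1-p)}{n}$$

For comparison, if a woman is colorblind with probability $$p^2$$ (the independence assumption from the overview), the proportion of colorblind women in a sample of $$n$$ women has variance

$$\frac{p^2(1-p^2)}{n} = \frac{p^2(1-p)(1+p)}{n}$$

The ratio of the two variances is $$\frac{4p}{1+p}$$, which is below 1 whenever $$p < \frac{1}{3}$$. So for a rare condition and large samples, the squared proportion from the men’s sample has the smaller approximate variance.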

## Conclusions

In practice, it may be difficult to obtain reliable parameter estimates when the event at hand is very rare (such as colorblindness in women). The use of a valid probability model, such as the relationship between the chance of colorblindness in men and the chance in women, may improve these estimates.

# Distributome Blog allows LaTeX Post Editing Using MathJax

The Distributome Blog now allows editing using MathJax-based math typography. For example:

• Typing `$$\int_{\pi}^{\infty}{\ln (x) dx}$$` in a post
• generates this: $$\int_{\pi}^{\infty}{\ln (x) dx}$$

For hidden fields, you need to use the following alternatives to commonly used TeX/LaTeX syntax, because the JavaScript code behind the MathJax plug-in and the code behind the Hidden-answers plug-in are incompatible:

• For the equals sign “=”, use the `\equiv` symbol ($$\equiv$$)
• For the vertical bar “|”, use the `\vert` symbol ($$\vert$$)
• Other LaTeX/TeX symbols may also need alternatives for MathJax math typesetting in hidden fields.

# Distributome Navigator: Ontology/Hierarchical Graph Display

We introduced a new Distributome.xml.pref file, which allows customization of the look-and-feel of the Distributome Navigator and Editor.

One example of this customization is the ability to display hierarchically the Ontology of the collection of distributions contained in the Distributome XML DB. That is, we have a mechanism to render the nodes and edges at 3 levels: Top, Middle or All/Complete (according to the pref file). The figure below illustrates this new hierarchical Distributome Navigator display.

# Distributome BibTeX citation manager

We have finalized the new format for the Distributome meta-data: the (XML) distributions and relations (Distributome.xml) and the (BibTeX) bibliographical citations (Distributome.bib).

There is a new Distributome DB/Meta-data HTML validator which renders the entire database, including references and citation URL links into a dynamic HTML webpage.

## Background/Initial Proposal

The Distributome meta-data editor will provide an interactive bibliography BibTeX citation manager. This prototype contains an example demonstrating how we can elegantly handle Citations/references (parsing, editing, writing, etc.) using pure HTML5/JavaScript:

1. Background: Using BibTex-js project
2. Example HTML (Distributome_BibTeX.html) that interactively consumes raw BibTeX source files, converts them to JS/JSON and displays the references in an HTML page.
3. An example of a raw BibTeX source file (BibTeX_ExampleCitations.bib). These BibTeX sources can easily be obtained by users from the “Citation Download” or “Export Citation” links on most publishers’ web-sites (See this example). So, these BibTeX sources are very easy to copy-paste into the Distributome References Editor Panel from another web-browser window.
4. A JavaScript library (bibtex_js.js), which we may need to extend, that allows parsing BibTeX files and generating the JSON constructs that are then rendered in the HTML (for example, during editing in the Bibliography/References panel of the Distributome Editor, or in the References tab of the Distributome Navigator).

We decouple the references section (Distributome_BibTeX.bib) from the main Distributome.xml DB, because the Distributome BibTeX references may grow very large, which would make a single combined file error-prone. Thus we’ll need a way to reference publications (from Distributome_BibTeX.bib) in the Distributome.xml. This can be achieved with the DOI (unique Digital Object Identifier) or the URL that every publication has. So inside the <cite>Publication_DOI_or_URL</cite> tag of distribution-nodes or relation-edges in the main Distributome.xml DB, we’ll just have pointers to unique DOIs; the same unique DOI/URL will be available in the Distributome_BibTeX.bib source file. Hence we can pair the references by their unique DOIs/URLs.

Then inside the Distributome_BibTeX.bib, each reference will have its unique DOI. This will enable on-demand linking of the bibliographic/reference meta-data contained in Distributome_BibTeX.bib from inside the Distributome.xml and the Navigator itself. This may be an easy, clean and scalable approach.
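The DOI-based pairing can be illustrated with a minimal Python sketch. It is not the project’s HTML5/JavaScript implementation. Only the `<cite>` tag comes from the description above; the other XML element names, the DOIs, the citation keys and the function names are all placeholders invented for the example, and the BibTeX handling is a deliberately simple regular expression rather than a full parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical miniature of Distributome.xml; only <cite> is taken from the text.
XML_SNIPPET = """
<distributome>
  <distribution>
    <name>Example Distribution</name>
    <cite>10.0000/example.doi.1</cite>
  </distribution>
  <relation>
    <cite>10.0000/example.doi.2</cite>
  </relation>
</distributome>
"""

# Hypothetical miniature of Distributome_BibTeX.bib with placeholder entries.
BIB_SNIPPET = """
@article{example1,
  title = {A Placeholder Title},
  doi = {10.0000/example.doi.1}
}
@article{example2,
  title = {Another Placeholder Title},
  doi = {10.0000/example.doi.2}
}
"""

def index_bib_by_doi(bib_text):
    """Map each DOI to its BibTeX citation key (flat fields only)."""
    index = {}
    for match in re.finditer(r"@\w+\{([^,]+),(.*?)\n\}", bib_text, flags=re.S):
        key, body = match.group(1).strip(), match.group(2)
        doi = re.search(r"doi\s*=\s*\{([^}]*)\}", body, flags=re.I)
        if doi:
            index[doi.group(1).strip()] = key
    return index

def resolve_citations(xml_text, bib_index):
    """Pair every <cite> pointer in the XML with the matching BibTeX key (or None)."""
    root = ET.fromstring(xml_text.strip())
    pairs = []
    for cite in root.iter("cite"):
        doi = (cite.text or "").strip()
        pairs.append((doi, bib_index.get(doi)))
    return pairs

for doi, key in resolve_citations(XML_SNIPPET, index_bib_by_doi(BIB_SNIPPET)):
    print(doi, "->", key)
```

A pointer whose DOI is missing from the .bib file resolves to `None`, which is one simple way to flag citations that still need a bibliography entry.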

There are many examples of resources for retrieving BibTeX publication references (bibliographic citations):

Below are examples of the second generation (V2) of the Distributome XML meta-data. This version decouples the meta information about distributions (nodes) and their relations (edges) from the reference citation management (using BibTeX):
Both the XML and the BibTeX files still need to be expanded, but they illustrate the integration (mapping) between distributions (nodes)/relations (edges) and the corresponding bibliographical references (citations) using the unique Digital Object Identifiers (DOI) or URL addresses specific to each publication/citation.

# Distributome Update: December 07, 2011

• Distributome Human XML DB Search. We modified the search functionality to include:
• Bibliography (Reference Manager) – we are still exploring BibJSON. Jim’s Distributome BibJSON construct looks great. We just need to figure out how to consume (parse, read, load) and produce (revise, update, modify, save) these references programmatically. We are looking for specific tools and examples of how to accomplish these 2 operations from the Distributome site/webapps. Are there open-source HTML5/JS parsers for BibJSON, and how can we tie them to the Distributome.xml DB?
• Resource Debugging: For technical users, we’ve introduced an optional debugging functionality documented here. This is the infrastructure that we’ll now be populating.
• Distributome Editor: We are in the process of implementing the user-interface for editing the XML DB meta-data in the browser. We aim to complete this in the next 1-2 months.
• Distributome Navigator Layered/Multiscale View: We are trying to simplify the Navigator view by introducing hierarchical multiscale rendering of the Distributome DB (nodes and edges). Our approach is to employ a new Distributome.xml.pref (preferences file) that allows specifying diverse run-time Navigator behaviors, incl. the hierarchy of Distributions/Relations to display. Please have a look at the current (15+15+All) list of 3-level hierarchy and let us know if we need modifications. This functionality will be live in the next few weeks.

# Distributome Update November 11, 2011

Distributome Project infrastructure developments as of November 11, 2011:

• Web-site:
• Updated the web-site and the host server (improved response time)
• Modification – we are still working on the Distributome DB Editor, a graphical interface that would allow anyone to modify, expand, correct, save and submit revisions to the DB.
• Clean-up: We’ll initiate DB clean-up (with help from students and the community) as soon as the Graphical Distributome Editor is available.
• Bibliography (Reference Manager) – we are exploring BibJSON and are trying to identify the best protocol for retrieving (call/retrieve/search) existing and committing (put/generate/make/edit/revise) new biblio/reference items.

# Distributome Update November 04, 2011

• Abstracted the old Java resources as “legacy”.
• Introduced “Tech Docs” and “License” pages. Please scan over and let me know if we need updates.
• Fixed a number of bugs in the Distributome Navigator, improved functionality, enhanced display of meta-data in properties panels (according to SAB recommendations). Still working on improving the Navigator (e.g., displaying fewer nodes/distributions, by default).

# Distributome Usability Training Video

We need to script and produce a 5-10 minute video presenting the functionality of the Distributome Infrastructure. An example script could include:

1. What is the Distributome Project? Team? History? Goals? Vision?
2. Why are Probability Distributions interesting and useful as models (math, bio, computation)?
3. The 5 Distributome Use-Cases (distribution property and inter-distribution relation exploration, sampling/simulation, computation, modeling)
4. HTML5 Navigator (Navigator functionality, browser/device agnostic)
5. Distribution Properties (nodes), Inter-distributional relations (edges), references (citations)
6. Distributome XML DB Editor (under development), include technical info about XML/XSD format
7. Distributome Web-Service (under development)
8. Community Contributions
9. Classroom utilization and learning activities
10. Future developments