More notes from RootsCampDC...

 


Notes From "Modelling for Dummies"

Led by David Boyle in a personal capacity. beglen at gmail dot com

(works at Catalist)

 

This session focused on the basics of how models are constructed, with an emphasis on understanding the modelling process and on setting forth a low-end modelling process that campaigns with limited budgets can use. While these models will not rival their costly counterparts in quality, they still represent a significant improvement over traditional targeting for two reasons: (1) they are more fine-tuned than traditional segmentation; and (2) when done properly, the models' predictions are verified against real responses, so their performance rests on more than anecdote.

 

On a basic level, modelling is the process of taking a small set of data -- typically survey results -- and then using those results to predict attitudes of an entire universe.  In a campaign context, this generally plays out as follows:

 

  1. You have a set of data on people from a voter file.  This includes information like gender, vote history, census information, and hopefully some commercial data, such as marital status, length of residence, etc.  It's important to distinguish between different groupings of data, like individual, household, census level, county level, and state level data.  Having more individual or household level data points (as opposed to census level data points) will greatly improve the quality of your model.

  2. You have a subset of info -- support IDs, or a scientific poll taken on a good number of people in your universe. Something like 5,000 IDs would be ideal. The more unbiased the sample of people on whom the IDs are based, the better. It's best for it to be as representative as possible of the complete set of people you're interested in. Try to make it balanced by region, gender, age and other things. The more unbiased the question that they were asked, the better, also.

  3. You want to take the lessons of the subset and apply them to the rest of the file.  Typically you would want to give everyone on the voter file a score from 0-100 for a yes/no question (e.g., supports your candidate, will turn out to vote, etc.) indicating the percentage likelihood that a given individual will come up positive for that question.

 

To generate model scores, you follow three steps:

 

  1. Isolate the data that you want (the dependent variable).  This is typically the data you've collected, like support scores or survey results. If you are using IDs that are stored in your voter file, consider taking a subset of the IDs that you really trust. You might want to exclude some canvassers that you don't trust, or some IDs entered from an event where everyone at the event was a supporter: they aren't a balanced set of IDs and will sway the model.

  2. Figure out what might correlate to the dependent variable.  This is a fancy way of describing the vote history, party affiliation, and census and consumer information that you have for most everybody on the file. You don't need to decide yourself what DOES correlate. The program will do this. But gather the data together that MIGHT.

  3. The program that you use to create the model will tell you which of those variables are important, and will then create a "combination of coefficients" which it will use to create your score.
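
As a rough illustration of step 1, here is a minimal Python sketch of trimming a set of IDs down to the ones you trust before modelling. The field names (source, canvasser), the record values, and the exclusion lists are all hypothetical; your voter file will have its own layout.

```python
# Hypothetical ID records; field names and values are illustrative only.
ids = [
    {"voter_id": 1, "support": 1, "source": "canvass", "canvasser": "A"},
    {"voter_id": 2, "support": 1, "source": "rally", "canvasser": None},
    {"voter_id": 3, "support": 0, "source": "canvass", "canvasser": "B"},
    {"voter_id": 4, "support": 0, "source": "phone", "canvasser": "C"},
]

# Assumed exclusion lists: a canvasser whose IDs you don't trust, and an
# event where everyone who was ID'ed was already a supporter.
UNTRUSTED_CANVASSERS = {"B"}
BIASED_SOURCES = {"rally"}


def trusted_ids(records):
    """Keep only IDs that come from trusted canvassers and balanced sources."""
    return [
        r for r in records
        if r["source"] not in BIASED_SOURCES
        and r["canvasser"] not in UNTRUSTED_CANVASSERS
    ]


training = trusted_ids(ids)
print([r["voter_id"] for r in training])
```

The remaining records are the dependent-variable data that would be handed to the modelling program along with the candidate predictor fields.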

 

To generate a model, Boyle started with a voter file sample in Excel that contained 3,000 responses from a phone survey.  He then loaded it in AnswerTree, which is an SPSS module. In AnswerTree, Boyle then ran a full CHAID (Chi-Square Automatic Interaction Detector) analysis.  AnswerTree starts with a single box containing your entire universe.  When you click on that particular box, it shows you all of your variables, from the most significant to the least.  You choose a variable with a high Chi-square score and a low p-value, and it gives you two or more branches based on that variable.  You can then repeat the process on each of the branches until you've reached the level of granularity that you're seeking.
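
To make the branching step more concrete, here is a simplified Python sketch of one level of a chi-square split. It is not AnswerTree and not full CHAID (which also merges similar categories and adjusts p-values); it only computes the Pearson chi-square statistic for each candidate variable, picks the highest, and scores each branch by its share of supporters. The survey rows and field names are made up, and ranking by the raw statistic is only a fair comparison here because both variables have two categories.

```python
from collections import Counter, defaultdict


def make_rows(party, gender, support, count):
    """Build `count` identical hypothetical survey responses."""
    return [{"party": party, "gender": gender, "support": support}
            for _ in range(count)]


# Hypothetical survey of 100 respondents (support: 1 = yes, 0 = no).
rows = (
    make_rows("D", "F", 1, 30) + make_rows("D", "M", 1, 10)
    + make_rows("D", "F", 0, 5) + make_rows("D", "M", 0, 5)
    + make_rows("R", "F", 1, 5) + make_rows("R", "M", 1, 5)
    + make_rows("R", "F", 0, 15) + make_rows("R", "M", 0, 25)
)


def chi_square(rows, predictor, target):
    """Pearson chi-square statistic and degrees of freedom."""
    observed = Counter((r[predictor], r[target]) for r in rows)
    row_totals = Counter(r[predictor] for r in rows)
    col_totals = Counter(r[target] for r in rows)
    n = len(rows)
    stat = 0.0
    for a in row_totals:
        for b in col_totals:
            expected = row_totals[a] * col_totals[b] / n
            stat += (observed[(a, b)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


def best_split(rows, predictors, target):
    """Pick the predictor with the largest chi-square statistic."""
    return max(predictors, key=lambda p: chi_square(rows, p, target)[0])


def leaf_scores(rows, predictor, target):
    """Score each branch 0-100 by the percentage of positive responses."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[predictor]].append(r[target])
    return {k: round(100 * sum(v) / len(v)) for k, v in groups.items()}


split = best_split(rows, ["gender", "party"], "support")
print(split)                               # party
print(leaf_scores(rows, split, "support"))  # {'D': 80, 'R': 20}
```

Repeating the same two functions on the rows inside each branch would grow the tree further; the final branch percentages play the role of the 0-100 score.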

 

At the end of this process, you'll have a model.  The next step -- which is very important -- is to verify your results.  Call through a number of people who had not been previously ID'ed, and see if the model seems like a good predictor of support.  The more people you call, the better. You should probably aim to call about a hundred people who the model thought would be 30% likely to support your candidate and see if around 30 of them were. Then call around 100 people who the model thought would be 70% likely to support your candidate. If around 70 of them do, then it looks like your model is a helpful predictor of actual behavior.
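
What "around 30" means can be roughed out with a little arithmetic. The Python sketch below uses the usual normal approximation to a binomial proportion to estimate how far the result of a verification call batch can drift from the predicted score by chance alone. The call results are hypothetical, and the 95% cutoff (z = 1.96) is an assumption, not something from the session.

```python
import math


def margin_pct(predicted_pct, n_calls, z=1.96):
    """Approximate sampling margin, in percentage points, for n_calls."""
    p = predicted_pct / 100
    return 100 * z * math.sqrt(p * (1 - p) / n_calls)


def bucket_is_consistent(predicted_pct, supporters, n_calls):
    """True if the observed support rate is within the sampling margin."""
    observed_pct = 100 * supporters / n_calls
    return abs(observed_pct - predicted_pct) <= margin_pct(predicted_pct, n_calls)


# Hypothetical verification calls: 100 people in each score bucket.
print(round(margin_pct(30, 100), 1))      # roughly 9 points either way
print(bucket_is_consistent(30, 33, 100))  # 33 of 100 at a 30% score
print(bucket_is_consistent(70, 50, 100))  # 50 of 100 at a 70% score
```

With 100 calls per bucket the margin is about nine points, so 33 supporters in the 30% bucket is consistent with the model while 50 supporters in the 70% bucket is not. Calling more people narrows the margin.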

 

Once your model has been verified, append it to your original voter file and start cutting better universes. Organizers and volunteers will be able to pull lists of voters who are, for example, 30%-70% likely to support your candidate for persuasion programs, and 70%+ likely to support your candidate to GOTV them.
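
A minimal Python sketch of cutting those universes from a scored file follows. The voter records and the field name support_score are hypothetical, and treating a score of exactly 70 as GOTV rather than persuasion is an assumption, since the ranges above meet at that point.

```python
def assign_program(score):
    """Map a 0-100 support score to a program, using the ranges above."""
    if score >= 70:
        return "gotv"        # 70%+ likely supporters: turn them out
    if score >= 30:
        return "persuasion"  # 30%-70%: worth a persuasion contact
    return "none"            # below 30%: not in either universe


# Hypothetical scored voter file rows.
voters = [
    {"voter_id": 101, "support_score": 82},
    {"voter_id": 102, "support_score": 45},
    {"voter_id": 103, "support_score": 12},
    {"voter_id": 104, "support_score": 70},
]

universes = {"gotv": [], "persuasion": [], "none": []}
for v in voters:
    universes[assign_program(v["support_score"])].append(v["voter_id"])

print(universes)
```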

 

Question and Answer:

 

Q: If your original survey was only of a certain demographic group or geographic location, can you still apply your model to the entire voter file?

A: Your results are only valid for the universe of your original survey.  If you only called women, your model would only be predictive of female behavior. If your IDs are only for people in Baltimore City, beware of applying a model created from them to people who live in rural areas.
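
One quick way to spot this problem before modelling is to compare which groups appear in the voter file against which appear in your IDs. The Python sketch below does that for a single field; the records, the county field name, and the min_ids threshold are all hypothetical.

```python
from collections import Counter


def coverage_gaps(voter_file, ids, field, min_ids=1):
    """List values of `field` in the voter file with fewer than min_ids IDs."""
    counts = Counter(r[field] for r in ids)
    return sorted({r[field] for r in voter_file if counts[r[field]] < min_ids})


# Hypothetical data: the file covers three counties, the IDs only one.
voter_file = [
    {"voter_id": 1, "county": "Baltimore City"},
    {"voter_id": 2, "county": "Garrett"},
    {"voter_id": 3, "county": "Allegany"},
    {"voter_id": 4, "county": "Baltimore City"},
]
ids = [
    {"voter_id": 1, "county": "Baltimore City", "support": 1},
    {"voter_id": 4, "county": "Baltimore City", "support": 0},
]

print(coverage_gaps(voter_file, ids, "county"))
```

Any value the function returns is a part of the universe the model has no evidence about; the same check can be run on gender, age band, or any other field.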

 

 

Q: Given limited resources and statistical knowledge, how do candidates get through this? 

A: Catalist provides lots of data points that can help models to be built. If you're thinking about it, contact Copernicus Analytics, Ken Strasma or other modelling firms.  They will talk to you about doing it properly.  If there isn't budget for that, some modelling companies will work with you on your data, including finding gaps to fill in, rather than conducting an entirely new poll.  If, for instance, all of your IDs focus on certain counties, they'll help you by telling you which counties you need to call into to create a valid model for your entire universe.

 

With that said, more expensive models produce better results, and building a home-grown model, or a model on the cheap, shouldn't be seen as an equal substitute for a full vendor model.

 

... but don't think that the $100,000 option is the only one out there.

