Beyond The Commons

Aaron Wherry covers all the goings-on in and around Parliament Hill. Follow Aaron on Twitter: @aaronwherry

The serious trouble of sample selection bias

by Aaron Wherry on Monday, July 5, 2010 10:15am - 0 Comments

Stephen Gordon expands on his concerns about government changes to the census.

The implications are more wide-ranging than you might think. According to popular cliché, we are evolving towards a ‘knowledge economy’ – and knowledge requires data.  And the usefulness of much of these data depends crucially on the anchor of a reliable census.

Genealogists, economists, academics and Canada’s former chief statistician, as well as the editorial boards of the Toronto Star, Montreal Gazette, Edmonton Journal and Victoria Times-Colonist, are equally concerned. But at least one libertarian is pleased.

  • http://intensedebate.com/people/LynnTO LynnTO

    And the usefulness of much of these data depends crucially on the anchor of a reliable census.

    Agreed. On some things, you can get away with a self-selected methodology (read: most online surveys) because you have non-self-selected Census data with which to refer and to which you can weight your data, in order that it is representative as possible, given the circumstances.

    Without reliable Census data, research becomes a shot in the dark: if we don't know how many households actually have six children, and we have a massive cohort of them responding to a survey (of academic, policy, or even corporate origin), we don't know how to manage that cohort's responses. There's the foreseeable result that the data could be wrong, and as a result, decisionmakers are misinformed by well-meaning but inaccurate data.
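The weighting LynnTO describes is usually called post-stratification. A minimal sketch follows, assuming made-up census shares and survey counts (every number and function name here is hypothetical, not Statistics Canada data or method): each group is weighted by its census share divided by its sample share, so an over-represented cohort counts for less.

```python
def poststratify_weights(sample_counts, census_shares):
    """Weight each group by (census share) / (sample share)."""
    total = sum(sample_counts.values())
    return {
        group: census_shares[group] / (count / total)
        for group, count in sample_counts.items()
    }


def weighted_mean(group_means, sample_counts, weights):
    """Mean of a survey measure after applying the group weights."""
    numerator = sum(group_means[g] * sample_counts[g] * weights[g] for g in sample_counts)
    denominator = sum(sample_counts[g] * weights[g] for g in sample_counts)
    return numerator / denominator


# Hypothetical figures: the census says 1% of households have six children,
# but 20% of the self-selected survey respondents do.
census_shares = {"six_children": 0.01, "other": 0.99}
sample_counts = {"six_children": 200, "other": 800}
group_means = {"six_children": 10.0, "other": 2.0}  # some survey measure

weights = poststratify_weights(sample_counts, census_shares)
unweighted = sum(group_means[g] * sample_counts[g] for g in sample_counts) / 1000
weighted = weighted_mean(group_means, sample_counts, weights)
```

Without the census shares there is nothing to compute the weights from, which is the "shot in the dark" problem described above.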

    • Guy

      It takes Statistics Canada 2 years to process the data into coherent information. By that time, the data as a means to allocate resources is useless. Enough people are born, enough die, enough move, enough marry, enough divorce and enough change careers to significantly alter the population landscape. Add to that the fact that by the time something actually gets done with that information, the next census will already have passed. The long form is unnecessary since any information it provides is stale-dated.

      As an example, the Province of Manitoba has recently reorganized its electoral districts based on the 2006 census. In my neighborhood in 2006 there were less than 200 homes. Now there are over 2000. This area will be underrepresented until a redistribution based on the 2016 census (the 2020 election!).

      Anything other than a head count is of no real use. And the head count would only be useful if it can be processed within a few weeks.

      • http://intensedebate.com/people/tedbetts tedbetts

        So the solution to having less than immediate perfect information is to… make it even less accurate. Hmmmm.

        Saying the census data is "useless" and "of no real use" is ridiculous.

        Looking at the data from census to census shows that it is incredibly useful and valuable. Looking at how often governments and private sector use the information also kinda obliterates your point.

        • Guy

          The data is used because it is there. It is not accurate in any way, shape or form. If an organization today were to use the information from the 2006 census to develop a business model, how long would that organization continue to exist? The economy has changed completely. The census only measures a point in time.

          Furthermore, there is no guarantee that the information on the long form was accurate to begin with. In 2006, several groups campaigned vigorously to have people submit incomplete, incorrect and/or inaccurate information. If you are one of those who gets a long form in 2011 and choose to submit it, it is highly likely you will be as accurate as possible with the information you provide. Therefore, coupled with increasing the number of long forms by 50%, StatsCan will receive better information. A better point of time measure will then exist. It will still, of course, be inaccurate the day after the census.

          As Canadians, we should welcome this one time the Government will respect your privacy.

          • http://intensedebate.com/people/tedbetts tedbetts

            "Not accurate in any way, shape or form". The "economy has changed completely". You just keep rolling them out don't you. Too funny.

            Doubly funny how you say the census is in no way accurate at all but at the same time that the Great Harper's change will make them better.

            Makes me think you are either a PMO troll trying to throw any argument at the subject in defence of a stupid move or you are really just clueless. Sorry, bud, but the information is used a very lot by a lot of companies that are more successful for it, let alone government policy making for which it is critical.

            The economy has not completely changed in the last two years; indeed, it has hardly changed. The only thing that has changed is that we have now 4 parties instead of 3 who are strong supporters of government stimulus, deficit spending and corporate welfare. And the census will show that.

          • Guy

            The data is accurate on the day of the census ONLY! 2 years later, the data is useless. Unemployment is up substantially from 2006, governments are running massive deficits, bank loans are harder to get. The economy is barely moving right now. We are operating in a completely different economic climate. But since you're looking at old data, you didn't notice. You just made my point.

            And you are right, we now have 4 parties that support government stimulus. If you were using the 2006 census information, you wouldn't have known that. Thanks for making my point for me again.

            Sorry, I don't work for the PMO. Heck, I'm not even a Tory!

          • http://intensedebate.com/people/tedbetts tedbetts

            Guy:

            As has been noted in several places, the most important information gathered by the census is trends. That it was 2 years later does not change the trend and the trends that get measured do not swing so violently every two years as to make them useless. If they were useless, they would not be used. Duh. Would it be better to have it more frequently? Certainly. Trends could be spotted quicker, would be even more accurate, even more immediately useful in policy making. But just because they could be better doesn't mean they are useless. To say so is quite ridiculous.

            And to think that the census is the only measure that Stats Can takes is also ridiculous nonsense. The economic ups and downs, employment ups and downs require immediate analysis of current data. Which is why the census is not relied upon for that information and why the census doesn't collect that kind of information. Double duh.

          • tedbetts

            But why am I arguing with you? You think that the entire economy is now completely changed. Ontario is suddenly no longer a manufacturing/agriculture economy. Alberta is no longer an oil economy. In fact, Canada is no longer primarily a resource based economy. You can no longer go into stores to buy consumer goods. Entrepreneurs are a thing of the past. We got rid of our banking system. No one is driven by the profit motive. Supply and demand – what the heck is that? We got rid of money too and I didn't notice it because the census data is published 2 years after collected!!! You are so right. Everything. Has. Completely. Changed.

          • Guy

            How many jobs has Ontario's manufacturing sector lost since 2006? How many Canadians have had to sell their homes? How big is Alberta's deficit? Our economy has changed substantially since 2006. Show me the information in the 2006 census that would have predicted this. The economic trends found in the 2006 census are now irrelevant.

            Now any trends found in the 2011 census from the 2006 census will also be irrelevant. The world's economy is a mess and, quite frankly, anything can happen. Relying on data that's several years old is bad practice.

            The census is an historical record. As they say in the financial sector, previous performance does not guarantee future success.

          • http://intensedebate.com/people/StephenGordon StephenGordon

            Those trends are picked up by the Labour Force Survey – which is also mandatory.

          • http://intensedebate.com/people/tedbetts tedbetts

            Our financial situation has changed substantially since 2006. The economy has not. The same industries drive the economy. Almost all of the same people are in the same industries. Most of the same companies are still there though lots have gone under. Unemployment is not too far off what it was in 2006. The same rules govern. The same motivations govern.

            You are completely messing up economies and the current financial situation. You are also completely messing up what the census records and what data is collected from elsewhere. In short, you have no idea what you are talking about and I have far better things to do. Cheerio.

          • http://intensedebate.com/people/LynnTO LynnTO

            The data is accurate on the day of the census ONLY

            Really? That you have children is accurate for one day only? That you have a university education is accurate for one day only? Give me a freakin' break.

            Unemployment measurements are taken both in and outside the Census. Economic measures are taken both in and outside the Census. Population measurements are taken both in and outside the Census. Census measurements, as I've already said, are not just about sheer numbers, but rates, and rates of change. And it's those rates that form the standard of measurement for every level of government.

          • Guy

            And given how much our economy and society have changed in the last 3 years (and yes, they have changed significantly), government should not be using that information. The previous trends have been demolished.

            Since StatsCan tracks the information in other ways, the census becomes redundant as well. Let's add to the fact each province has an office of Vital Statistics that tracks information on a DAILY basis. The census isn't worth the time or the dollars put into it.

          • http://intensedebate.com/people/LynnTO LynnTO

            And what, do you suppose, provincial statistics collections compare their data to?

          • Holly Stick

            Just another stupid decision from the stupid ignorant Conservatives. Go tell your employers in the PMO they had better reverse this dumb decision.

      • http://intensedebate.com/people/LynnTO LynnTO

        The long-form isn't usually completed by every household anyway; many questions are completed by a 10% sample. The purpose is, therefore, not entirely about sheer numbers, but about rates, and rates of change.

        Establishing whether the rate of change from the previous long-form is linear or non-linear is essential to census projections; whether or not it takes two years to use the data is not relevant. As you say, enough marry, enough divorce: the Census question is whether, while some have divorced, an equal number have married; and whether the rates of each have increased or decreased in a particular way.

        • http://intensedebate.com/people/tedbetts tedbetts

          And so Harper's change will make the data from different periods difficult if not impossible to compare since they will be measuring from different pools. It just makes no sense.

          • http://intensedebate.com/people/LynnTO LynnTO

            Given a 10% sample in any given census year, the idea is that, using random sample selection, the same 10% would not be surveyed in consecutive iterations. Now, that's not always practically true, but that's the general idea. In a 10% sample, the laws of statistics say that there'd be little margin of error, so having independent samples each time is not a significant issue, if it were to happen.

            The issue becomes when that random sample selection is affected by response bias or self-selection (et cetera). Then, the sample isn't truly random anymore, which affects a statistician's ability to draw conclusions with the degree of confidence that they otherwise would.
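To put rough numbers on the "little margin of error" claim, here is a minimal sketch of the textbook margin of error for a proportion under simple random sampling, with a finite population correction. The household count is a hypothetical round figure, and the function name is illustrative. Note that the formula covers sampling error only; it says nothing about the response bias or self-selection described above.

```python
import math


def margin_of_error(p, n, population, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated from a
    simple random sample of size n drawn from a finite population."""
    standard_error = math.sqrt(p * (1 - p) / n)
    finite_correction = math.sqrt((population - n) / (population - 1))
    return z * standard_error * finite_correction


# Hypothetical: 12 million households, 10% of them sampled at random.
households = 12_000_000
moe_large = margin_of_error(0.5, households // 10, households)

# For comparison, a typical 1,000-person poll of the same population.
moe_poll = margin_of_error(0.5, 1_000, households)
```

Under these assumptions the 10% sample gives a margin well under a tenth of a percentage point, against roughly three points for the poll. That precision only holds if the sample stays random.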

          • http://intensedebate.com/people/tedbetts tedbetts

            Exactly. Not exactly comparing apples to oranges, but macintoshes to granny smiths perhaps.

          • Guy

            But do we need the information in the first place? Since the statistics can be reasonably interpreted in many ways, or ignored, governments can choose what information it needs to justify any "government policy". Now replace the words "government policy" with "social engineering", since they are the exact same thing.

            Now, do you remember the "government policy" enacted in Europe in the early 1940's? This is the extreme end of the spectrum, but think about it. No good can come of any government holding detailed information about its citizens.

            Your freedom is much more important than any information.

          • http://intensedebate.com/people/LynnTO LynnTO

            But do we need the information in the first place?

            The government uses that information to make projections for how much budget items might cost. Want universal health care? How much is that going to cost, and how would you determine "cost"? Not simply by the number of people who exist in the province, but by their age, dwelling type, number of persons in the household, et cetera. So yeah, if we want to come up with any sort of reasonable estimate for how much programs are going to cost, we need as much information as possible.

            Not so the government can hunt you down if you don't have blue-eyed children, but so the government knows how much of what you pay in tax dollars is likely to go out the door, and where. This isn't about freedom, or privacy, at all. It's about getting reliable information so the government can make reasonable estimations. Your paranoia reminds me of Tommy from Snatch.

          • Lord Kitchener's Own

            It could be worse. In the U.S., the right wingers got people so riled up about the Census that a few Census enumerators were actually SHOT AT.

          • Guy

            If I'm a health minister in a province, why would I use census information, that is several years old and just a sampling, when I could get the complete population information, that has been updated daily, from my provincial bureau of Vital Statistics? Seems to me that 100% information, with a very high probability of being accurate, would work better for me than a sampling where maybe 25% of the information received was incorrect? Why would I build a hospital based on obviously useless information?

          • Holly Stick

            I believe the provinces get much of their information from Stats Canada. The stupid ignorant Conservatives are not just stopping one government program, they are tearing the whole infrastructure down and doing serious harm to Canada.

            Let's get rid of the fools!

          • http://intensedebate.com/people/LynnTO LynnTO

            If this is any indication, offices of Vital Statistics do not collect education, work status, dwelling type, and household size (since they don't really track beyond birth/marriage/death).

            As a Health Minister, if I rely on "vital statistics" alone, I'm using incomplete information to make my estimates.

            Furthermore, survey sampling can be likened to a blood test: when your doctor is diagnosing you, one of the tools at his or her disposal is a blood test. When they conduct this test, they draw a sample of your blood – not all of it – to review the details of what's going on in your blood, which helps them to diagnose if you've got anemia, leukemia, or just a common cold. The same applies to survey sampling: ya don't have to knock on every door to have a pretty good idea of the detailed answers you'll find behind it.

            And this notwithstanding, longform questions are also asked on other – mandatory – StatCan surveys. Altering the methodology to self-selection compromises the comparability of that data.

          • Guy

            But if the doctor suspects lung cancer, an x-ray is done of the whole lung, not just a corner of it…blood is homogeneous. A population is not and even less so over time.

            And items such as education, work status, dwelling type and household size are not determining factors for building a hospital. Population size, age and birth/death rates are. And those can be determined from vital statistics. And, information not included in any census (i.e. location of nearest hospital, construction costs, election promises) are even greater factors in the determination.

            As for comparability, as a recipient of the long form for 4 consecutive censuses (so much for random, as I have moved 3 times during that period), I can tell you with 100% certainty (with no sampling error or fudging) that each form was different. Enough so that any comparisons and/or extrapolations should not be taken too seriously.

            Finally, it is a sad day indeed when people think that someone's privacy is irrelevant. So, sending me to jail for refusing to tell StatsCan and its pencil-pushing bureaucrats how many toilets I have in my house is good because…

          • http://intensedebate.com/people/LynnTO LynnTO

            Population size, age and birth/death rates are. And those can be determined from vital statistics

            Not as long as we have first generation immigrants who weren't born here, they can't. Household size is most certainly relevant, because if one person in a household of twelve contracts a communicable illness, the eleven other people become a higher risk. Similarly, if the household size is two, the risk falls to only one other person (all other factors, such as out-of-home interaction, being equal). Work status is an indicator of whether or not a person is home during the day, and would therefore make more use of services proximate to their home or workplace. And, to cite one particular example of many, education, income, and health are linked.

            Privacy is important – but the purpose of collecting this somewhat sensitive data is not to target you as an individual; in aggregate, it is to inform public policy and public spending.

          • Guy

            For example, anyone who moves into my province has to register with the health authority and apply to vital statistics for identification. So, first generation immigrants are tracked by vital statistics.

            Household size is irrelevant in your example. You can use the H1N1 outbreak as evidence. The determining factors were who was most susceptible and what were the logistics in the distribution of the vaccine (with family in community health I experienced this first hand). Relying on 3 year old data would have been useless (and potentially fatal to some); the provincial government relied on its health database, which is maintained on a daily basis.

            Since information exists elsewhere federally (and federal agencies, by law, must supply StatsCan with requested information) and provincially, the census has become redundant. Better information is available sooner. It makes more sense to use it than attempt to duplicate it.

            As for services, demand is what will decide what services are offered and when. Only a bureaucrat would suggest that services should not be offered at the convenience of the user, only at the convenience of the supplier. This explains the proliferation of walk-in clinics and non-emergency urgent care facilities open past 5 PM and on weekends, at least in my city.

          • http://intensedebate.com/people/LynnTO LynnTO

            Also, the principle of the metaphor is that you don't take the whole when the part will show you what you need and want to know – it would be intrusive, expensive, and probably not even necessary.

            You can use this as an argument to abolish the long-form altogether, but not as one to justify making it "voluntarily opt-in"; you don't screw with a sound methodology and then proclaim that you're smarter for it. Either do it, or don't do it at all.

          • Guy

            Part will only do if you are taking information from a homogeneous sample. The population is not homogeneous over time.

            Would you think it would be appropriate for your doctor to diagnose you based on a 3 year old blood sample? This is what you are working with using the census information. Better information exists, use it instead.

            You are also assuming the following:

            1) All information collected is 100% accurate. Given the organized effort in 2006 to encourage Canadians to give incorrect information on the census, it is highly unlikely that this is so.

            2) Distribution of the long form was random. I have received the long form in each of the last 4 censuses. I have moved 3 times. I know of others who have received the long form multiple times. The distribution of the forms is systemically flawed.

            3) All long forms are returned. It has been estimated that 5% of all the long forms were not returned, mostly by aboriginals and the poor. Given that the target for census based social planning are most often these groups, the data received is incomplete.

            Statistical methodology is fine, but if the information received is of poor quality, the results are meaningless. This is a significant problem in our society, governments at all levels are not properly executing social programming. Using the census as an information base is quite obviously causing problems.

          • http://intensedebate.com/people/LynnTO LynnTO

            In all of your arguments, you've presumed that the statisticians at StatsCanada lack the ability to add, multiply, or otherwise analyze data to project trends over time.

            They don't. In fact, they have it in spades. The head of StatsCan has been in the business for 40 years. Other economists and academics and people who stare at numbers all day agree with him. Your arguments may be valid, but I don't see how they still stand up to the reality that flawing your methodology does NOT improve the result. Saying that you're flawing your methodology in the name of protecting the privacy of individuals who, by and large, share that information in good faith anyway, doesn't make sense. And, you can say that statistics are out of date effective the day after they were gathered, but that doesn't counter the ability of StatsCan data to reflect general population outcomes over time.

          • http://intensedebate.com/people/LynnTO LynnTO

            And you can rail against randomization all you like, but it works. Randomization works. It's worked for 60 years. If you think targeting and cycling is a better way to go about it, test it out and put a prospectus together for the guys in Ottawa, I'm sure the statisticians would love it (and no, I'm not being glib, I actually think they'd love to try something new if it has a shot of working).

  • Lord Kitchener's Own

    I'm trying to figure out if this move by the government is one of shocking ignorance (i.e. they don't realize the huge problems they're going to cause) or shocking malevolence (i.e. they know EXACTLY what they're doing, and are more than happy to put the government in a better position to manipulate the populace without being burdened by troublesome "facts" and "objective reality").

    Either way, this is a pretty bad idea.

    • http://intensedebate.com/people/tedbetts tedbetts

      Why do you rule out both?

      • Lord Kitchener's Own

        Huh. Malicious ignorance. You're right, it could be that.

        • http://intensedebate.com/people/tedbetts tedbetts

          Ah, from two possibilities to four: ignorance? maliciousness? malicious ignorance? or, and this is what I think is closest, ignorant maliciousness?

          • http://intensedebate.com/people/Halo_Override Halo_Override

            Malignorance.

    • http://intensedebate.com/people/PolJunkie PolJunkie

      I go for shocking ignorance…

    • http://intensedebate.com/people/madeyoulook madeyoulook

      Malevolent shock? No, that's for Polish visitors to YVR.

      • Lord Kitchener's Own

        Too soon.

        • Sandra Finley

          (1 of 2) I am on trial; I didn't fill in the 2006 census – trial continues Sept 9th. This mess arises because Public Works & StatsCan out-sourced census work to Lockheed Martin Corporation – the American military/Pentagon. Lockheed helped decide to launch an illegal war of aggression on Iraq. They have been manufacturers of land mines & cluster munitions, both in contravention of International & Canadian law. They have a long list of court convictions and are well-known for procurement fraud (bilking tax-payers). They spend millions on lobbying & political contributions, rewarded with Govt contracts.

    • Sandra Finley

      (2 of 2) I may have won the court case on a legal argument: the Charter of Rights & Freedoms does not allow Govt to coerce (jail time & a fine) citizens into handing over a "biographical core of personal information" (the long form). Democracies uphold that right because of historical record of abuse in militaristic states – I recommend you read "IBM & the Holocaust". LEGAL argument aside, the MORAL argument against giving tax-payer money to Lockheed Martin is more important than charter right. Lockheed is responsible for death and destruction in untold numbers. History will provide the label “evil”. Why would we allow Lockheed Martin into our country? Or the American military? Both should be on trial for murder, along with the Bush Administration. As long as Lockheed Martin is involved in the Canadian census, I will not fill in a census form. (Lockheed is also into international surveillance. Gaining access to the Canadian census data base would be quite convenient.)

  • http://intensedebate.com/people/madeyoulook madeyoulook

    As I said on Aaron's earlier post, the ONLY thing that makes sense is the personal privacy angle. But if they really meant it, they would have canned the long form entirely, not expanded its reach while making it optional. It truly is the worst of both worlds.

    • http://intensedebate.com/people/TJCook TJCook

      Funny, our government seems to have mastered the "worst of both worlds" angle on the policy front.

  • http://intensedebate.com/people/Stewart_Smith Stewart_Smith

    Does anyone know if someone opting out of the long form is still required to fill out the short form?

    • Lord Kitchener's Own

      My understanding was that the short form would still be mandatory.

  • Anon 001

    Yes, but how dare they challenge the intelligence of the one and only Tony Clement, Porkmaster General of Canada?

  • Greg

    It's being done for the "first the government came for my long gun" wing of the party. Just like everything else this government does.

    • http://intensedebate.com/people/tedbetts tedbetts

      Bingo.

    • John D

      Don't you wish that your life was so great that your biggest complaint was "I have to fill out a census form"?

  • Dee

    The Conservatives haven't been pro-knowledge from the start. Why would they treat Statistics Canada any different than the experts they didn't listen to regarding nuclear safety, GST cuts, climate change, environmental protection, research and development policy,… and on and on… The word "philistines" comes to mind.

    • http://intensedebate.com/people/tedbetts tedbetts

      The conservative war on science has been well documented and is not, as some have with good intentions but mistakenly concluded, just a religious conservative war on science.

      • Holly Stick

        The religious conservative war on science is part of it; but there is also the servile kowtowing to the greedy foreign oil megacorporations and to big business in general. Harper is still chasing GW Bush's fantasy that liars can create their own reality and impose it on the country. Fake lake, anyone?

  • Holly Stick

    Like the old Smothers Brothers routine: richer people always have more clothes on. Most of us are less-ons. So who's running the country? The more-ons.

  • Reader

    Our government doesn't need a census. They will simply listen in on your phone calls, intercept your on-line communication…

    • http://intensedebate.com/people/LynnTO LynnTO

      According to their own privacy laws, they can't do that.

      Besides, a mail-back Census is much more efficient than spying on 34,000,000 Canadians.

      • Guy

        Bingo! Now if you don't return the census, they have even less to go on!

  • http://intensedebate.com/people/Geiseric Geiseric

    Fascism is not without its efficiencies.

    • http://intensedebate.com/people/Tridus Tridus

      Oh if only… but this will actually be even more expensive than the old method.

      Spending more money to get less results. That's the Conservative way!

  • hosertohoosier

    I wholeheartedly agree that the census is necessary for good government. However, if people do have concerns about privacy, perhaps there are ways to address these in a way that does not impact the quality of census data. Indeed, dealing with privacy concerns is ALSO vital to ensuring the quality of data. People are more likely to be honest if they know that their data will be kept private.

    • http://intensedebate.com/profiles/datalibrarian datalibrarian

      Statistics Canada takes their responsibility to keep your data private *very* seriously. No one who is not a STC employee gets to see any personally identifiable information — not even other government employees.

  • http://intensedebate.com/people/stats4U stats4U

    Stephen Gordon refers to the voluntary nature of the Census long form as leading to sample selection bias. While sample selection bias is never a good thing, this is not correct and he confuses sample selection bias for non response bias. Sample selection bias does not occur with random selection from the target population.

    The problem with a voluntary survey of this size (1/3 of households would be randomly selected) is that it will lead to significant non-response. This non-response is potentially benign if the profile (or distribution) of non-responders is the same as that of those who responded in terms of the observed variables. Meaning, the non-responders would have given similar answers to the questions as those who responded. Unfortunately, all statisticians know that non-responders are different. The Harper government and, unfortunately, the current chief statistician Munir Sheikh do not even understand this elementary principle of survey design…before making this decision and potentially wasting taxpayer dollars, a good statistician would conduct a test to measure and develop a strategy for minimizing non-response bias.
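The non-response problem stats4U describes can be illustrated with a small simulation. This is a sketch under invented assumptions, not a model of the actual census: the population values, the two response rates, and the function name are all hypothetical. The sample is drawn at random in both cases; the only difference is whether everyone selected answers.

```python
import random


def simulate_nonresponse(seed=0, population_size=100_000, sample_size=20_000):
    """Compare a mandatory random sample with a voluntary one in which
    willingness to respond depends on the value being measured."""
    rng = random.Random(seed)
    # Hypothetical population: about half the households score 30, half score 70.
    population = [30.0 if rng.random() < 0.5 else 70.0 for _ in range(population_size)]
    true_mean = sum(population) / len(population)

    # Random selection from the target population (no sample selection bias).
    sample = rng.sample(population, sample_size)
    mandatory_mean = sum(sample) / len(sample)

    # Voluntary response: assume low scorers answer 40% of the time,
    # high scorers 80% of the time.
    respondents = [v for v in sample if rng.random() < (0.4 if v == 30.0 else 0.8)]
    voluntary_mean = sum(respondents) / len(respondents)
    return true_mean, mandatory_mean, voluntary_mean


true_mean, mandatory_mean, voluntary_mean = simulate_nonresponse()
```

Under these assumptions the mandatory estimate lands close to the true mean, while the voluntary estimate drifts several points toward the group that is more willing to answer. Selecting more households does not remove that drift, because the extra respondents are skewed in the same way.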

From Macleans