Using census, social security and tax data from the Multi-Agency Data Integration Project (MADIP) to impute the complete Australian income distribution

TTPI Working Paper 8/2021

Authors: Nicholas Biddle & Dinith Marasinghe (Australian National University)

The distribution of income, income dynamics, and how observable characteristics predict an individual’s position in the income distribution are all core aspects of economics and social science research, and of keen interest to policy makers. Researchers approach these topics using a combination of cross-sectional surveys, panel studies, and administrative datasets. In Australia, all three types of datasets have been used to help answer such questions, although no single dataset is free of limitations in sample size, sample representation, quality of income data, or longitudinal availability. A relatively new dataset – the Multi-Agency Data Integration Project (MADIP) Basic Longitudinal Extract (BLE) – has the potential to extend our knowledge of income in Australia by combining income-related data from a targeted survey (the 2011 or 2016 Census of Population and Housing), income tax records at the individual level, and information on access to social security. Each of these sources has limits when used on its own. However, one way to overcome those limits is to combine the sources on the MADIP BLE to create a synthetic income measure for each individual. For 2011, this is a relatively straightforward exercise, as there are three sources of information for each individual. For the other years, though, there are only two sources of information: Personal Income Tax (PIT) data and Social Security and Related Information (SSRI). To overcome this limitation, we borrow information from the first wave of data (2011) to help estimate income for the remaining years (2012-16). After testing nine machine-learning approaches using a training and test dataset from the MADIP BLE 2011, we were able to generate a synthetic income measure that performed far better than either tax or census data alone in matching the income distribution in the Household, Income and Labour Dynamics in Australia (HILDA) Survey. The measure was also able to capture income dynamics reasonably well, although it somewhat understates them.
This new synthetic income data is available for further analysis for over 15 million individuals, compared to only around 17,000 for HILDA and even fewer for other sample surveys.
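
The idea of borrowing information from 2011 can be illustrated with a minimal sketch. It is not the authors' method: plain least squares stands in for the machine-learning approaches compared in the paper, the figures are made up, and all names (`fit_least_squares`, `predict`, `train_2011`, `record_2012`) are hypothetical. A model is fitted on 2011 records, where a census income value is available alongside tax and social security amounts, and is then applied to a later-year record that has only the two administrative sources.

```python
# Illustrative only: toy data in $'000, hypothetical names, and ordinary
# least squares as a stand-in for the models tested in the paper.

def fit_least_squares(rows, targets):
    """Fit y = b0 + b1*x1 + ... by solving the normal equations (X'X)b = X'y."""
    X = [[1.0, *row] for row in rows]  # prepend an intercept column
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(X, targets)) for i in range(k)]
    aug = [row + [rhs] for row, rhs in zip(xtx, xty)]  # augmented matrix
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def predict(coefs, row):
    """Apply the fitted coefficients to one record's features."""
    return coefs[0] + sum(c * x for c, x in zip(coefs[1:], row))


# 2011: (tax income, social security payments) with census income as the target.
train_2011 = [(50, 0), (0, 20), (30, 5), (80, 0), (10, 15)]
census_income_2011 = [52, 20, 36.5, 82, 25.5]

coefs = fit_least_squares(train_2011, census_income_2011)

# 2012: only the two administrative sources are observed for this record.
record_2012 = (40, 2)
income_2012 = predict(coefs, record_2012)
print(f"Imputed 2012 income ($'000): {income_2012:.1f}")
```

In practice the model would be chosen by comparing candidates on a held-out portion of the 2011 data, as the abstract describes, rather than fixed in advance as it is here.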
