Matching Data with Optimal Transport: Theory and Application to Income Data


Researchers in economics would often benefit from combining information from distinct datasets, for example records from an administrative source and survey data. Combining them is necessary to estimate the joint distribution of variables that are not jointly observed in any single dataset. Methods commonly used in the literature have important limitations: either they discard information by reducing the dimensionality of the matching, or they do not preserve the multivariate distributions of the variables imported from each dataset. This is especially problematic for studies that use the combined dataset to construct measures of inequality. This paper details a statistical matching method based on optimal transport theory that does not suffer from these drawbacks. Using data from the Current Population Survey and the IRS Public Use Files, I compare this approach to other methods. I show that the synthetic dataset built with the optimal transport matching method yields higher measures of income inequality.