Lexical variation on social media is a growing research area, but it is often complicated by the messy nature of social media data, which makes it hard to control for different explanatory factors and to know whether results obtained on one user sample generalise to another. Another outstanding methodological challenge in this area is the bottom-up discovery of sociolinguistic variables. In contrast with traditional sociolinguistic studies, most large-scale social media studies to date have compared the frequencies of individual terms across contexts. If we instead analyse the relative frequencies of different realisations of the same lexical variable, we can be more confident that any effects we observe are effects on how people choose to refer to things, and not on which things they choose to refer to. The contributions of this thesis are two large-scale studies of factors that condition lexical variation in Scottish tweets, and a system to facilitate efficient, data-driven curation of lexical sociolinguistic variables.