Monday, December 14, 2015

Professional Blog

Uses of Generalized Linear Models for Linear Regression

Jared Korotney
December 14, 2015

            If you work, or plan to work, as a statistician or in another job that involves data and statistical analysis, you should be familiar with generalized linear models. They extend ordinary linear regression to response data that are not normally distributed, such as binary outcomes, counts, and strictly positive quantities (e.g., the population of a country).
            The concept of generalized linear models (GLMs) was formulated by John Nelder and Robert Wedderburn as a way to bring various types of models together, such as linear regression, logistic regression, and Poisson regression. As a form of regression analysis, a generalized linear model treats each value of the dependent variable as having been generated from a particular type of distribution, and the mean of that distribution depends on the independent variables.
            A GLM consists of three components. The first is a probability distribution for the response, taken from the exponential family. More generally, it can come from the "overdispersed exponential family of distributions," which generalizes the exponential family and the exponential dispersion model by including a dispersion parameter that is typically related to the variance of the distribution. The second is a linear predictor, a linear combination of the independent variables and unknown coefficients, which is how the independent variables enter the GLM. The third is a link function, which connects the linear predictor to the mean of the response: applying the link function to the mean gives the linear predictor. Many link functions are available, and one is usually chosen so that its domain matches the range of the mean. For example, a log link keeps the mean of a count positive.
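            The three components described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a Poisson response with a log link; the function names, coefficients, and predictor values are made up for the example and do not come from any data set or library.

```python
import math

# Hypothetical coefficients and one observation's predictors (illustrative values only).
beta = [0.5, 0.2]   # intercept and slope
x = [1.0, 3.0]      # the leading 1.0 multiplies the intercept

def linear_predictor(beta, x):
    """Component 2: eta = beta_0 * x_0 + beta_1 * x_1 + ..."""
    return sum(b * xi for b, xi in zip(beta, x))

def log_link(mu):
    """Component 3: link function g(mu) = log(mu), defined for a positive mean."""
    return math.log(mu)

def inverse_log_link(eta):
    """Inverse link: mu = exp(eta), which is always positive."""
    return math.exp(eta)

def poisson_pmf(y, mu):
    """Component 1: the assumed response distribution, Poisson with mean mu."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

eta = linear_predictor(beta, x)   # 0.5 + 0.2 * 3.0, about 1.1
mu = inverse_log_link(eta)        # mean of the response implied by eta
```

            Applying log_link to mu recovers eta, which is the sense in which the link function ties the mean to the linear predictor.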
            Several extensions to the GLM should be noted. The standard GLM assumes that the observations are uncorrelated, which is often not true of clustered data. Extensions have been developed to allow for correlation between observations, such as repeated measurements on the same subject. A second extension is the generalized additive model (GAM), in which the linear predictor is not restricted to being linear in the independent variables and may instead be a sum of smooth functions of them.
            The GLM is often confused with the general linear model (general, not generalized), partly because both are sometimes abbreviated in the same way. The general linear model is a generalization of multiple linear regression to more than one dependent variable, and it assumes normally distributed errors. The biggest difference is that the GLM drops the normality requirement: it extends the general linear model to response variables that follow any distribution in the exponential family, with a link function relating the mean to the linear predictor. The general linear model can therefore be viewed as a special case of the GLM, with a normal distribution and an identity link, which is also why ordinary linear least-squares models fit within the GLM framework.
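            The relationship between the two models can be written out side by side. The notation below is an assumption made for illustration (a single response y_i, a predictor vector x_i, coefficients beta, and a link function g); it is the standard textbook form rather than anything specific to this post.

```latex
\begin{align*}
\text{General linear model:} \quad
  & y_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + \varepsilon_i,
  && \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \\
\text{Generalized linear model:} \quad
  & g(\mu_i) = \mathbf{x}_i^{\top}\boldsymbol{\beta},
  && \mu_i = \operatorname{E}[y_i],\ y_i \sim \text{exponential family}
\end{align*}
```

            Choosing the normal distribution and the identity link g(mu) = mu in the second line gives back the first.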
            Fitting a GLM involves estimating its parameters and testing hypotheses about them. The parameters are usually estimated by the "maximum likelihood method," a general technique for estimating the parameters of a statistical model. The method selects the parameter values that maximize the likelihood function, which is the probability (or probability density) of the observed data, viewed as a function of the model parameters.
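            To make maximum likelihood estimation concrete, here is a rough sketch that fits a Poisson regression with a log link by Newton's method, using only the Python standard library. The data are invented for illustration, the function name fit_poisson is hypothetical, and real statistical software uses more general and more careful routines than this.

```python
import math

# Made-up illustrative data: one predictor x and a count response y.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1, 2, 4, 7, 12]

def fit_poisson(x, y, max_iter=25, tol=1e-10):
    """Maximize the Poisson log-likelihood for log(mu) = b0 + b1 * x by Newton's method."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0   # start from the intercept-only fit
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score: gradient of the log-likelihood with respect to (b0, b1).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        # Information matrix (negative Hessian), a symmetric 2x2 matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for mi, xi in zip(mu, x))
        h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information) * step = score.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1

b0, b1 = fit_poisson(x, y)
fitted = [math.exp(b0 + b1 * xi) for xi in x]   # estimated means at each x
```

            At the maximum, the score is zero, so the fitted means add up to the observed total count; checking that is a quick way to confirm the iteration has converged.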
            Today, statistical software packages fit many widely used models as generalized linear models, the main ones being logistic regression, logit models, multiway frequency analysis, and Poisson regression. In logistic regression, a binary response variable is modeled with a logit link function. Logit models likewise treat the responses as "binomial random variables," that is, variables that follow a binomial distribution. In multiway frequency analysis, the response (a cell count in a contingency table) is typically modeled as a Poisson random variable, as it is in the Poisson regression model.
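            As a small illustration of the logit link used in logistic regression, the sketch below defines the link and its inverse and evaluates the log-likelihood of a few binary responses. The function names and the example values are assumptions chosen for the illustration, not part of any particular library.

```python
import math

def logit(p):
    """Logit link: maps a probability in (0, 1) to the whole real line."""
    return math.log(p / (1.0 - p))

def inverse_logit(eta):
    """Inverse link: maps any linear predictor value back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def bernoulli_log_likelihood(y, p):
    """Log-likelihood of binary responses y given success probabilities p."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
               for yi, pi in zip(y, p))

# Hypothetical linear predictor values for three observations.
etas = [-2.0, 0.0, 1.5]
probs = [inverse_logit(e) for e in etas]   # each lies strictly between 0 and 1
loglik = bernoulli_log_likelihood([0, 1, 1], probs)
```

            Because the inverse logit always returns a value strictly between 0 and 1, the fitted probabilities stay valid no matter how large or small the linear predictor becomes.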