glm
The glm operator estimates a generalized linear model. Available only in the Enterprise Edition.
Synopsis
glm(matrix, response, weights, distribution, link)
Library
The glm operator resides in the Linear Algebra library. Run this query to load the library:
AFL% load_library('linear_algebra');
Summary
The glm operator estimates a generalized linear model that models the conditional expectation of a random variable Y given a matrix X as a function of a linear combination of the columns of the matrix.
In particular, let Y represent a random variable with a distribution D. Let X represent a model matrix. For certain cases of D, define the generalized linear model:
where E(Y|X) means the conditional expectation of Y on the model matrix X. The invertible function f is called a link function, and the vector b contains the model coefficients.
For more details of the GLM, see Wikipedia's article on Generalized linear model.
Inputs
- matrix: <double>[r=..., c=...]: The input r x c model matrix, consisting of a single double-precision attribute. The number of columns must be less than or equal to the number of rows (c<=r). The glm operator requires that all of the columns are contained in a single chunk, with no chunk overlap, and assumes that the model matrix has full column rank. The columns of the model matrix represent model variables and may include data variables, encodings of factor variables, an intercept term and others.
response: <double>[r=...]: The response vector of length r is a measurement of the random variable Y in the definition of the generalized linear model stated above. The glm operator requires the response vector to be a 1-D SciDB array containing a single, double-precision attribute whose dimension schema length, chunk size, and overlap must match those of the row dimension schema of the model matrix.
weights: <double>[r=...]: A vector of length r with user-supplied nonnegative weights for each row of the model matrix. The glm operator requires that the weights vector is a 1-D SciDB array containing a single, double-precision attribute whose dimension schema length, chunk size, and overlap must match those of the row dimension schema of the model matrix. If you do not have a weights vector, construct an r length vector, on-the-fly, where all the values are 1. For example, assuming r=10,000, and a model matrix chunk size of 1000, you could call glm as follows:
%AFL glm(mMatrix, response, build(<val:double>[r=0:9999:0:1000],1), 'gamma', 'log');
distribution: string: The distribution of the random variable Y. The response term is a measurement of this random variable. The following values for the distribution are allowed: gaussian, poisson, binomial, gamma.
link: string: The optional link function f discussed earlier. The following values for the link function are allowed: identity, inverse, log, logit, probit. However you cannot use every link function with every distribution. This table shows the accepted combinations:
identity log inverse logit probit guassian yes yes yes no no poisson yes yes no no no binomial no yes no yes yes gamma yes yes yes no no If a distribution is specified but the link function is not specified, a default link function is used. The default link function for the corresponding distribution function is listed in bold in the table above. To use the default, specify an empty string, '', as the argument for the link function.
Outputs
The glm operator returns a 2-D SciDB array with 18 rows and c columns, where c is the number of columns of the input model matrix. Each row of the output matrix represents a distinct output element as follows:
Output Row | Description | Value |
---|---|---|
row 0 | Computed model coefficients (aka beta). The column position corresponds to columns of the input model matrix. | vector |
row 1 | Model coefficient standard errors. | vector |
row 2 | Model coefficient score values. | vector |
row 3 | Model coefficient p values. | vector |
row 4 | Dispersion. | scalar |
row 5 | Residual degrees of freedom for the null model, not including an intercept term. If you included an intercept term in your model matrix, subtract one from this value. | scalar |
row 6 | Residual degrees of freedom. | scalar |
row 7 | Total number of available observations. | scalar |
row 8 | Number of nonzero-weighted observations. | scalar |
row 9 | A value of 1 indicates that the method converged, otherwise not. | scalar |
row 10 | Number of zero-valued user-supplied weights. | scalar |
row 11 | Akaike's "an information criterion" (AIC) value. | scalar |
row 12 | Null deviance. | scalar |
row 13 | Maximized log likelihood value. | scalar |
row 14 | Residual deviance. | scalar |
row 15 | Residual sum of squares. | scalar |
row 16 | Tolerance. | scalar |
row 17 | Number of method iterations. | scalar |
Limitations
The glm operator's inputs are restricted as follows:
- All the columns of the model matrix must be in a single chunk.
- Chunk overlap is not supported – specify 0 for the chunk overlap when creating the inputs.
- All inputs require a single, double-precision attribute.
- The chunking scheme for the rows of the model matrix must match exactly that of the response and weights vectors.
Examples
To demonstrate the glm operator, do the following:
Enter:
AFL% store( redimension( -- project() will select only the attributes val, r, and c. project( -- apply() adds new attributes val, r, and c. apply( -- rng_uniform fills an array with random numbers. rng_uniform( <val:double>[i=0:49:0:50],0,1,'drand48',1000 ), val, double(int64(rng_uniform*10+1)), r, i/5, c, i%5 ), val, r, c ), <val:double NOT NULL>[r=0:9:0:10; c=0:4:0:5] ), A_glm1 );
The output is:
{r,c} val {0,0} 1 {0,1} 10 {0,2} 5 {0,3} 1 {0,4} 1 {1,0} 10 {1,1} 9 {1,2} 7 {1,3} 4 {1,4} 3 {2,0} 4 {2,1} 9 {2,2} 5 {2,3} 1 {2,4} 6 {3,0} 8 {3,1} 8 {3,2} 2 {3,3} 5 {3,4} 2 {4,0} 7 {4,1} 6 {4,2} 3 {4,3} 2 {4,4} 4 {5,0} 4 {5,1} 7 {5,2} 5 {5,3} 9 {5,4} 8 {6,0} 10 {6,1} 8 {6,2} 5 {6,3} 4 {6,4} 6 {7,0} 5 {7,1} 7 {7,2} 4 {7,3} 3 {7,4} 2 {8,0} 6 {8,1} 1 {8,2} 7 {8,3} 10 {8,4} 9 {9,0} 5 {9,1} 3 {9,2} 2 {9,3} 9 {9,4} 2
Enter:
AFL% store( project( apply( rng_uniform(<val:double NOT NULL>[r=0:9:0:10],0,1,'drand48',4350), val, rng_uniform), val), A_glm2);
The output is:{r} val {0} 0.389679 {1} 0.710839 {2} 0.523507 {3} 0.415214 {4} 0.570819 {5} 0.837386 {6} 0.17413 {7} 0.393471 {8} 0.110539 {9} 0.412687
Enter:
AFL% glm(A_glm1, A_glm2, build(<val:double NOT NULL>[r=0:9:0:10],1), 'gaussian', 'identity');
The output is:{r,c} value {0,0} -0.0101387 {0,1} 0.0619054 {0,2} -0.0195615 {0,3} 0.0313987 {0,4} 0.00769056 {1,0} 0.0283337 {1,1} 0.0266538 {1,2} 0.0566258 {1,3} 0.0277375 {1,4} 0.0412659 {2,0} -0.357832 {2,1} 2.32258 {2,2} -0.345452 {2,3} 1.13199 {2,4} 0.186366 {3,0} 0.735079 {3,1} 0.0678355 {3,2} 0.743813 {3,3} 0.308987 {3,4} 0.859483 {4,0} 0.0607375 {5,0} 10 {6,0} 5 {7,0} 10 {8,0} 10 {9,0} 1 {10,0} 0 {11,0} 5.43536 {12,0} 0.438733 {13,0} 3.28232 {14,0} 0.303687 {15,0} 0.303687 {16,0} 0 {17,0} 3
Remove them by entering:
AFL% remove(A_glm1);
andAFL% remove(A_glm2);