The glm operator estimates a generalized linear model. It is available only in the Enterprise Edition.

Synopsis

glm(matrix, response, weights, distribution, link)

Library

The glm operator resides in the Linear Algebra library. Run this query to load the library:

AFL% load_library('linear_algebra');

Summary

The glm operator estimates a generalized linear model that models the conditional expectation of a random variable Y given a matrix X as a function of a linear combination of the columns of the matrix.

In particular, let Y represent a random variable with a distribution D, and let X represent a model matrix. For certain cases of D, the generalized linear model is defined as:

    f(E(Y|X)) = Xb, or equivalently E(Y|X) = f^{-1}(Xb)

where E(Y|X) is the conditional expectation of Y given the model matrix X. The invertible function f is called the link function, and the vector b contains the model coefficients.

For more details on generalized linear models, see the Wikipedia article "Generalized linear model".
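
As a reference for the names accepted by the link argument, the following are the standard textbook definitions of the five link functions, written with μ = E(Y|X) and Φ the standard normal cumulative distribution function. That the operator implements exactly these forms is an assumption based on the conventional meaning of the names; the source does not spell them out.

```latex
\begin{aligned}
\text{identity:} \quad & f(\mu) = \mu \\
\text{inverse:}  \quad & f(\mu) = \frac{1}{\mu} \\
\text{log:}      \quad & f(\mu) = \ln \mu \\
\text{logit:}    \quad & f(\mu) = \ln\frac{\mu}{1 - \mu} \\
\text{probit:}   \quad & f(\mu) = \Phi^{-1}(\mu)
\end{aligned}
```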

Inputs

matrix: <double>[r=..., c=...]: The input r x c model matrix, consisting of a single double-precision attribute. The number of columns must be less than or equal to the number of rows (c<=r). The glm operator requires that all of the columns are contained in a single chunk, with no chunk overlap, and assumes that the model matrix has full column rank. The columns of the model matrix represent model variables and may include data variables, encodings of factor variables, an intercept term and others.
response: <double>[r=...]: The response vector of length r is a measurement of the random variable Y in the definition of the generalized linear model stated above. The glm operator requires the response vector to be a 1-D SciDB array containing a single, double-precision attribute whose dimension schema length, chunk size, and overlap must match those of the row dimension schema of the model matrix.
weights: <double>[r=...]: A vector of length r with user-supplied nonnegative weights for each row of the model matrix. The glm operator requires that the weights vector is a 1-D SciDB array containing a single, double-precision attribute whose dimension schema length, chunk size, and overlap must match those of the row dimension schema of the model matrix. If you do not have a weights vector, construct a vector of length r on the fly in which every value is 1. For example, assuming r=10,000 and a model matrix row chunk size of 1000, you could call glm as follows:
```
AFL% glm(mMatrix, response, build(<val:double>[r=0:9999:0:1000],1), 'gamma', 'log');
```
distribution: string: The distribution of the random variable Y. The response term is a measurement of this random variable. The following values for the distribution are allowed: gaussian, poisson, binomial, gamma.
link: string: The optional link function f discussed earlier. The following values for the link function are allowed: identity, inverse, log, logit, probit. However, you cannot use every link function with every distribution. This table shows the accepted combinations:
Distribution	identity	log	inverse	logit	probit
gaussian	yes	yes	yes	no	no
poisson	yes	yes	no	no	no
binomial	no	yes	no	yes	yes
gamma	yes	yes	yes	no	no
If a distribution is specified but the link function is not specified, a default link function is used. The default link function for the corresponding distribution function is listed in bold in the table above. To use the default, specify an empty string, '', as the argument for the link function.
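
For instance, a call that relies on the default link might look like the sketch below. It reuses the hypothetical arrays mMatrix and response and the unit-weights vector from the weights example above (r=10,000, row chunk size 1000); only the distribution and the empty link string differ.

```
AFL% glm(mMatrix, response,
         -- unit weights, one per row of the model matrix
         build(<val:double>[r=0:9999:0:1000],1),
         -- '' selects the default link for the poisson distribution
         'poisson', '');
```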

Outputs

The glm operator returns a 2-D SciDB array with 18 rows and c columns, where c is the number of columns of the input model matrix. Each row of the output matrix represents a distinct output element as follows:

Output Row	Description	Value
row 0	Computed model coefficients (aka beta). The column position corresponds to columns of the input model matrix.	vector
row 1	Model coefficient standard errors.	vector
row 2	Model coefficient score values.	vector
row 3	Model coefficient p values.	vector
row 4	Dispersion.	scalar
row 5	Residual degrees of freedom for the null model, not including an intercept term. If you included an intercept term in your model matrix, subtract one from this value.	scalar
row 6	Residual degrees of freedom.	scalar
row 7	Total number of available observations.	scalar
row 8	Number of nonzero-weighted observations.	scalar
row 9	Convergence flag: 1 if the method converged; any other value means it did not.	scalar
row 10	Number of zero-valued user-supplied weights.	scalar
row 11	Akaike's "an information criterion" (AIC) value.	scalar
row 12	Null deviance.	scalar
row 13	Maximized log likelihood value.	scalar
row 14	Residual deviance.	scalar
row 15	Residual sum of squares.	scalar
row 16	Tolerance.	scalar
row 17	Number of method iterations.	scalar
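
Because the coefficients occupy row 0, one way to pull out just that row is to wrap the glm call in between(), which keeps only the cells inside a coordinate box. This is a minimal sketch, assuming the arrays A_glm1 and A_glm2 from the Examples section below, so that c=5 and the column coordinates run from 0 to 4.

```
AFL% between(
       glm(A_glm1, A_glm2, build(<val:double NOT NULL>[r=0:9:0:10],1), 'gaussian', 'identity'),
       -- low coordinates (row 0, column 0), then high coordinates (row 0, column 4)
       0, 0, 0, 4
     );
```

Setting the first and third coordinates to another index, for example 1 for the standard errors, selects a different output row.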

Limitations

The glm operator's inputs are restricted as follows:

All the columns of the model matrix must be in a single chunk.
Chunk overlap is not supported – specify 0 for the chunk overlap when creating the inputs.
All inputs require a single, double-precision attribute.
The chunking scheme for the rows of the model matrix must match exactly that of the response and weights vectors.
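
If an existing array does not meet these requirements, one option is to repartition it before calling glm. The sketch below assumes a hypothetical 10,000 x 5 array named M with a single double attribute val and dimensions r and c. The target schema puts all five columns in one chunk, uses a row chunk size of 1000, and sets the overlap to 0. repart() is a general SciDB operator rather than something specific to glm, and the names and sizes here are illustrative only.

```
AFL% store(
       repart(
         M,
         -- rows: overlap 0, chunk size 1000; columns: overlap 0, all 5 in one chunk
         <val:double>[r=0:9999:0:1000; c=0:4:0:5]
       ),
       mMatrix
     );
```

The response and weights vectors would then need the matching row schema, [r=0:9999:0:1000].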

Examples

To demonstrate the glm operator, do the following:

Create a 10 x 5 model matrix, A_glm1, by entering:

AFL% store(
       redimension(
         -- project() will select only the attributes val, r, and c.
         project(
           -- apply() adds new attributes val, r, and c.
           apply(
             -- rng_uniform fills an array with random numbers.
             rng_uniform(
               <val:double>[i=0:49:0:50],0,1,'drand48',1000
             ),
             val, double(int64(rng_uniform*10+1)),
             r, i/5,
             c, i%5
           ),
           val, r, c
         ),
         <val:double NOT NULL>[r=0:9:0:10; c=0:4:0:5]
       ),
       A_glm1
     );

The output is:

{r,c} val
{0,0} 1
{0,1} 10
{0,2} 5
{0,3} 1
{0,4} 1
{1,0} 10
{1,1} 9
{1,2} 7
{1,3} 4
{1,4} 3
{2,0} 4
{2,1} 9
{2,2} 5
{2,3} 1
{2,4} 6
{3,0} 8
{3,1} 8
{3,2} 2
{3,3} 5
{3,4} 2
{4,0} 7
{4,1} 6
{4,2} 3
{4,3} 2
{4,4} 4
{5,0} 4
{5,1} 7
{5,2} 5
{5,3} 9
{5,4} 8
{6,0} 10
{6,1} 8
{6,2} 5
{6,3} 4
{6,4} 6
{7,0} 5
{7,1} 7
{7,2} 4
{7,3} 3
{7,4} 2
{8,0} 6
{8,1} 1
{8,2} 7
{8,3} 10
{8,4} 9
{9,0} 5
{9,1} 3
{9,2} 2
{9,3} 9
{9,4} 2

Create a response vector of length 10, A_glm2, by entering:

AFL% store(
           project(
                   apply(
                         rng_uniform(<val:double NOT NULL>[r=0:9:0:10],0,1,'drand48',4350),
                         val, rng_uniform),
                   val), A_glm2);

The output is:

{r} val
{0} 0.389679
{1} 0.710839
{2} 0.523507
{3} 0.415214
{4} 0.570819
{5} 0.837386
{6} 0.17413
{7} 0.393471
{8} 0.110539
{9} 0.412687

Fit a Gaussian model with the identity link and unit weights by entering:

AFL% glm(A_glm1, A_glm2, build(<val:double NOT NULL>[r=0:9:0:10],1), 'gaussian', 'identity');

The output is:

{r,c} value
{0,0} -0.0101387
{0,1} 0.0619054
{0,2} -0.0195615
{0,3} 0.0313987
{0,4} 0.00769056
{1,0} 0.0283337
{1,1} 0.0266538
{1,2} 0.0566258
{1,3} 0.0277375
{1,4} 0.0412659
{2,0} -0.357832
{2,1} 2.32258
{2,2} -0.345452
{2,3} 1.13199
{2,4} 0.186366
{3,0} 0.735079
{3,1} 0.0678355
{3,2} 0.743813
{3,3} 0.308987
{3,4} 0.859483
{4,0} 0.0607375
{5,0} 10
{6,0} 5
{7,0} 10
{8,0} 10
{9,0} 1
{10,0} 0
{11,0} 5.43536
{12,0} 0.438733
{13,0} 3.28232
{14,0} 0.303687
{15,0} 0.303687
{16,0} 0
{17,0} 3
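
Several of the reported values can be cross-checked by hand against the row descriptions in the Outputs table. The relations below are a sketch. The score as coefficient divided by standard error, the dispersion as residual sum of squares over residual degrees of freedom, and the Gaussian log-likelihood and AIC formulas (with five coefficients plus one dispersion parameter) are standard conventions that reproduce the printed numbers; the source does not state that glm computes them this way.

```latex
\begin{aligned}
\text{score (row 2, column 0):} \quad & \frac{-0.0101387}{0.0283337} \approx -0.35783 \\
\text{residual d.f. (row 6):} \quad & 10 - 5 = 5 \\
\text{dispersion (row 4):} \quad & \frac{0.303687}{5} \approx 0.060737 \\
\text{log likelihood (row 13):} \quad & -\frac{10}{2}\left[\ln(2\pi) + \ln\!\left(\frac{0.303687}{10}\right) + 1\right] \approx 3.2823 \\
\text{AIC (row 11):} \quad & -2(3.28232) + 2(5 + 1) = 5.43536
\end{aligned}
```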

Remove the example arrays by entering:

AFL% remove(A_glm1);

and

AFL% remove(A_glm2);
