Geographically Weighted Regression Analysis

Geographically weighted regression (GWR) is a local regression model used in spatial analysis. It detects spatial non-stationarity in relationships by embedding spatial structure (the locations of the observations) into a linear regression model, so that the regression coefficients can vary from place to place. Through regression analysis, you can model, examine, and explore spatial relationships, explain the factors behind observed spatial patterns, and predict the phenomena being studied.
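For orientation, the local model can be written in the notation commonly used for GWR. This page does not state the model explicitly, so the symbols below are assumptions: (u_i, v_i) are the coordinates of regression point i, p is the number of explanatory variables, and W_i is a diagonal matrix whose entries are the kernel weights between point i and every other point j.

```latex
% GWR model: every coefficient is a function of location
y_i = \beta_0(u_i, v_i) + \sum_{k=1}^{p} \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i

% Local weighted least-squares estimate at regression point i
\hat{\beta}(u_i, v_i) = \left( X^{\top} W_i X \right)^{-1} X^{\top} W_i\, y,
\qquad W_i = \operatorname{diag}\left( W_{i1}, \dots, W_{in} \right)
```

When every weight equals 1, the estimate reduces to ordinary least squares, which is why GWR is described as a local extension of a linear regression model.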

For the analysis principles and application scenarios, see the Regression analysis page.

Function entrance

  • Spatial Statistical Analysis tab-> Spatial Relationship Modeling-> Geographically Weighted Regression Analysis. (iDesktopX)
  • Toolbox-> Spatial Statistical Analysis-> Spatial Relationship Modeling-> Geographically Weighted Regression Analysis. (iDesktopX)

Main parameters

  • Source Data: Set the vector dataset to be analyzed. Point, line, and region (polygon) datasets are supported. Note: The number of objects in the source data should be greater than 20.
  • Explanatory Fields: The explanatory variables are the independent variables, that is, X in the regression equation, which are used to model or predict the value of the dependent variable. For example, suppose you want to study the causes of obesity and check whether obesity is correlated with factors such as income, healthy food intake, and education level. In this example, obesity is the dependent variable (Y), and income, healthy food intake, and education level are the explanatory variables (X).
  • Kernel function type: Set the type of function used to calculate the distance weight between two points. The kernel function types listed below are supported. In each formula, $W_{ij}$ is the weight between point i and point j, $d_{ij}$ is the distance between point i and point j, and $b$ is the bandwidth range.
    • Quadratic Kernel (also known as the bisquare kernel): $W_{ij} = (1 - (d_{ij}/b)^2)^2$ if $d_{ij} \le b$; otherwise $W_{ij} = 0$.
    • Boxcar Kernel: $W_{ij} = 1$ if $d_{ij} \le b$; otherwise $W_{ij} = 0$.
    • Gaussian Kernel: $W_{ij} = e^{-(d_{ij}/b)^2 / 2}$.
    • Tricube Kernel: $W_{ij} = (1 - (d_{ij}/b)^3)^3$ if $d_{ij} \le b$; otherwise $W_{ij} = 0$.
  • Modeling field: The dependent variable (Y in the regression equation), that is, the variable to be studied and predicted. Only numeric fields are supported.
  • Bandwidth Mode: Set how the analysis bandwidth range is determined. The following three modes are supported:
    • Akaike Information Criterion (AICc): Determines the bandwidth range using the corrected Akaike Information Criterion (AICc). Use this mode when you are unsure which distance or number of neighbors to specify.
    • Validate: Determines the bandwidth range by cross-validation (CV). When estimating the regression coefficients for a regression point, cross-validation excludes the point itself, so the regression is computed only from the surrounding data points. For each regression point, the difference between the resulting estimate and the actual value is computed; the sum of the squares of these differences is the CV value. Use this mode when you are unsure which distance or number of neighbors to specify.
    • Fixed Distance or Number of Neighbors: Determines the bandwidth range from a fixed distance or a fixed number of neighbors. You must set a value for the distance or the number of neighbors.
  • Bandwidth type: Fixed Bandwidth and Variable Bandwidth are provided:
    • Fixed Bandwidth: If the Bandwidth Mode is Fixed Distance or Number of Neighbors, you must set the bandwidth range by specifying a fixed distance. If the Bandwidth Mode is AICc or Validate, you do not need to specify a distance; the program calculates a fixed distance from the data.
    • Variable Bandwidth: If the Bandwidth Mode is Fixed Distance or Number of Neighbors, you must set the number of neighbors, and the application uses the distance between the regression point and its nearest neighboring points as the bandwidth range. If the Bandwidth Mode is AICc or Validate, you do not need to specify the number of neighbors; the program determines the neighboring points from the data and calculates the bandwidth distance accordingly.
  • Prediction Item: Optionally perform a geographically weighted regression prediction based on the analysis result.
    • Forecast Data Settings: Set the datasource in which the forecast data is stored and the dataset name.
    • Field mapping: The correspondence between the fields of the prediction dataset and the fields of the source dataset. If it is not set, all explanatory fields in the source dataset must be present in the prediction dataset.
    • Forecast Result Data Settings: Set the datasource and dataset name where the forecast result data will be saved.
  • Result Settings: Set the Datasource and Dataset Name where the Result Data will be saved.
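As an illustration of how the kernel functions and the Validate bandwidth mode described above fit together, the following is a minimal Python sketch using NumPy. It is not the implementation used by the software: the function names `kernel_weights` and `cv_score`, the synthetic data, and the candidate bandwidths are all assumptions made for the example, and only the fixed-bandwidth case is shown.

```python
import numpy as np


def kernel_weights(d, b, kind="gaussian"):
    """Weights W_ij for distances d and bandwidth b, following the kernel formulas above."""
    d = np.asarray(d, dtype=float)
    r = d / b
    inside = d <= b  # compact-support kernels are zero beyond the bandwidth
    if kind == "quadratic":
        return np.where(inside, (1.0 - r**2) ** 2, 0.0)
    if kind == "boxcar":
        return np.where(inside, 1.0, 0.0)
    if kind == "gaussian":
        return np.exp(-0.5 * r**2)
    if kind == "tricube":
        return np.where(inside, (1.0 - r**3) ** 3, 0.0)
    raise ValueError(f"unknown kernel: {kind}")


def cv_score(coords, X, y, b, kind="gaussian"):
    """Leave-one-out CV value for a fixed bandwidth b (sum of squared differences)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])  # design matrix with an intercept column
    total = 0.0
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)
        w = kernel_weights(d, b, kind)
        w[i] = 0.0  # exclude the regression point itself
        sw = np.sqrt(w)
        # Weighted least squares via row scaling by sqrt(w)
        beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
        total += (y[i] - A[i] @ beta) ** 2
    return total


# Hypothetical data: the slope drifts from west to east (spatial non-stationarity)
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(60, 2))
x = rng.normal(size=60)
y = 1.0 + (0.5 + 0.2 * coords[:, 0]) * x + rng.normal(scale=0.1, size=60)

candidates = [1.0, 2.0, 4.0, 8.0]  # assumed candidate bandwidths
scores = {b: cv_score(coords, x[:, None], y, b) for b in candidates}
best_b = min(scores, key=scores.get)
print(best_b, scores[best_b])
```

The sketch simply picks the candidate bandwidth with the smallest CV value. A production tool would typically search the bandwidth more carefully and would also support the variable (neighbor-count) bandwidth type.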

Result output

After setting the above parameters, click the "OK" button in the dialog box to run the Geographically Weighted Regression Analysis. The result dataset contains the following result attribute fields: Validate (CVScore), Predicted, regression coefficients (Intercept, C1_<explanatory field name>), Residual, standard errors (StdError), coefficient standard errors (SE_Intercept, SE1_<explanatory field name>), pseudo-t values (TV_Intercept, TV1_<explanatory field name>), and studentised residual (StdResidual), as shown in the following figure:

  • Validate (CVScore): For each regression point, the difference between the value estimated in cross-validation and the actual value. The sum of the squares of these differences is the CV value, which serves as an indicator of model performance.

  • Predicted: These values are estimates (or fits) from geographically weighted regression.

  • Regression coefficient (Intercept): The intercept of the geographically weighted regression model. It represents the predicted value of the dependent variable when all explanatory variables are zero.

  • Regression coefficient (C1_<explanatory field name>): The regression coefficient of an explanatory field, which indicates the strength and type of the relationship between that explanatory variable and the dependent variable. If the coefficient is positive, the relationship is positive; if it is negative, the relationship is negative. A strong relationship gives a relatively large coefficient; a weak relationship gives a coefficient close to 0.

  • Residual: The unexplained portion of the dependent variable, that is, the difference between the estimated and actual values. Standardized residuals have a mean of 0 and a standard deviation of 1. Residuals can be used to judge how well the model fits: small residuals indicate that the model fits well and explains most of the variation in the dependent variable, so the regression equation is valid.

  • Standard Errors (StdError): The standard error of each estimate, which measures its reliability. A smaller standard error indicates a smaller difference between the fitted value and the actual value, and therefore a better fit.

  • Coefficient Standard Errors (SE_Intercept, SE1_<explanatory field name>): These values measure the reliability of each regression coefficient estimate. An estimate is more reliable when its standard error is small relative to the coefficient itself. Large standard errors may indicate a local multicollinearity problem.

  • Pseudo-t value (TV_Intercept, TV1_<explanatory field name>): The significance test statistic for each regression coefficient. When the absolute value of the t value is greater than the critical value, the null hypothesis is rejected and the regression coefficient is significant, that is, the estimate of the regression coefficient is reliable. When it is less than the critical value, the null hypothesis is accepted and the regression coefficient is not significant.

  • Studentised residual (StdResidual): The ratio of the residual to the standard error, which can be used to judge whether the data are abnormal. If all values fall within the interval (-2, 2), the data are normally distributed with homogeneous variance. Values outside (-2, 2) indicate abnormal data that do not satisfy normality and homogeneity of variance.

    The following figure compares the predicted and actual values from a Geographically Weighted Regression Analysis that models housing prices in downtown Beijing in 2016, using the number of nearby subway stations and the length of subway lines as explanatory variables. The orange line in the figure below is the actual housing price and the blue line is the fitted housing price.
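The relationships among the result fields described above can be illustrated with a short Python sketch. All numbers are hypothetical, the sign convention for the residual (actual minus predicted) is an assumption, the pseudo-t value is assumed to be the coefficient divided by its standard error, and the critical value of 1.96 is only a commonly used example, not a value stated on this page.

```python
import numpy as np

# Hypothetical values for three features, named after the result fields
actual = np.array([10.0, 12.0, 9.0])
predicted = np.array([9.5, 12.5, 11.5])   # Predicted
std_error = np.array([0.5, 0.5, 1.0])     # StdError
c1 = np.array([2.0, 0.3, -1.5])           # C1_<explanatory field name>
se1 = np.array([0.5, 0.6, 0.5])           # SE1_<explanatory field name>

residual = actual - predicted             # Residual (assumed sign convention)
std_residual = residual / std_error       # StdResidual: residual / standard error
tv1 = c1 / se1                            # pseudo-t value (assumed definition)

abnormal = np.abs(std_residual) > 2       # outside the interval (-2, 2)
critical_value = 1.96                     # assumed example threshold
significant = np.abs(tv1) > critical_value

print(std_residual, abnormal)
print(tv1, significant)
```

In this made-up example only the third feature falls outside (-2, 2), and the coefficient of the second feature is not significant because its standard error is larger than the coefficient itself.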

After successful execution, the Output Window displays the result information for the analysis, as shown in the following figure. If the residuals of the analysis result are within a range acceptable to you, the result can be used together with the explanatory variables to make predictions.

The housing prices in downtown Beijing in 2017 can then be predicted from the result attribute values of the Geographically Weighted Regression Analysis together with the number of subway stations and the length of subway lines in 2017. The calculation formula is: 2017 forecast housing price = Intercept + C1 × (number of subway stations in 2017) + C2 × (subway line length in 2017), where C1 and C2 are the regression coefficients of explanatory fields 1 and 2. With this formula, the 2017 housing prices in the central urban area of Beijing can be predicted from the changes to the subway. The results are shown in the figure below. They show that housing prices fluctuate to a certain extent in the areas where the subway lines have changed.
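The prediction formula above can be applied feature by feature, because each feature has its own local Intercept, C1, and C2. The following Python sketch shows the arithmetic. The function name `predict_price` and every number are hypothetical and are not taken from the Beijing data.

```python
import numpy as np


def predict_price(intercept, c1, c2, stations, line_length):
    """Forecast = Intercept + C1 * subway stations + C2 * subway line length."""
    return intercept + c1 * stations + c2 * line_length


# Hypothetical local coefficients for two features (from the result dataset)
intercept = np.array([30000.0, 28000.0])
c1 = np.array([1500.0, 1200.0])    # coefficient of explanatory field 1
c2 = np.array([200.0, 100.0])      # coefficient of explanatory field 2

# Hypothetical 2017 values of the explanatory variables
stations_2017 = np.array([3.0, 2.0])
line_length_2017 = np.array([5.0, 4.0])

forecast_2017 = predict_price(intercept, c1, c2, stations_2017, line_length_2017)
print(forecast_2017)
```

Because the coefficients differ between features, the same change in subway stations or line length can shift the forecast by different amounts in different areas.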

Related topics

Regression analysis

Ordinary Least Squares

Basic vocabulary

Measuring Geographic Distributions

Clustering Distribution

Analysis Mode