Cluster and Outlier Analysis identifies statistically significant hot spots, cold spots, and spatial outliers. It uses the Anselin Local Moran's I statistic (local Moran's I) to analyze the weighted features.
A scatter plot is a common way to show the correlation between two variables in data analysis. To represent the spatial autocorrelation of a single variable, a Moran scatter plot can be used.
Moran scatter plot
The Moran scatter plot can be used to explore the global pattern of spatial association and to identify spatial anomalies and local non-stationarity. The observed value of the variable at each position is plotted on the horizontal axis, and its spatial lag (the weighted average of the values at the neighboring positions) is plotted on the vertical axis, so the scattered points show the relationship between the two directly.
The Moran scatter plot is divided into four quadrants, corresponding to four different types of local spatial association:
- Upper right quadrant (H-H): the observation Zi is greater than the mean (high), and its spatial lag is also greater than the mean (high).
- Lower left quadrant (L-L): the observation Zi is less than the mean (low), and its spatial lag is also less than the mean (low).
- Upper left quadrant (L-H): the observation Zi is less than the mean (low), but its spatial lag is greater than the mean (high).
- Lower right quadrant (H-L): the observation Zi is greater than the mean (high), but its spatial lag is less than the mean (low).
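The quadrant assignment above can be sketched in a few lines of Python. This is a minimal illustration rather than the tool's implementation: the helper names (`standardize`, `spatial_lag`, `quadrant`), the sample values, and the chain-shaped neighbor matrix are all assumptions made for the example, and a value exactly at the mean is arbitrarily treated as high.

```python
from statistics import mean, pstdev

def standardize(values):
    """Convert raw observations to z-scores (mean 0, standard deviation 1)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def spatial_lag(z, weights):
    """Weighted average of the neighbors' z-scores for every feature."""
    lags = []
    for row in weights:
        total = sum(row)
        lag = sum(w * zj for w, zj in zip(row, z)) / total if total else 0.0
        lags.append(lag)
    return lags

def quadrant(zi, lag_i):
    """Label a point of the Moran scatter plot by its quadrant."""
    if zi >= 0 and lag_i >= 0:
        return "HH"  # upper right
    if zi < 0 and lag_i < 0:
        return "LL"  # lower left
    if zi < 0:
        return "LH"  # upper left: low value, high neighbors
    return "HL"      # lower right: high value, low neighbors

# Hypothetical data: six features in a row, each adjacent to the next one.
values = [10, 12, 11, 2, 1, 3]
n = len(values)
weights = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]

z = standardize(values)
lags = spatial_lag(z, weights)
labels = [quadrant(zi, li) for zi, li in zip(z, lags)]
print(labels)
```

Plotting `z` against `lags` gives the scatter plot itself; the labels only record which quadrant each point falls in.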
Meaning of the spatial relationships represented by the different quadrants:
- The upper right quadrant (H-H) and the lower left quadrant (L-L) correspond to positive spatial autocorrelation, indicating that the observation at that location is similar to the observations of its surrounding neighbors. The upper right quadrant (H-H) corresponds to high-high similarity, and the lower left quadrant (L-L) corresponds to low-low similarity.
- The upper left quadrant (L-H) and the lower right quadrant (H-L) correspond to negative spatial autocorrelation, indicating that the observation at that location differs from those of its surrounding neighbors. The upper left quadrant (L-H) corresponds to low-high dissimilarity (a low value surrounded by high values), and the lower right quadrant (H-L) corresponds to high-low dissimilarity (a high value surrounded by low values).
- The upper right and lower left quadrants correspond to a positive spatial association (high-high) and a negative spatial association (low-low) in the G statistic, respectively. The relative density of points in these two quadrants shows to what extent the global spatial association pattern is determined by associations among high values versus associations among low values.
- The relative density of points in the upper left and lower right quadrants shows which form of negative spatial association is dominant.
- In addition, potential spatial anomalies can be found by looking at the upper left and lower right quadrants of the Moran scatter plot. Draw a circle with a radius of 2 centered on the origin of the scatter plot (the point where the four quadrants meet); observations outside the circle can be considered outliers. This is because the Moran scatter plot is constructed from a standardized variable and its spatial lag, so a distance of 2 units on the plot corresponds to a deviation of two standard deviations from the mean.
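The radius-2 rule can be illustrated as follows. This is a sketch under stated assumptions: the function name `scatter_outliers` and the sample coordinates are hypothetical, and the inputs are taken to be the standardized values and spatial lags already plotted on the Moran scatter plot.

```python
import math

def scatter_outliers(z, lags, radius=2.0):
    """Return indices of points outside a circle of the given radius
    centered on the origin of the Moran scatter plot."""
    return [
        i for i, (zi, li) in enumerate(zip(z, lags))
        if math.hypot(zi, li) > radius
    ]

# Hypothetical standardized values and their spatial lags.
z = [0.5, -2.5, 1.5]
lags = [0.3, 1.0, -1.5]
print(scatter_outliers(z, lags))  # indices of candidate outliers
```

Points flagged this way are only candidates; whether they are treated as anomalies still depends on the quadrant they fall in and on the analyst's judgment.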
The Moran significance map is obtained when only the observations that are significantly high or significantly low in the Moran scatter plot are displayed. If significant observations fall in the first or third quadrant of the scatter plot, significant spatial clustering is considered to exist; if they fall in the second or fourth quadrant, significant spatial differences are considered to exist.
Application case
- Where is the clearest boundary between the rich and poor areas in the study area?
- Where are the anomalous consumption patterns in the study area?
- Where is the unexpectedly high incidence of diabetes in the study area?
Function entrance
- Spatial Statistical Analysis tab -> Clustering Distribution -> Cluster and Outlier Analysis. (iDesktopX)
- Toolbox -> Spatial Statistical Analysis -> Clustering Distribution -> Cluster and Outlier Analysis. (iDesktopX)
Main parameters
- Source Data: Set the vector dataset to be analyzed. Three dataset types are supported: point, line, and region.
- Evaluation Field: Set the property field whose values are used in the analysis. Only numeric fields are supported.
- Conceptual Model: Set the way features interact with each other in space. The choice should reflect the inherent relationships between the features being analyzed. The more realistic the model, the more accurate the results will be.
  - Fixed Distance: Suitable for point data and for region data in which the regions vary greatly in size.
  - Polygon Adjacent (Common Edges/Intersect): Applies to region data with adjacent edges or intersections.
  - Polygon Adjacent (Node/Common Edges/Intersect): Applies to region data with adjacent nodes, adjacent edges, or intersections.
  - Inverse Distance: Every feature is treated as a neighbor of every other feature. All features contribute to the target feature, but the contribution decreases as the distance increases. The weight between two features is the reciprocal of the distance between them. Applies to continuous data.
  - Inverse Distance Square: Similar to Inverse Distance, but the influence decreases more rapidly as distance increases; the weight between two features is the reciprocal of the squared distance.
  - K Nearest Neighbors: The K features closest to the target feature are included in its calculation (with a weight of 1), and the remaining features are excluded (with a weight of 0). This option is useful if you want to ensure a minimum number of neighbors for each feature. It works well when the distribution of the data varies over the study area so that some features are distant from all others. When a fixed scale of analysis is less important than a fixed number of neighbors, the K nearest neighbors method is more suitable.
  - Spatial Weight Matrix: A spatial weight matrix file is required. Spatial weights are numbers that reflect the distance, time, or other cost between each feature and every other feature in the dataset. If you want to model the accessibility of urban services, or to find areas where urban crime is concentrated, for example, modeling spatial relationships with a network is a good approach. Before the analysis, create a spatial weight matrix file (.swmb) using the Generate Cyberspace Weights Tool, and then specify the full path to the SWMB file you created.
  - Undifferentiated Region: This model combines Inverse Distance and Fixed Distance and treats every feature as a neighbor of every other feature. Features within the specified fixed distance have equal weight (a weight of 1); features beyond that distance have less influence as the distance increases. This option is not suitable for large datasets.
- Break Distance Tolerance: "-1" means that a default distance is calculated and applied; the default is chosen so that each feature has at least one neighbor. "0" means that no distance is applied and every feature is a neighbor of every other feature. A positive value means that two features are neighbors when the distance between them is less than this value.
- Inverse Distance Power Index: An exponent that controls how strongly distance affects the weight. The higher the power, the faster the influence of distant features falls off.
- Number of Adjacent Elements: Set a positive integer K; the K features nearest to the target feature are treated as its neighbors.
- Measure Distance Method: Supports Euclidean distance and Manhattan distance. For a detailed description of both, refer to the Basic Vocabulary of Spatial Statistical Analysis.
- Spatial Weights Matrix Standardization: Recommended when the distribution of the features may be biased by the sampling design or by the aggregation scheme applied. When standardization is selected, each weight is divided by the sum of its row (the sum of the weights of all neighbors of that feature). Standardized weights are typically used with fixed-distance neighbors and are almost always used with neighbors based on polygon adjacency. This reduces the bias that arises when features have different numbers of neighbors. Standardization scales all weights to between 0 and 1, creating a relative (rather than absolute) weighting scheme. You may want to select this option whenever you are working with region features that represent administrative boundaries.
- FDR Correction or not: If False Discovery Rate (FDR) correction is enabled, statistical significance is based on the FDR-corrected threshold. Otherwise, statistical significance is based on the P-value and Z-score fields.
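To make the weighting parameters more concrete, the sketch below builds inverse-distance and K-nearest-neighbor weights for a handful of points and then row-standardizes them. It is an illustration only: the function names and sample coordinates are assumptions, coincident points are given a weight of 0 as an arbitrary choice, and nothing here reflects how the tool itself stores or computes its weight matrix.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def manhattan(a, b):
    """Sum of the absolute coordinate differences."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def inverse_distance_weights(points, power=1.0, dist=euclidean):
    """Weight = 1 / distance**power; power=2 gives inverse distance squared."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d = dist(points[i], points[j])
                # Assumption: coincident points get weight 0 to avoid dividing by zero.
                w[i][j] = 1.0 / d ** power if d > 0 else 0.0
    return w

def knn_weights(points, k, dist=euclidean):
    """Weight 1 for the k nearest features, 0 for all others."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(points[i], points[j]))
        for j in others[:k]:
            w[i][j] = 1.0
    return w

def row_standardize(w):
    """Divide each weight by its row sum so every non-empty row sums to 1."""
    out = []
    for row in w:
        total = sum(row)
        out.append([x / total if total else 0.0 for x in row])
    return out

# Hypothetical point coordinates.
points = [(0, 0), (1, 0), (3, 0)]
print(row_standardize(inverse_distance_weights(points)))
print(knn_weights(points, k=1))
```

Raising `power` makes the weights of distant features shrink faster, which is the effect the Inverse Distance Power Index describes.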
Results
The analysis result is a Region Dataset with four property fields: ALMI_MoranI (local Moran's I), ALMI_Zscore (z-score), ALMI_Pvalue (p-value), and ALMI_Type (cluster and outlier type). The map is rendered by the ALMI_Type field, and a statistical histogram of the evaluation field is displayed in the statistic chart window. The specific meaning of each field is explained as follows:
Because clusters and outliers are identified at the 95% confidence level, the ALMI_Type field has a value only if the p-value is less than 0.05. If False Discovery Rate (FDR) correction is applied, statistical significance is based on the corrected confidence level (the p-value threshold is lowered from 0.05 to a new value) to account for multiple testing and spatial dependence.
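How an FDR correction lowers the p-value threshold can be illustrated with the Benjamini-Hochberg procedure, a widely used FDR method. This document does not state which procedure the tool applies, so treating it as Benjamini-Hochberg is an assumption, and the function name `bh_threshold` and the sample p-values are hypothetical.

```python
def bh_threshold(p_values, alpha=0.05):
    """Return the largest p-value still significant under the
    Benjamini-Hochberg procedure, or None if nothing is significant."""
    m = len(p_values)
    threshold = None
    for rank, p in enumerate(sorted(p_values), start=1):
        # The k-th smallest p-value is compared against alpha * k / m.
        if p <= alpha * rank / m:
            threshold = p
    return threshold

# Hypothetical p-values for ten features.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
cutoff = bh_threshold(p_values)
significant = [p for p in p_values if cutoff is not None and p <= cutoff]
print(cutoff, significant)
```

In this sample, five p-values are below 0.05, but only the two smallest remain significant after the correction.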
P-value (ALMI_Pvalue) | Local Moran's I (ALMI_MoranI) | Interpretation | Cluster and outlier type (ALMI_Type) |
---|---|---|---|
P<0.05 | M>0 | The feature is part of a cluster of high or low values. | HH (high-value cluster) or LL (low-value cluster) |
P<0.05 | M<0 | The feature is an outlier. | HL (high value surrounded by low values) or LH (low value surrounded by high values) |

P-value (ALMI_Pvalue) | Z-score (ALMI_Zscore) | Interpretation | Cluster and outlier type (ALMI_Type) |
---|---|---|---|
P<0.05 | Z>0 | The surrounding features have similar values (high or low). | HH (high-value cluster) or LL (low-value cluster) |
P<0.05 | Z<0 | The feature is a statistically significant spatial outlier. | HL (high value surrounded by low values) or LH (low value surrounded by high values) |
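The rules in the tables above can be written as a small function. This is a sketch, not the tool's code: the name `classify` is hypothetical, and the tables do not say how HH is told apart from LL (or HL from LH), so the sketch assumes the usual convention of using the sign of the feature's own standardized value.

```python
def classify(z_value, local_i, p_value, alpha=0.05):
    """Return the cluster/outlier type, or "" when the result is not significant.

    z_value: the feature's standardized evaluation value (assumed input)
    local_i: the feature's local Moran's I
    p_value: the p-value of the local Moran's I
    """
    if p_value >= alpha or local_i == 0:
        return ""  # not significant, or no association either way
    if local_i > 0:
        # Similar to its neighbors: a cluster of high or of low values.
        return "HH" if z_value > 0 else "LL"
    # Dissimilar to its neighbors: a spatial outlier.
    return "HL" if z_value > 0 else "LH"

# Hypothetical rows: (standardized value, local Moran's I, p-value).
rows = [(1.2, 0.8, 0.01), (-1.2, 0.8, 0.01), (1.2, -0.8, 0.01),
        (-1.2, -0.8, 0.01), (1.2, 0.8, 0.20)]
print([classify(*row) for row in rows])
```

When FDR correction is enabled, `alpha` would be replaced by the corrected threshold rather than the fixed 0.05.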
Instance
Case Data: Click here to download the ClusterDistributions case data. After downloading, unzip it for use.
Cluster and outlier analysis was performed on the number of cases in 2013 from the county-level viral hepatitis data (Pneumonia). The evaluation field was set to the number of cases in 2013, the conceptual model was Inverse Distance, and the Measure Distance method was Euclidean distance. The spatial weight matrix was standardized, FDR correction was selected, and the other parameters kept their defaults. The property table of the result dataset is as follows:
Under the assumption of random distribution, the results show that:
- The Z value is significant in the red area in the northwest, where the number of cases shows obvious high-value spatial clustering.
- The small spatial difference within this area, together with the high values of the area itself and of its surroundings (HH), indicates that it has more viral hepatitis patients than other areas: both the area itself and its neighbors have many patients. Medical facilities in the red area need to guard against a further increase in viral hepatitis cases.
Overall, the local Moran's I is not significant for most of the study area, and the significant part shows high-value clustering.
The statistical histogram for the morbidity field is as follows: