Logistic regression training

Instructions for Use

Logistic regression is a widely used machine learning classification algorithm that estimates the probability that an event occurs. Despite the word 'regression' in its name, it is a classification method, used mainly for binary classification problems in which the output takes one of two categories. Examples include predicting whether a patient will recover or whether a customer will purchase a product. This tool performs the training step of logistic regression: it fits a model to the features of the training data, and the resulting model can then be used for prediction.

The tool returns a summary of the logistic regression model:

* IRCharacteristics: The attributes of the logistic regression model.
* Variable: An array of field names in the logistic regression model, i.e. the independent-variable fields used to train the model.
* Mse: Mean squared error, the mean of the squared differences between the predicted values and the true values.
* RMSE: Root mean squared error, the square root of the mean squared error.
* Mae: Mean absolute error, the mean of the absolute differences between the predicted values and the true values.
* R2: Coefficient of determination, used to judge the quality of the model. Its value range is [0,1], and in general a larger R2 indicates a better fit. R2 is only a rough indicator of accuracy: it tends to increase as more explanatory variables are added, so it cannot by itself quantify how accurate the model is.
* Explained Variance: The explained variance.
* NumIterations: The actual number of iterations performed.
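The error metrics in the model summary can be illustrated with a short sketch. This is not the tool's own code: the function name `summary_metrics` and the sample labels are hypothetical, and R2 is assumed to follow the usual definition 1 - SS_res / SS_tot. Explained variance is left out because the text above does not define how it is computed.

```python
import math


def summary_metrics(y_true, y_pred):
    """Return Mse, RMSE, Mae and R2 for paired true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    sse = sum(e * e for e in errors)            # sum of squared errors
    mse = sse / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all true values are identical")

    return {
        "Mse": mse,
        "RMSE": math.sqrt(mse),
        "Mae": sum(abs(e) for e in errors) / n,
        "R2": 1 - sse / ss_tot,                 # assumed definition, see lead-in
    }


# Hypothetical binary labels: one of four predictions is wrong.
metrics = summary_metrics([0, 1, 1, 0], [0, 1, 0, 0])
print(metrics)
```

With 0/1 class labels, Mse and Mae both reduce to the fraction of misclassified samples, which is why they coincide in this example.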

Parameter Description

| Parameter Name | Default Value | Parameter Definition | Parameter Type |
|---|---|---|---|
| Modeling field | | Name of the modeling field. This is the field used to train the model, i.e. the dependent variable. It holds the known (training) values of the variable that will be predicted at unknown locations. In this method the modeling field must contain integer class labels | String |
| Explanation field | | Set of explanatory field names. Enter one or more field names from the training dataset to use as the explanatory variables of the model | String |
| Distance explanatory variable dataset (Optional) | | Distance explanatory variable datasets, given as an array of objects constructed with ExplanatoryDistanceRDD. Each object holds a distance explanatory variable dataset and a search distance. The tool calculates the closest distance between each given dataset and the input training dataset and automatically creates an explanatory variable column named after the input distanceFieldName. If distance explanatory variable datasets are supplied for training, prediction distance explanatory variable datasets must also be supplied when the model is used for prediction. They must correspond to the explanatory variable names created during training and use the same search distance | ExplanatoryDistanceRDD |
| Maximum number of iterations (Optional) | 100 | Maximum number of iterations. Must be greater than 0 | Integer |
| Regular term parameter (Optional) | 0.0 | Regularization parameter. Controls the balance between the loss function and the penalty term to prevent overfitting during training. Must be greater than or equal to 0 | Double |
| Regularization selection method (Optional) | 0.0 | Selects the regularization method. 0.0 is L2 regularization and 1.0 is L1 regularization; the value range is [0.0,1.0]. The main purpose of regularization is to reduce overfitting | Double |
| Model save directory (Optional) | | Directory in which the logistic regression model is saved | String |
| Training dataset | | Connection information for accessing the data, including the data type, connection parameters, dataset name, and so on. Set it as '--key=value' pairs, with multiple pairs separated by spaces. HBase data: --providerType=hbase --hbase.zookeepers=192.168.12.34:2181 --hbase.catalog=demo --dataset=dltb; DSF data on HDFS: --providerType=dsf --path=hdfs://ip:9000/dsfdata; local DSF data: --providerType=dsf --path=/home/dsfdata | String |
| Data query conditions (Optional) | | Data query conditions. Attribute conditions and spatial queries are supported, for example SmID<100 and BBOX(the_geom, 120,30,121,31) | String |
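The interaction between the regular term parameter and the regularization selection method can be sketched as follows. The parameter table does not give the formula, so this assumes the common elastic-net form, in which the selection value mixes the L1 and L2 penalties and the regular term parameter scales the result. The names `elastic_net_penalty`, `reg_param` and `mix` are hypothetical and are not part of the tool.

```python
def elastic_net_penalty(weights, reg_param, mix):
    """Penalty added to the loss under an assumed elastic-net formulation.

    reg_param: regular term parameter (>= 0), scales the whole penalty.
    mix:       regularization selection value in [0.0, 1.0];
               0.0 gives pure L2, 1.0 gives pure L1.
    """
    if reg_param < 0:
        raise ValueError("reg_param must be >= 0")
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must be in [0.0, 1.0]")

    l1 = sum(abs(w) for w in weights)       # L1 norm of the weights
    l2_sq = sum(w * w for w in weights)     # squared L2 norm of the weights
    return reg_param * (mix * l1 + (1.0 - mix) / 2.0 * l2_sq)


weights = [3.0, -4.0]                       # hypothetical model coefficients
print(elastic_net_penalty(weights, 0.1, 0.0))   # pure L2 penalty
print(elastic_net_penalty(weights, 0.1, 1.0))   # pure L1 penalty
print(elastic_net_penalty(weights, 0.0, 0.5))   # default reg_param: no penalty
```

With the default regular term parameter of 0.0 the penalty vanishes, so the selection method only has an effect once the regular term parameter is set above 0.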
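The training dataset connection format can be illustrated with a small sketch that splits a connection string into its keys and values. It assumes the intended form is `--key=value` with no space after the dashes and with pairs separated by spaces. The helper `parse_connection` is hypothetical and only illustrates the format; it is not the tool's own parser.

```python
def parse_connection(conn):
    """Split a '--key=value --key=value' string into a dict."""
    params = {}
    for token in conn.split():
        if not token.startswith("--") or "=" not in token:
            raise ValueError(f"expected --key=value, got {token!r}")
        # Split on the first '=' only, so values may contain '=' themselves.
        key, value = token[2:].split("=", 1)
        params[key] = value
    return params


hbase = parse_connection(
    "--providerType=hbase --hbase.zookeepers=192.168.12.34:2181 "
    "--hbase.catalog=demo --dataset=dltb"
)
local = parse_connection("--providerType=dsf --path=/home/dsfdata")

print(hbase["providerType"], hbase["dataset"])
print(local["path"])
```

Because pairs are separated by spaces, a value that itself contains a space (for example a path with spaces) would not survive this simple split.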