Read Vector Data

Instructions for Use

Connects to multiple data sources and extracts the data into a feature dataset (FeatureRDD) in preparation for subsequent analysis. SuperMap provides a wealth of spatial analysis tools for FeatureRDD; for details, see "Big Data Vector Analysis Tools".

A feature dataset (FeatureRDD) is the basic data model used by iObjects for Spark and is the entry point for data reading, storage, and analysis. The tool can connect to multiple data sources; it currently supports:

  • File data sources: UDB/UDBX, SimpleJson, CSV, GDB, ShapeFile, DSF.
  • Database data sources: Dameng, HUAWEI CLOUD PostgreSQL, Hangao, Shenzhou GM, Yugong, Kongtian, Renda Jincang, Oracle, PostGIS, PostgreSQL, SQLSpatial, SQLPlus, MySQL, MongoDB, ElasticSearch, ArcSDE for Oracle.

The tool also supports GeoSQL while reading data: by writing simple ECQL statements for filtering and querying, it reads only the data that is needed, reducing the computing load on worker nodes.
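For example, a single ECQL statement can combine attribute and spatial predicates (the field names below come from the examples later on this page and are purely illustrative):

DLMC IN ('forest land', 'orchard') AND BBOX(the_geom, 120, 30, 121, 31) AND SmID < 100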

Parameter Description

parameter name default value parameter definition parameter type
Connection information   The connection information used to access the data source; it must include the data source type, connection parameters, dataset name, and other information. Set it with the '--key=value' method; multiple key-value pairs are separated by spaces. For example:
1. Connect DSF data (HDFS directory):
--providerType=dsf --path=hdfs://127.0.0.1:9000/data/vector/DLTB
or:
dsf --path=hdfs://127.0.0.1:9000/data/vector/DLTB
2. Connect DSF data (local directory):
Linux system:
--providerType=dsf --path=file:////home/data/vector/Zoo_pt or dsf --path=file:////home/data/vector/Zoo_pt
Windows system:
--providerType=dsf --path=file:///E:/data/vectordata/Zoo_pt or dsf --path=file:///E:/data/vectordata/Zoo_pt
3. Connect GDB data:
--providerType=gdb --path=file:///F:/data/landuse2k/GDB/landuse.gdb --table=DLTB
4. Connect to OracleSpatial data source:
--providerType=jdbc:
jdbc --host=127.0.0.1 --port=1521 --schema=testosp --database=orcl --user=testosp --password=testosp --dbtype=oracle --table=SMDTV (Note that the dataset name is sometimes different from the dataset table name; the table name is what must be filled in here.)
--providerType=sdx:
--providerType=sdx --server=127.0.0.1:1521/orcl --user=testosp --password=testosp --maxConnPoolNum=1 --dataset=SMDTV --dbType=ORACLESPATIAL
5. Connect to PostGIS data:
--providerType=jdbc:
--providerType=jdbc --host=127.0.0.1 --port=5432 --schema=postgres --database=uitest --user=postgres --password=uitest --dbtype=postgis --dataset=DLTB
--providerType=sdx:
--providerType=sdx --server=127.0.0.1 --database=postgis --user=postgres --password=uitest --maxConnPoolNum=10 --dataset=DLTB --dbType=PGGIS
6. Connect UDB/UDBX data:
sdx --server=F:\data\landuse2k\UDB\landuse.udb --dataset=DLTB --dbType=udb
7. Connect ShapeFile data:
shape-file --path=file:///F:/data/landuse2k/shp
8. Connect Elasticsearch data:
--providerType=elastic --index=test --table=test --nodes=localhost --port=9200
9. Connect to ArcSDE for Oracle data:
--providerType=sdx --server=127.0.0.1:3452/xe --user=testdb --password=testdb --alias=test --maxConnPoolNum=1 --dataset=dt --dbType=ARCSDE_ORACLE
String
Data query conditions
(Optional)
  Data query conditions support attribute conditions and spatial-relationship queries, for example: SmID < 100, BBOX(the_geom, 120, 30, 121, 31), DLMC IN ('forest land', 'orchard', 'woodland'). String

Output Result

The output of the Read Vector Data tool is a feature dataset (FeatureRDD).

Detailed Description of Data Source Connection Information

--providerType=dsf

parameter name parameter definition
--path, -P, --url DSF directory address; a protocol prefix such as file:// or hdfs:// is required
--bounds, -B The geographic bounds for a range query; ignored when as-dsf-rdd is true
--fields, --result-fields The names of the fields to read
--as-dsf-rdd Whether to read the data as a DSFFeatureRDD; the default is false. If true, the data is read as a DSFFeatureRDD, and when performing spatial operations or spatial judgments on two datasets, the partition indexes of the two DSFFeatureRDDs must be consistent. Otherwise, set it to false to read the data as a FeatureRDD, in which case a unified partition index is built when performing spatial operations or judgments on two datasets.
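For reference, a DSF connection string combining these parameters might look like the following; the field names and the boolean literal are illustrative assumptions, and --fields is assumed to take a comma-separated list:

--providerType=dsf --path=hdfs://127.0.0.1:9000/data/vector/DLTB --fields=SmID,DLMC --as-dsf-rdd=true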

--providerType=csv

parameter name parameter definition
--path, -P, --url
(required)
CSV file address; the file:// prefix is required
--fields, -F The fields to import, separated by commas. When the CSV has a header row, pass the field names and set firstRowIsField to true; when there is no header, pass col0, col1, etc. in column order.
--geofields, --geo-fields, -GF The spatial-information fields to import, separated by commas. Pass one column for line or area data and two columns for point data; any additional columns are ignored. When the CSV has a header row, pass the field names and set firstRowIsField to true; when there is no header, pass col0, col1, etc. in column order.
--firstRowIsField, --first-as-field, -FAF Whether the first row is a header row whose values are used as the field names
--idField, --id-field, -IF Specifies the ID field name; if not specified, a UUID is used by default
--modifyDesFieldTypes, --des-field-type, -DFT Specifies the types of individual fields; if not specified, fields are read as strings by default. The input format is 'field name->field type,field name->field type'
--timeZoneOffset, --time-Zone-Offset, -TZO When a field type specified in modifyDesFieldTypes is date, the value is treated as UTC time by default; this parameter applies a time-zone offset. The format is "+/-hh:mm:ss". For example, to convert to Beijing time (UTC+8): --timeZoneOffset=+08:00:00.
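A sketch of a CSV connection string that uses these parameters (the file path, field names, and date field are hypothetical):

--providerType=csv --path=file:///home/data/stations.csv --fields=name,x,y,recordTime --geofields=x,y --firstRowIsField=true --modifyDesFieldTypes=recordTime->date --timeZoneOffset=+08:00:00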

--providerType=gdb

parameter name parameter definition
--path, -P, --url
(required)
GDB folder address; the file:// prefix is required
--dataset, --table
(required)
The name of the dataset to be read in GDB
--minPartitions, --min-part, -M The minimum number of Spark RDD partitions; the default is 0
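For reference, a GDB connection string that also sets the minimum number of partitions might look like this (the --minPartitions value is illustrative):

--providerType=gdb --path=file:///F:/data/landuse2k/GDB/landuse.gdb --table=DLTB --minPartitions=4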

--providerType=jdbc

parameter name parameter definition
--dbtype, --type, -T, --db-type
(required)
Database type; PostGIS and OracleSpatial are supported, with distributed reading and writing in cluster mode
--dataset, --table
(required)
The name of the table to be read
--host
(required)
Database service address
--port
(required)
Database service port number; the default is 0
--database, --db The database name of the data source connection
--schema The database schema to access
--user, -U User name for the database connection
--password, --pwd Password to log in to the database
--numPartitions, --num-parts, -N The number of partitions; the default is 0
--partitionField, --part-field The name of the field used for partitioning
--predicates An array of partition conditions
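As a sketch, a JDBC connection string with partitioned reading enabled might look like the following; the partition count and partition field are illustrative assumptions:

--providerType=jdbc --host=127.0.0.1 --port=5432 --schema=postgres --database=uitest --user=postgres --password=uitest --dbtype=postgis --table=DLTB --numPartitions=4 --partitionField=SmID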

--providerType=sdx

parameter name parameter definition
--server, -S
(required)
Database engine service address or file path
--dbType, --type, -T, --db-type
(required)
Database engine type; supported engines include udb/udbx, SQLPlus, MongoDB, Dameng, Yugong, HUAWEI CLOUD PostgreSQL, etc.
--dataset, --dt
(required)
The name of the dataset to be read
--driver The name of the driver required for the data source connection
--database, --db The database name of the data source connection
--alias, -A Data source alias
--user, -U User name for the database connection
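For reference, an sdx connection string for a PostGIS engine (mirroring example 5 in the Parameter Description above) looks like this:

--providerType=sdx --server=127.0.0.1 --database=postgis --user=postgres --password=uitest --maxConnPoolNum=10 --dataset=DLTB --dbType=PGGIS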