Read Vector Data

Instructions for Use

Connects to multiple data sources and extracts the data into a feature dataset (FeatureRDD) in preparation for subsequent analysis. SuperMap provides a wealth of spatial analysis tools for FeatureRDD; for details, see "Big Data Vector Analysis Tools".

A feature dataset (FeatureRDD) is the basic data model used by iObjects for Spark and is the entry point for data reading, storage, and analysis. The tool can connect to multiple data sources; it currently supports:

  • File data sources: UDB/UDBX, SimpleJson, CSV, GDB, ShapeFile, DSF.
  • Database data sources: Dameng, HUAWEI CLOUD PostgreSQL, Hangao, Shenzhou GM, Yugong, Kongtian, Renda Jincang, Oracle, PostGIS, PostgreSQL, SQLSpatial, SQLPlus, MySQL, MongoDB, ElasticSearch, ArcSDE for Oracle.

The tool also supports GeoSQL while reading data: by writing simple ECQL statements for filtering and querying, it reads only the data that is needed, reducing the computing load on worker nodes.
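For example, a single ECQL statement can combine attribute and spatial predicates (the field names below come from the examples later on this page and are purely illustrative):

DLMC IN ('forest land', 'orchard') AND BBOX(the_geom, 120, 30, 121, 31) AND SmID < 100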

Parameter Description

parameter name default value parameter definition parameter type
Connection information   The connection information used to access the data source; it must include the data source type, connection parameters, dataset name, and other information. Set it with the '--key=value' method; multiple key-value pairs are separated by spaces. For example:
1. Connect DSF data (HDFS directory):
--providerType=dsf --path=hdfs://127.0.0.1:9000/data/vector/DLTB
or:
dsf --path=hdfs://127.0.0.1:9000/data/vector/DLTB
2. Connect DSF data (local directory):
Linux system:
--providerType=dsf --path=file:////home/data/vector/Zoo_pt or dsf --path=file:////home/data/vector/Zoo_pt
Windows system:
--providerType=dsf --path=file:///E:/data/vectordata/Zoo_pt or dsf --path=file:///E:/data/vectordata/Zoo_pt
3. Connect GDB data:
--providerType=gdb --path=file:///F:/data/landuse2k/GDB/landuse.gdb --table=DLTB
4. Connect to OracleSpatial data source:
--providerType=jdbc:
jdbc --host=127.0.0.1 --port=1521 --schema=testosp --database=orcl --user=testosp --password=testosp --dbtype=oracle --table=SMDTV (Note that the dataset name is sometimes different from the dataset table name; the table name is what must be filled in here.)
--providerType=sdx:
--providerType=sdx --server=127.0.0.1:1521/orcl --user=testosp --password=testosp --maxConnPoolNum=1 --dataset=SMDTV --dbType=ORACLESPATIAL
5. Connect to PostGIS data:
--providerType=jdbc:
--providerType=jdbc --host=127.0.0.1 --port=5432 --schema=postgres --database=uitest --user=postgres --password=uitest --dbtype=postgis --dataset=DLTB
--providerType=sdx:
--providerType=sdx --server=127.0.0.1 --database=postgis --user=postgres --password=uitest --maxConnPoolNum=10 --dataset=DLTB --dbType=PGGIS
6. Connect UDB/UDBX data:
sdx --server=F:\data\landuse2k\UDB\landuse.udb --dataset=DLTB --dbType=udb
7. Connect ShapeFile data:
shape-file --path=file:///F:/data/landuse2k/shp
8. Connect Elasticsearch data:
--providerType=elastic --index=test --table=test --nodes=localhost --port=9200
9. Connect to ArcSDE for Oracle data:
--providerType=sdx --server=127.0.0.1:3452/xe --user=testdb --password=testdb --alias=test --maxConnPoolNum=1 --dataset=dt --dbType=ARCSDE_ORACLE
String
Data query conditions
(Optional)
  Data query conditions support attribute conditions and spatial-relationship queries, for example: SmID < 100, BBOX(the_geom, 120, 30, 121, 31), DLMC IN ('forest land', 'orchard', 'woodland'). String

Output Result

The output of the Read Vector Data tool is a feature dataset (FeatureRDD).

Detailed Description of Data Source Connection Information

--providerType=dsf

parameter name parameter definition
--path, -P, --url DSF directory address; a protocol prefix such as file:// or hdfs:// is required
--bounds, -B The geographic bounds for a range query; ignored when as-dsf-rdd is true
--fields, --result-fields The names of the fields to read
--as-dsf-rdd Whether to read the data as a DSFFeatureRDD; the default is false. If true, the data is read as a DSFFeatureRDD, and when performing spatial operations or spatial judgments on two datasets, the partition indexes of the two DSFFeatureRDDs must be consistent. Otherwise, set it to false to read the data as a FeatureRDD, in which case a unified partition index is built when performing spatial operations or judgments on two datasets.
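For reference, a DSF connection string combining these parameters might look like the following; the field names and the boolean literal are illustrative assumptions, and --fields is assumed to take a comma-separated list:

--providerType=dsf --path=hdfs://127.0.0.1:9000/data/vector/DLTB --fields=SmID,DLMC --as-dsf-rdd=true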

--providerType=csv

parameter name parameter definition
--path, -P, --url
(required)
CSV file address; the file:// prefix is required
--fields, -F The fields to import, separated by commas. When the CSV has a header row, pass the field names and set firstRowIsField to true; when there is no header, pass col0, col1, etc. in column order.
--geofields, --geo-fields, -GF The spatial-information fields to import, separated by commas. Pass one column for line or area data and two columns for point data; any additional columns are ignored. When the CSV has a header row, pass the field names and set firstRowIsField to true; when there is no header, pass col0, col1, etc. in column order.
--firstRowIsField, --first-as-field, -FAF Whether the first row is a header row whose values are used as the field names
--idField, --id-field, -IF Specifies the ID field name; if not specified, a UUID is used by default
--modifyDesFieldTypes, --des-field-type, -DFT Specifies the types of individual fields; if not specified, fields are read as strings by default. The input format is 'field name->field type,field name->field type'
--timeZoneOffset, --time-Zone-Offset, -TZO When a field type specified in modifyDesFieldTypes is date, the value is treated as UTC time by default; this parameter applies a time-zone offset. The format is "+/-hh:mm:ss". For example, to convert to Beijing time (UTC+8): --timeZoneOffset=+08:00:00.
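A sketch of a CSV connection string that uses these parameters (the file path, field names, and date field are hypothetical):

--providerType=csv --path=file:///home/data/stations.csv --fields=name,x,y,recordTime --geofields=x,y --firstRowIsField=true --modifyDesFieldTypes=recordTime->date --timeZoneOffset=+08:00:00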

--providerType=gdb

parameter name parameter definition
--path, -P, --url
(required)
GDB folder address; the file:// prefix is required
--dataset, --table
(required)
The name of the dataset to be read in GDB
--minPartitions, --min-part, -M The minimum number of Spark RDD partitions; the default is 0
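For reference, a GDB connection string that also sets the minimum number of partitions might look like this (the --minPartitions value is illustrative):

--providerType=gdb --path=file:///F:/data/landuse2k/GDB/landuse.gdb --table=DLTB --minPartitions=4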

--providerType=jdbc

parameter name parameter definition
--dbtype, --type, -T, --db-type
(required)
Database type; PostGIS and OracleSpatial are supported, with distributed reading and writing in cluster mode
--dataset, --table
(required)
The name of the table to be read
--host
(required)
Database service address
--port
(required)
Database service port number; the default is 0
--database, --db The database name of the data source connection
--schema The database schema to access
--user, -U User name for the database connection
--password, --pwd Password to log in to the database
--numPartitions, --num-parts, -N The number of partitions; the default is 0
--partitionField, --part-field The name of the field used for partitioning
--predicates An array of partition conditions
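As a sketch, a JDBC connection string with partitioned reading enabled might look like the following; the partition count and partition field are illustrative assumptions:

--providerType=jdbc --host=127.0.0.1 --port=5432 --schema=postgres --database=uitest --user=postgres --password=uitest --dbtype=postgis --table=DLTB --numPartitions=4 --partitionField=SmID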

--providerType=sdx

parameter name parameter definition
--server, -S
(required)
Database engine service address or file path
--dbType, --type, -T, --db-type
(required)
Database engine type; supported engines include udb/udbx, SQLPlus, MongoDB, Dameng, Yugong, HUAWEI CLOUD PostgreSQL, etc.
--dataset, --dt
(required)
The name of the dataset to be read
--driver The name of the driver required for the data source connection
--database, --db The database name of the data source connection
--alias, -A Data source alias
--user, -U User name for the database connection
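For reference, an sdx connection string for a PostGIS engine (mirroring example 5 in the Parameter Description above) looks like this:

--providerType=sdx --server=127.0.0.1 --database=postgis --user=postgres --password=uitest --maxConnPoolNum=10 --dataset=DLTB --dbType=PGGIS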