Read Vector Data
Instructions for Use
Connects to multiple data sources and extracts their data into feature datasets (FeatureRDD) in preparation for subsequent analysis. SuperMap provides a rich set of spatial analysis tools for FeatureRDD; for details, see "Big Data Vector Analysis Tools".
A feature dataset (FeatureRDD) is the basic data model used by iObjects for Spark and the entry point for data reading, storage, and analysis. The tool can connect to many kinds of data sources; the following are currently supported:
- File data sources: UDB/UDBX, SimpleJson, CSV, GDB, ShapeFile, DSF.
- Database data sources: Dameng, HUAWEI CLOUD PostgreSQL, Hangao, Shenzhou GM, Yugong, Kongtian, Renda Jincang, Oracle, PostGIS, PostgreSQL, SQLSpatial, SQLPlus, MySQL, MongoDB, ElasticSearch, ArcSDE for Oracle.
While reading data, the tool also supports GeoSQL: by writing simple ECQL statements to filter and query the data, it can read data on demand and reduce the computing load on worker nodes.
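For illustration, here are a few ECQL filters of the kind accepted by the data query condition parameter described below. The field names (SmID, the_geom, DLMC) are taken from this page's own examples; individual conditions can be combined with the standard ECQL AND/OR operators:

```
SmID < 100
BBOX(the_geom, 120, 30, 121, 31)
DLMC IN ('forest land', 'Orchard', 'Woodland')
SmID < 100 AND BBOX(the_geom, 120, 30, 121, 31)
```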
Parameter Description
parameter name | default value | parameter definition | parameter type |
---|---|---|---|
Connection information | | The connection information for accessing the data source; it must include the data source type, connection parameters, dataset name, and similar information. Set it in the form '--key=value', with multiple parameters separated by spaces. For example:<br>1. Connect DSF data (HDFS directory): `--providerType=dsf --path=hdfs://127.0.0.1:9000/data/vector/DLTB` or `dsf --path=hdfs://127.0.0.1:9000/data/vector/DLTB`<br>2. Connect DSF data (local directory): Linux: `--providerType=dsf --path=file:////home/data/vector/Zoo_pt` or `dsf --path=file:////home/data/vector/Zoo_pt`; Windows: `--providerType=dsf --path=file:///E:/data/vectordata/Zoo_pt` or `dsf --path=file:///E:/data/vectordata/Zoo_pt`<br>3. Connect GDB data: `--providerType=gdb --path=file:///F:/data/landuse2k/GDB/landuse.gdb --table=DLTB`<br>4. Connect an OracleSpatial data source: via jdbc: `jdbc --host=127.0.0.1 --port=1521 --schema=testosp --database=orcl --user=testosp --password=testosp --dbtype=oracle --table=SMDTV` (note that the dataset name is sometimes inconsistent with the dataset table name; the dataset table name must be used here); via sdx: `--providerType=sdx --server=127.0.0.1:1521/orcl --user=testosp --password=testosp --maxConnPoolNum=1 --dataset=SMDTV --dbType=ORACLESPATIAL`<br>5. Connect PostGIS data: via jdbc: `--providerType=jdbc --host=127.0.0.1 --port=5432 --schema=postgres --database=uitest --user=postgres --password=uitest --dbtype=postgis --dataset=DLTB`; via sdx: `--providerType=sdx --server=127.0.0.1 --database=postgis --user=postgres --password=uitest --maxConnPoolNum=10 --dataset=DLTB --dbType=PGGIS`<br>6. Connect UDB/UDBX data: `sdx --server=F:\data\landuse2k\UDB\landuse.udb --dataset=DLTB --dbType=udb`<br>7. Connect ShapeFile data: `shape-file --path=file:///F:/data/landuse2k/shp`<br>8. Connect Elasticsearch data: `--providerType=elastic --index=test --table=test --nodes=localhost --port=9200`<br>9. Connect ArcSDE for Oracle data: `--providerType=sdx --server=127.0.0.1:3452/xe --user=testdb --password=testdb --alias=test --maxConnPoolNum=1 --dataset=dt --dbType=ARCSDE_ORACLE` | String |
Data query conditions (optional) | | Data query conditions support attribute conditions and spatial relationship queries, for example: `SmID < 100`, `BBOX(the_geom, 120, 30, 121, 31)`, `DLMC IN ('forest land', 'Orchard', 'Woodland')`. | String |
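The connection information above is simply a string of space-separated `--key=value` pairs. As a minimal, self-contained Scala sketch of that convention (illustrative helper code only, not the iObjects for Spark API), the snippet below splits such a string into a parameter map:

```scala
// Minimal sketch of the "--key=value" connection-string convention described above.
// Illustrative helper code only; not part of the iObjects for Spark API.
object ConnInfo {
  /** Split space-separated "--key=value" pairs into a Map. */
  def parse(connInfo: String): Map[String, String] =
    connInfo.trim
      .split("\\s+").toSeq
      .filter(_.startsWith("--"))        // the shorthand form (a bare leading token such as "dsf") is skipped
      .flatMap { token =>
        token.stripPrefix("--").split("=", 2) match {
          case Array(key, value) => Some(key -> value)
          case _                 => None // ignore tokens without '='
        }
      }
      .toMap

  def main(args: Array[String]): Unit = {
    val info = parse("--providerType=gdb --path=file:///F:/data/landuse2k/GDB/landuse.gdb --table=DLTB")
    println(info("providerType")) // gdb
    println(info("table"))        // DLTB
  }
}
```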
Output Result
The output of the Read Vector Data tool is a feature dataset (FeatureRDD).
Detailed Description of Data Source Connection Information
--providerType=dsf
parameter name | parameter definition |
---|---|
--path, -P, --url | DSF directory address; the scheme prefix must be included (file:// for local paths, hdfs:// for HDFS) |
--bounds, -B | The query bounds for a geographic range query; ignored when as-dsf-rdd is true |
--fields, --result-fields | Names of the fields to read |
--as-dsf-rdd | Whether to read the data as a DSFFeatureRDD; the default is false. If true, the data is read as a DSFFeatureRDD; when performing spatial operations or spatial predicates on two datasets, the partition indexes of the two DSFFeatureRDDs must be consistent. If false, the data is read as a FeatureRDD, and a partition index is built uniformly when spatial operations or predicates are performed on two datasets |
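A hypothetical connection string combining the parameters above (the path, the field names, and the `--as-dsf-rdd=true` value syntax are assumptions for illustration):

```
dsf --path=hdfs://127.0.0.1:9000/data/vector/DLTB --fields=SmID,DLMC --as-dsf-rdd=true
```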
--providerType=csv
parameter name | parameter definition |
---|---|
--path, -P, --url (required) | CSV file address; the file:// prefix must be included |
--fields, -F | The fields to import, separated by commas. When the CSV has a header row, pass in the field names and set firstRowIsField to true; when there is no header, pass in col0, col1, etc. |
--geofields, --geo-fields, -GF | The spatial information fields to import, separated by commas. Pass in one column for line or area data and two columns for point data; additional columns are ignored. When the CSV has a header row, pass in the field names and set firstRowIsField to true; when there is no header, pass in col0, col1, etc. in column order |
--firstRowIsField, --first-as-field, -FAF | Whether the first row is a header row whose values are used as the field names |
--idField, --id-field, -IF | The ID field name; if not specified, a UUID is used by default |
--modifyDesFieldTypes, --des-field-type, -DFT | The type to use for a given field; if not specified, fields are read as strings by default. The input format is 'field name->field type,field name->field type' |
--timeZoneOffset, --time-Zone-Offset, -TZO | When a field type specified in modifyDesFieldTypes is date, the time is treated as UTC by default; this parameter applies a time-zone offset. The format is "+/-hh:mm:ss". For example, to convert to Beijing time: --timeZoneOffset=+08:00:00, i.e. UTC+8 |
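A hypothetical CSV connection string built from the parameters above (the file path and column names are placeholders):

```
--providerType=csv --path=file:///F:/data/taxi.csv --firstRowIsField=true --fields=X,Y,pickup_time --geofields=X,Y --modifyDesFieldTypes=pickup_time->date --timeZoneOffset=+08:00:00
```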
--providerType=gdb
parameter name | parameter definition |
---|---|
--path, -P, --url (required) | GDB folder address; the file:// prefix must be included |
--dataset, --table (required) | The name of the dataset in the GDB to read |
--minPartitions, --min-part, -M | The minimum number of Spark RDD partitions; default 0 |
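For example, the page's GDB connection string extended with an explicit minimum partition count (the value 4 is a placeholder):

```
--providerType=gdb --path=file:///F:/data/landuse2k/GDB/landuse.gdb --table=DLTB --minPartitions=4
```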
--providerType=jdbc
parameter name | parameter definition |
---|---|
--dbtype, --type, -T, --db-type (required) | Database type; PostGIS and OracleSpatial are supported, and both allow distributed reading and writing in cluster mode |
--dataset, --table (required) | The name of the table to read |
--host (required) | Database service address |
--port (required) | Database service port number; default 0 |
--database, --db | The database name of the data source connection |
--schema | The database schema to access |
--user, -U | The user name for the database connection |
--password, --pwd | The password for the database connection |
--numPartitions, --num-parts, -N | The number of partitions; default 0 |
--partitionField, --part-field | The field name used for partitioning |
--predicates | An array of partition conditions |
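For example, the page's PostGIS JDBC connection string extended with the partitioning parameters above (the partition count and field are placeholders):

```
--providerType=jdbc --dbtype=postgis --host=127.0.0.1 --port=5432 --schema=postgres --database=uitest --user=postgres --password=uitest --dataset=DLTB --numPartitions=4 --partitionField=SmID
```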
--providerType=sdx
parameter name | parameter definition |
---|---|
--server, -S (required) | Database engine service address or file path |
--dbType, --type, -T, --db-type (required) | Database engine type; supported engine types include UDB/UDBX, SQLPlus, MongoDB, Dameng, Yugong, HUAWEI CLOUD PostgreSQL, etc. |
--dataset, --dt (required) | The name of the dataset to read |
--driver | The name of the driver required for the data source connection |
--database, --db | The database name of the data source connection |
--alias, -A | Datasource alias |
--user, -U | The user name for the data source connection |