Bigdata Tools Partition Parameter Description

When using Bigdata Tools in GPA tools and employing distributed read/write operations, you may need to set partition parameters in the "Advance Settings" of the tool. These include Partition Field, Partition Count, ID Field, Partition Condition Array, etc. The following section describes these parameters.

When these partition parameters are not set in the tool's "Advance Settings", the primary key field of the dataset is used as the partition field by default, and the partition count equals the number of CPU cores.

Parameter Description

Parameter Name Parameter Description
Partition Field Specify a field as the partition field. This field must be numeric (integer or float).
Partition Count Specify the number of partitions, which is related to the computing resources of the cluster. For reference: (1) Based on the CPU core count of the cluster, the partition count is usually twice the number of cores; (2) The data volume within each partition should be controlled between 20,000 and 50,000.
ID Field Specify a field from the source dataset to serve as the unique identifier field for the FeatureRDD. The values in this field must be unique.
Partition Condition Array You can set partition conditions to customize partitions. For details, see "Partition Condition Array Input Description" below.

Partition Condition Array Input Description

  • The Partition Condition Array must ensure that all field values are included within the partition ranges to avoid data loss. The ranges should be set based on the principle of even data distribution.
  • Multiple partition ranges are connected by commas (,). The upper and lower bounds of a single partition are connected by "and".
  • When you set a "Partition Field", you can customize the range intervals in the "Partition Condition Array". For example, if the "Partition Field" is set to an ID Field that increments sequentially from 1 to 100, you can divide the ID Field into three partitions in the "Partition Condition Array": [1, 30), [30, 60], [60, 100].
  • When you have not set a "Partition Field", you can directly input partition conditions in the "Partition Condition Array". Here, any field can be used as a partition condition, but it must satisfy the following:
    1. The data type and field values should be easily definable into ranges and comparable, such as numeric or date-type fields, which are suitable for use as partition condition fields.
    2. The partition field values do not necessarily need to be unique, but should be distributed as evenly as possible to ensure load balance across partitions.
    3. The selection of the partition field should align with business logic and query patterns. For example, if queries or filtering operations are frequently performed based on a specific field, setting that field as the partition field can improve query efficiency.

Example: As shown in the figure below, the primary key is a varchar-type UUID. Through analysis, a date-type field "created_at" is selected as the partition field. Partition one range: ('2024-10-29 09:10:25.000', '2024-10-29 09:10:28.000']; Partition two range: ['2024-10-29 09:10:28.000', '2024-10-29 09:10:47.447']. These two partition ranges cover all data records.

The Partition Condition Array can be entered as: created_at >= '2024/10/29 09:20:25.000' and created_at < '2024/10/29 09:20:28.000', created_at >= '2024/10/29 09:20:28.000' and created_at <= '2024/10/29 09:20:47.447'; or entered as: created_at < '2024/10/29 09:20:28.000', created_at >= '2024/10/29 09:20:28.000'.