Text Prompt Detection

Feature Description

This operator uses the Grounding DINO model for open-vocabulary object detection. The model does not need to be trained on a fixed set of object categories: any object that can be described in text can potentially be detected. Given an English text prompt (words or sentences) from the user, the operator detects the objects in the image that match the description and outputs their bounding boxes, categories, and confidence scores.

Use Cases

  • Open-Vocabulary Detection: Detect new object categories not included in the training set, as long as they can be described in English. For example, "the red apple on the table" or "person wearing a hat".

  • Fine-Grained Detection: Distinguish objects based on more specific descriptions. For example, distinguishing between "cat" and "dog", or "sedan" and "truck".

  • Specific Attribute Detection: Detect objects with specific attributes. For example, "the open door" or "a blue chair".

Inputs and Outputs

Input Item

Image: The color image in which objects are to be detected (must be in RGB format). Currently, only a single image is supported as input.

Output Item

Detection results: A list of detection results, one per detected instance. Each result contains the bounding box (usually a horizontal, axis-aligned box), the confidence score, the category ID (assigned according to the "Category mapping" parameter), and the polygon contour (identical to the bounding box).
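
As a rough illustration of the result structure described above, the sketch below models one detected instance in Python. The class name, field names, and the (x_min, y_min, x_max, y_max) box layout are assumptions made for this example, not the operator's actual data types.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class Detection:
    """Hypothetical container for one detected instance."""

    # Assumed layout: (x_min, y_min, x_max, y_max) in pixels.
    box: Tuple[float, float, float, float]
    score: float
    category_id: int

    @property
    def polygon(self) -> List[Point]:
        # The contour is just the four corners of the horizontal box,
        # listed from the top-left corner in clockwise order.
        x0, y0, x1, y1 = self.box
        return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]


det = Detection(box=(10.0, 20.0, 110.0, 220.0), score=0.87, category_id=0)
print(det.polygon)
```

Because the contour is derived from the box, downstream steps that expect a polygon can treat these results the same way as results from segmentation-style operators.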

Parameter Descriptions

  • This operator relies on the GroundingDINO Python library. If it is not yet installed in your environment, install it from Qianyi’s PyPI source with pip install groundingdino.

  • This operator also requires the BERT language model. On the first run, if the environment can reach Qianyi Technology’s internal model server, no manual configuration is needed; the software downloads the model automatically. If the server cannot be reached, manually download bert-base-uncased.zip, extract it, and specify the extracted path in the initialization parameters. Once the model has been configured successfully, subsequent runs usually do not need this setting again.

  • English Prompts: Prompts must be in English.

  • Prompt Separator: Use an English period or comma to separate prompts for different detection targets.

  • Category Mapping Alignment: Ensure that the numerical IDs in "Category mapping" match the prompts in "Prompt phrase" one-to-one, in both number and order. Otherwise, categories may be assigned incorrectly or the default assignment may be used.

  • Input Image: Ensure the input is a color RGB image.

  • Single Image Processing: The current operator implementation only supports processing one image at a time.

Weight files

Parameter Description

Specifies the Grounding DINO model weight file (usually in .pth format) used for detection. A valid model file must be selected. Available models include Swin-T and Swin-B; the official models can be downloaded from specific paths.

Parameter Tuning Guide

Select a model that matches your task requirements and hardware capabilities. Swin-B models are larger and may be more accurate, but they are slower; Swin-T models are smaller and faster.

Enable GPU

Parameter Description

Selects whether to run model inference on the GPU. If checked, make sure the computer has an available NVIDIA graphics card and a matching CUDA environment. Off by default.

Parameter Tuning Guide

Checking this option can significantly improve processing speed. If no compatible GPU is available, or if GPU memory is insufficient, leave it unchecked.

BERT language model path

Parameter Description

Specifies the path to the local BERT language model folder.

Parameter Tuning Guide

  • First run: If you can access the internal model server, no additional configuration is needed; keep the default value.

  • If the internal model server cannot be reached: Manually download bert-base-uncased.zip from http://10.10.10.98:9000 or another source, extract it, and select the extracted folder in this parameter.

  • Subsequent runs: Once the model has been successfully loaded (either automatically or manually), this parameter usually does not need to be set again, unless the model files are moved or deleted.
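
When the folder is configured manually, a quick sanity check before pointing the parameter at it can catch an incomplete extraction. The sketch below is only an illustration: the helper name is hypothetical, and the list of expected files is an assumption based on the usual layout of the public bert-base-uncased model (config.json and vocab.txt), which may differ from the archive on the internal server.

```python
from pathlib import Path
from typing import List

# Assumption: files normally present in an extracted bert-base-uncased folder.
EXPECTED_FILES = ("config.json", "vocab.txt")


def missing_bert_files(folder: str) -> List[str]:
    """Return the expected files that are absent from the given folder."""
    root = Path(folder)
    if not root.is_dir():
        # A missing folder means every expected file is missing.
        return list(EXPECTED_FILES)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Hypothetical path used only for demonstration.
print(missing_bert_files("/path/that/does/not/exist"))
```

An empty list suggests the folder is plausibly complete; it does not prove that the operator will load the model successfully.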

Prompt phrase

Parameter Description

English words or sentences that guide detection. The model attempts to find objects in the image that match this description. This parameter is required.

Parameter Tuning Guide

Use clear, specific English descriptions. To detect multiple different object categories, you can separate different prompts/phrases with an English period or comma. For example: "chair . table . person" or "red car, blue bike". The quality of the prompts directly affects detection performance.
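
To make the separator rule concrete, the following sketch splits a prompt string into individual phrases on periods and commas. It only illustrates the convention described above, under the assumption that empty pieces and surrounding spaces are ignored; the function name is hypothetical and the operator's internal parsing may differ.

```python
import re
from typing import List


def split_prompts(prompt_phrase: str) -> List[str]:
    """Split a prompt string on '.' or ',' and drop empty pieces."""
    parts = re.split(r"[.,]", prompt_phrase)
    return [p.strip() for p in parts if p.strip()]


print(split_prompts("chair . table . person"))  # ['chair', 'table', 'person']
print(split_prompts("red car, blue bike"))      # ['red car', 'blue bike']
```

Note that a period or comma inside a single description would also split it under this rule, so each phrase should avoid internal punctuation.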

Category mapping

Parameter Description

Maps the objects detected for each prompt to numerical category IDs that you specify (integers starting from 0). This is useful for subsequent filtering or processing by category.

Parameter Tuning Guide

  • Input a string of numbers separated by English periods or commas. These numbers will sequentially correspond to the individual prompts separated by delimiters in the "Prompt phrase". For example, if the prompt is "cat . dog . bird", the category mapping could be "0 . 1 . 2", so detected cats will be labeled as category 0, dogs as category 1, and birds as category 2.

  • If this parameter is empty, or if fewer mappings are provided than there are prompts, the operator automatically assigns category IDs 0, 1, 2, ... in order. To keep your own IDs, provide at least as many mappings as there are prompts.
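
The mapping and fallback rules above can be sketched as follows. This is a minimal illustration, not the operator's code: the function names are hypothetical, and the handling of surplus IDs (ignored here) is an assumption, since the text only states what happens when too few are given.

```python
import re
from typing import List


def _split(text: str) -> List[str]:
    """Split on '.' or ',' and drop empty pieces."""
    return [p.strip() for p in re.split(r"[.,]", text) if p.strip()]


def resolve_category_ids(prompt_phrase: str, category_mapping: str) -> List[int]:
    """Return one category ID per prompt, in prompt order."""
    prompts = _split(prompt_phrase)
    ids = _split(category_mapping)
    if len(ids) < len(prompts):
        # Empty or too-short mapping: fall back to 0, 1, 2, ...
        return list(range(len(prompts)))
    # Assumption: IDs beyond the number of prompts are ignored.
    return [int(i) for i in ids[: len(prompts)]]


print(resolve_category_ids("cat . dog . bird", "0 . 1 . 2"))  # [0, 1, 2]
print(resolve_category_ids("cat . dog . bird", "5, 7, 9"))    # [5, 7, 9]
print(resolve_category_ids("cat . dog . bird", "5, 7"))       # [0, 1, 2]
print(resolve_category_ids("cat . dog . bird", ""))           # [0, 1, 2]
```

The third call shows why a short mapping is easy to miss: the custom IDs 5 and 7 are discarded entirely rather than applied to the first two prompts.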

Test box confidence threshold

Parameter Description

Bounding box confidence score threshold for filtering detection results. Only detection boxes with scores higher than this threshold will be retained.

Parameter Tuning Guide

Increasing this value results in fewer detections, retaining only highly confident targets and reducing false positives. Decreasing it yields more detections, potentially including less confident targets but also increasing false positives. Adjust based on actual performance.

Parameter Range

[0, 1], Default value: 0.3
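
The effect of this threshold can be seen in a small sketch that filters a list of candidate boxes at several values. The candidate data and the function name are made up for illustration; the strict "greater than" comparison follows the description above.

```python
from typing import Dict, List


def filter_by_box_score(detections: List[Dict], box_threshold: float = 0.3) -> List[Dict]:
    """Keep only detections whose score is strictly above the threshold."""
    return [d for d in detections if d["score"] > box_threshold]


# Made-up candidates for demonstration only.
candidates = [
    {"label": "chair", "score": 0.82},
    {"label": "chair", "score": 0.45},
    {"label": "table", "score": 0.30},
    {"label": "table", "score": 0.12},
]

for threshold in (0.1, 0.3, 0.5):
    kept = filter_by_box_score(candidates, threshold)
    print(threshold, len(kept))
# 0.1 4
# 0.3 2
# 0.5 1
```

At the default of 0.3, the candidate scoring exactly 0.30 is dropped because the comparison is strict.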

Detection category confidence threshold

Parameter Description

Text-image matching confidence score threshold for filtering detection results. It measures how well the detected region matches the corresponding text prompt.

Parameter Tuning Guide

Increasing this value requires a higher semantic match between the detection result and the text prompt, helping to filter out results that may have high bounding box confidence but are less relevant to the prompt. Decreasing this value allows results with slightly lower matching scores to pass.

Parameter Range

[0, 1], Default value: 0.25
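
The two thresholds can be thought of as independent gates, as in the sketch below. This is a simplified reading of the descriptions in this document: the function name and the score values are hypothetical, and both the exact way the operator combines the scores and the use of a strict comparison for the matching score are assumptions.

```python
def passes_thresholds(
    box_score: float,
    text_score: float,
    box_threshold: float = 0.3,
    text_threshold: float = 0.25,
) -> bool:
    """Return True if a candidate clears both confidence gates."""
    # Assumption: both comparisons are strict, mirroring the box threshold rule.
    return box_score > box_threshold and text_score > text_threshold


# Confident box and good match with the prompt: kept.
print(passes_thresholds(0.60, 0.40))  # True
# Confident box but weak match with the prompt: filtered out.
print(passes_thresholds(0.60, 0.20))  # False
# Strong match but low box confidence: filtered out.
print(passes_thresholds(0.20, 0.90))  # False
```

The second case is the one this parameter targets: a region the model localizes confidently but that has little to do with the prompt.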