Text Prompt Detection
Feature Description
This operator uses the Grounding DINO model for open-vocabulary object detection. The model is not restricted to a fixed set of trained object categories: any object that can be described in text can potentially be detected. Given an English text prompt (words or sentences), the operator detects the objects in the image that match the description and outputs their bounding boxes, categories, and confidence scores.
Use Cases
- Open-Vocabulary Detection: Detect new object categories not included in the training set, as long as they can be described in English. For example, "the red apple on the table" or "person wearing a hat".
- Fine-Grained Detection: Distinguish objects based on more specific descriptions. For example, distinguishing between "cat" and "dog", or "sedan" and "truck".
- Specific Attribute Detection: Detect objects with specific attributes. For example, "the open door" or "a blue chair".
Inputs and Outputs
Input Item |
Image: The color image to run detection on (must be in RGB format). Currently, only a single image is supported as input. |
Output Item |
Detection results: A list of detection results, one per detected instance. Each result contains the bounding box (usually an axis-aligned box), the confidence score, the assigned category ID (based on the Category mapping parameter), and the corresponding polygon contour (identical to the bounding box). |
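As a rough illustration of the output described above, a single detection result could be modeled as follows. This is a minimal sketch: the class name `Detection`, its field names, and the `(x_min, y_min, x_max, y_max)` pixel layout are assumptions made for the example, not the operator's actual output schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    # Hypothetical field names; the operator's real output format may differ.
    box: Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)
    score: float                            # confidence score in [0, 1]
    category_id: int                        # ID taken from the category mapping

    @property
    def polygon(self) -> List[Tuple[float, float]]:
        """Return the four box corners, starting at the top-left corner."""
        x0, y0, x1, y1 = self.box
        return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]


det = Detection(box=(10.0, 20.0, 110.0, 220.0), score=0.82, category_id=0)
corners = det.polygon
```

Because the polygon contour is the same as the bounding box, it can be derived from the box on demand rather than stored separately.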
Parameter Descriptions
Weight files
Parameter Description |
Specifies the Grounding DINO model weight file (usually in .pth format) used for detection. A valid model file must be selected. Available models include Swin-T and Swin-B; the official models can be downloaded from specific paths. |
Parameter Tuning Guide |
Select a model that matches your task requirements and hardware capabilities. Swin-B models are larger and may be more accurate, but they run slower; Swin-T models are smaller and faster. |
Enable GPU
Parameter Description |
Selects whether to run model inference on the GPU. If checked, make sure the computer has an available NVIDIA graphics card and a matching CUDA environment. Defaults to off. |
Parameter Tuning Guide |
Checking this option can significantly improve processing speed. If no compatible GPU is available, or if GPU memory is insufficient, leave this option unchecked. |
BERT language model path
Parameter Description |
Specifies the path to the local BERT language model folder. |
Parameter Tuning Guide |
|
Prompt phrase
Parameter Description |
Input English words or sentences to guide detection. The model will attempt to find objects in the image that match this description. This parameter must be provided. |
Parameter Tuning Guide |
Use clear, specific English descriptions. To detect several different object categories, separate the prompts or phrases with an English (half-width) period or comma. For example: "chair . table . person" or "red car, blue bike". The quality of the prompts directly affects detection performance. |
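The separator convention can be sketched in a few lines. This is only an illustration of how a multi-category prompt breaks into phrases; the helper name `split_prompt` is hypothetical, and the operator's own parsing may differ.

```python
import re
from typing import List


def split_prompt(prompt: str) -> List[str]:
    """Split a prompt on periods or commas and return trimmed, non-empty phrases."""
    parts = re.split(r"[.,]", prompt)
    return [part.strip() for part in parts if part.strip()]


period_phrases = split_prompt("chair . table . person")
comma_phrases = split_prompt("red car, blue bike")
```

Splitting a prompt this way before running detection is a quick check that each phrase describes exactly one object category.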
Category mapping
Parameter Description |
Maps the objects detected for each prompt to numerical category IDs that you specify (integers starting from 0). This is useful for subsequent filtering or processing by category. |
Parameter Tuning Guide |
|
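One way to picture the category mapping is as a dictionary from prompt phrase to integer ID. The sketch below is an assumption-laden illustration: the helper names, the case-insensitive lookup, and the use of -1 for unmapped phrases are all choices made for the example, not documented behavior of the operator.

```python
from typing import Dict, List


def build_category_mapping(phrases: List[str]) -> Dict[str, int]:
    """Assign consecutive IDs starting from 0, in order of first appearance."""
    mapping: Dict[str, int] = {}
    for phrase in phrases:
        key = phrase.strip().lower()
        if key and key not in mapping:
            mapping[key] = len(mapping)
    return mapping


def category_id_for(label: str, mapping: Dict[str, int], default: int = -1) -> int:
    """Look up a detected phrase; return `default` (an assumption) if it is unmapped."""
    return mapping.get(label.strip().lower(), default)


mapping = build_category_mapping(["chair", "table", "person", "chair"])
table_id = category_id_for("Table", mapping)
unknown_id = category_id_for("lamp", mapping)
```

Downstream steps can then filter by integer ID instead of comparing prompt strings.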
Test box confidence threshold
Parameter Description |
Bounding box confidence score threshold for filtering detection results. Only detection boxes with scores higher than this threshold will be retained. |
Parameter Tuning Guide |
Increasing this value results in fewer detections, retaining only highly confident targets and reducing false positives. Decreasing it yields more detections, potentially including less confident targets but also increasing false positives. Adjust based on actual performance. |
Parameter Range |
[0, 1], Default value: 0.3 |
Detection category confidence threshold
Parameter Description |
Text-image matching confidence score threshold for filtering detection results. It measures how well the detected region matches the corresponding text prompt. |
Parameter Tuning Guide |
Increasing this value requires a higher semantic match between the detection result and the text prompt, helping to filter out results that may have high bounding box confidence but are less relevant to the prompt. Decreasing this value allows results with slightly lower matching scores to pass. |
Parameter Range |
[0, 1], Default value: 0.25 |
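The combined effect of the two thresholds can be sketched as a simple filter. This is a simplified illustration, assuming each candidate carries one box score and one text-matching score and must exceed both thresholds; the dictionary keys `box_score` and `text_score` are hypothetical, and the operator's internal scoring may work differently.

```python
from typing import Dict, List

BOX_THRESHOLD = 0.3    # documented default of the box confidence threshold
TEXT_THRESHOLD = 0.25  # documented default of the category confidence threshold


def filter_candidates(
    candidates: List[Dict[str, float]],
    box_threshold: float = BOX_THRESHOLD,
    text_threshold: float = TEXT_THRESHOLD,
) -> List[Dict[str, float]]:
    """Keep candidates whose scores are strictly higher than both thresholds."""
    for name, value in (("box_threshold", box_threshold),
                        ("text_threshold", text_threshold)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return [
        c for c in candidates
        if c["box_score"] > box_threshold and c["text_score"] > text_threshold
    ]


candidates = [
    {"box_score": 0.80, "text_score": 0.60},  # passes both thresholds
    {"box_score": 0.80, "text_score": 0.10},  # confident box, weak text match
    {"box_score": 0.20, "text_score": 0.90},  # good text match, weak box
    {"box_score": 0.30, "text_score": 0.50},  # equal to the box threshold, dropped
]
kept = filter_candidates(candidates)
stricter = filter_candidates(candidates, box_threshold=0.9)
```

Raising either threshold can only shrink the retained set, which matches the tuning guidance for both parameters.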