🔷Text Prompt Detection

Function Description

This operator utilizes the Grounding DINO model to achieve open-vocabulary object detection, which means it doesn’t need to pre-train specific categories of objects - as long as they can be described in text, they can potentially be detected. Based on English text prompts (words or sentences) input by users, it detects objects in images that correspond to the text descriptions and outputs their bounding boxes, categories, and confidence scores.

Usage Scenarios

Open-vocabulary detection: Detect new category objects not included in the training set, as long as they can be described in English. For example, "the red apple on the table" or "person wearing a hat".
Fine-grained detection: Distinguish objects based on more specific descriptions. For example, distinguish between "cat" and "dog", or "sedan" and "truck".
Specific attribute detection: Detect objects with specific attributes. For example, "the open door" or "a blue chair".

Input Output

Input

Image: Color image requiring detection (needs to be in RGB format). Currently only supports single image input.

GroundDinoDetect input

Output

Detection Results: List containing detection results, each result represents a detected instance, containing its bounding box (usually horizontal box), confidence score, assigned category ID (based on category mapping parameter), and corresponding polygon contour (same as bounding box).

GroundDinoDetect output

Parameter Description

English prompts: Must use English for prompts.
Prompt word separation: Use English periods or commas to separate multiple different detection target prompts.
Category mapping alignment: Ensure the number and order of numeric IDs in "Category Mapping" correspond one-to-one with the prompts in "Prompt Sentence", otherwise it may cause category assignment confusion or use default assignment.
Input image: Ensure the input is a color RGB image.
Single image processing: The current operator implementation only supports processing one image at a time.

Weight File

Parameter Description

Specify the Grounding DINO model weight file (usually in .pth format) for detection. A valid model file must be selected. Available models include Swin-T and Swin-B. Official models can be downloaded from specific paths.

Parameter Adjustment

Choose a model that matches your task requirements and hardware capabilities. The Swin-B model is usually larger with potentially higher accuracy but slower speed; the Swin-T model is relatively smaller and faster.

Enable GPU

Parameter Description

Choose whether to use GPU for model inference computation. If checked, ensure the computer has a compatible NVIDIA graphics card and corresponding CUDA environment. Default is off.

Parameter Adjustment

Checking this option can significantly improve processing speed. If there’s no compatible GPU or insufficient VRAM, this should be unchecked.

BERT Language Model Path

Parameter Description

Specify the path to the local BERT language model folder.

Parameter Adjustment

The operator depends on the BERT language model. If the environment allows access to the internal model server, no manual configuration is needed - the software will automatically download. If access is unavailable, you need to manually download bert-base-uncased.zip and extract it, then specify the extracted path in the initialization parameters. BERT language model download addresses:

http://10.10.10.98:9000/inference/groundingdino/groundingdino_swint_ogc.pth

http://10.10.10.98:9000/inference/groundingdino/groundingdino_swinb_cogcoor.pth

Subsequent runs: Once the model has been successfully loaded once (automatically or manually), this parameter usually doesn’t need to be set again unless the model file is moved or deleted.

Prompt Sentence

Parameter Description

Input English words or sentences for guiding detection. The model will attempt to find objects in the image that match this description. This parameter must be provided.

Parameter Adjustment

Use clear, specific English descriptions. If you want to detect multiple different object categories, you can separate different prompt words/phrases with English periods or commas. For example: "chair . table . person" or "red car, blue bike". The quality of prompts directly affects detection performance.

Category Mapping

Parameter Description

Map detected objects corresponding to prompts to numeric category IDs you specify (integers starting from 0). This is very useful for subsequent filtering or processing based on categories.

Parameter Adjustment

Input a series of numbers separated by English periods or commas. These numbers will correspond in order to each prompt separated by delimiters in "Prompt Sentence". For example, if the prompts are "cat . dog . bird", the category mapping can be "0 . 1 . 2", so detected cats will be labeled as category 0, dogs as category 1, and birds as category 2.

If this parameter is empty, or the number of provided mappings is less than the number of prompts, the operator will automatically assign category IDs 0, 1, 2, … in order. Ensure the number of mappings is at least equal to the number of prompts.

Detection Box Confidence Threshold

Parameter Description

Bounding box confidence score threshold for filtering detection results. Only detection boxes with scores above this threshold will be retained.

Parameter Adjustment

Increasing this value will result in fewer detection results, keeping only targets the model is very confident about, reducing false positives. Decreasing this value will get more detection results, possibly including some less confident targets, but may also increase false positives. Adjust according to actual performance.

Parameter Range

[0,1], Default: 0.3

Detection Category Confidence Threshold

Parameter Description

Text-image matching confidence score threshold for filtering detection results. It measures the degree of matching between detected regions and corresponding text prompts.

Parameter Adjustment

Increasing this value requires higher semantic matching between detection results and text prompts, helping filter out some results that have high detection box confidence but are not very relevant to the prompt words. Decreasing this value allows results with slightly lower matching degrees to pass.

Parameter Range

[0, 1], Default: 0.25