Tracking-Anything-with-DEVA
RuntimeWarning: Trying to segment without any memory
Hi, I am getting a runtime warning, "Trying to segment without any memory!", when I run the file. After the code finishes, the outputs are not segmented; they are simply the input images. I also tried adjusting the parameters to reduce memory usage. Another problem: when I run the demo_with_text file, I get this log:
UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1659484809662/work/aten/src/ATen/native/TensorShape.cpp:2894.)
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
final text_encoder_type: bert-base-uncased
Configuration: {'model': './saves/DEVA-propagation.pth', 'output': './example/output', 'save_all': False, 'amp': True, 'key_dim': 64, 'value_dim': 512, 'pix_feat_dim': 512, 'disable_long_term': False, 'max_mid_term_frames': 10, 'min_mid_term_frames': 5, 'max_long_term_elements': 10000, 'num_prototypes': 128, 'top_k': 30, 'mem_every': 5, 'chunk_size': 4, 'size': 480, 'GROUNDING_DINO_CONFIG_PATH': './saves/', 'GROUNDING_DINO_CHECKPOINT_PATH': './saves/groundingdino_swint_ogc.pth', 'DINO_THRESHOLD': 0.35, 'DINO_NMS_THRESHOLD': 0.8, 'SAM_ENCODER_VERSION': 'vit_h', 'SAM_CHECKPOINT_PATH': './saves/sam_vit_h_4b8939.pth', 'MOBILE_SAM_CHECKPOINT_PATH': './saves/', 'SAM_NUM_POINTS_PER_SIDE': 64, 'SAM_NUM_POINTS_PER_BATCH': 64, 'SAM_PRED_IOU_THRESHOLD': 0.88, 'SAM_OVERLAP_THRESHOLD': 0.8, 'img_path': './example/vipseg/images', 'detection_every': 5, 'num_voting_frames': 3, 'temporal_setting': 'semionline', 'max_missed_detection_count': 10, 'max_num_objects': -1, 'prompt': '', 'sam_variant': 'original', 'enable_long_term': True, 'enable_long_term_count_usage': False}
0%| | 0/2 [00:00<?, ?it/s]UserWarning: None of the inputs have requires_grad=True. Gradients will be None
0%| | 0/2 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/home/soicroot/Downloads/Khan/Lane/DEVA/demo/", line 64, in <module>
process_frame(deva, gd_model, sam_model, im_path, result_saver, ti, image_np=frame)
File "/home/soicroot/miniconda3/envs/DEVA/lib/python3.9/site-packages/torch/autograd/", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/soicroot/Downloads/Khan/Lane/DEVA/deva/ext/", line 58, in process_frame_with_text
mask, segments_info = make_segmentation_with_text(cfg, image_np, gd_model, sam_model,
File "/home/soicroot/Downloads/Khan/Lane/DEVA/deva/ext/", line 25, in make_segmentation_with_text
mask, segments_info = segment_with_text(cfg, gd_model, sam_model, image_np, prompts, min_side)
File "/home/soicroot/Downloads/Khan/Lane/DEVA/deva/ext/", line 73, in segment_with_text
detections = gd_model.predict_with_classes(image=cv2.cvtColor(image, cv2.COLOR_RGB2BGR),
File "/home/soicroot/Downloads/Khan/Lane/DEVA/Grounded-Segment-Anything/GroundingDINO/groundingdino/util/", line 195, in predict_with_classes
boxes, logits, phrases = predict(
File "/home/soicroot/Downloads/Khan/Lane/DEVA/Grounded-Segment-Anything/GroundingDINO/groundingdino/util/", line 67, in predict
outputs = model(image[None], captions=[caption])
File "/home/soicroot/miniconda3/envs/DEVA/lib/python3.9/site-packages/torch/nn/modules/", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/soicroot/Downloads/Khan/Lane/DEVA/Grounded-Segment-Anything/GroundingDINO/groundingdino/models/GroundingDINO/", line 313, in forward
hs, reference, hs_enc, ref_enc, init_box_proposal = self.transformer(
File "/home/soicroot/miniconda3/envs/DEVA/lib/python3.9/site-packages/torch/nn/modules/", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/soicroot/Downloads/Khan/Lane/DEVA/Grounded-Segment-Anything/GroundingDINO/groundingdino/models/GroundingDINO/", line 248, in forward
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)
#define NAN __int_as_float(0x7fffffff)
typedef long long int int64_t;
typedef unsigned int uint32_t;
typedef signed char int8_t;
typedef unsigned char uint8_t; // NOTE: this MUST be "unsigned char"! "char" is equivalent to "signed char"
typedef short int16_t;
static_assert(sizeof(int64_t) == 8, "expected size does not match");
static_assert(sizeof(uint32_t) == 4, "expected size does not match");
static_assert(sizeof(int8_t) == 1, "expected size does not match");
constexpr int num_threads = 128;
constexpr int thread_work_size = 4; // TODO: make template substitution once we decide where those vars live
constexpr int block_work_size = thread_work_size * num_threads;
//TODO use _assert_fail, because assert is disabled in non-debug builds
#define ERROR_UNSUPPORTED_CAST assert(false);
namespace std {
using ::signbit;
using ::isfinite;
using ::isinf;
using ::isnan;
using ::abs;
using ::acos;
using ::acosf;
using ::asin;
using ::asinf;
using ::atan;
using ::atanf;
using ::atan2;
using ::atan2f;
using ::ceil;
using ::ceilf;
using ::cos;
using ::cosf;
using ::cosh;
using ::coshf;
using ::exp;
using ::expf;
using ::fabs;
using ::fabsf;
using ::floor;
using ::floorf;
using ::fmod;
using ::fmodf;
using ::frexp;
using ::frexpf;
using ::ldexp;
using ::ldexpf;
using ::log;
using ::logf;
using ::log10;
using ::log10f;
using ::modf;
using ::modff;
using ::pow;
using ::powf;
using ::sin;
using ::sinf;
using ::sinh;
using ::sinhf;
using ::sqrt;
using ::sqrtf;
using ::tan;
using ::tanf;
using ::tanh;
using ::tanhf;
using ::acosh;
using ::acoshf;
using ::asinh;
using ::asinhf;
using ::atanh;
using ::atanhf;
using ::cbrt;
using ::cbrtf;
using ::copysign;
using ::copysignf;
using ::erf;
using ::erff;
using ::erfc;
using ::erfcf;
using ::exp2;
using ::exp2f;
using ::expm1;
using ::expm1f;
using ::fdim;
using ::fdimf;
using ::fmaf;
using ::fma;
using ::fmax;
using ::fmaxf;
using ::fmin;
using ::fminf;
using ::hypot;
using ::hypotf;
using ::ilogb;
using ::ilogbf;
using ::lgamma;
using ::lgammaf;
using ::llrint;
using ::llrintf;
using ::llround;
using ::llroundf;
using ::log1p;
using ::log1pf;
using ::log2;
using ::log2f;
using ::logb;
using ::logbf;
using ::lrint;
using ::lrintf;
using ::lround;
using ::lroundf;
using ::nan;
using ::nanf;
using ::nearbyint;
using ::nearbyintf;
using ::nextafter;
using ::nextafterf;
using ::remainder;
using ::remainderf;
using ::remquo;
using ::remquof;
using ::rint;
using ::rintf;
using ::round;
using ::roundf;
using ::scalbln;
using ::scalblnf;
using ::scalbn;
using ::scalbnf;
using ::tgamma;
using ::tgammaf;
using ::trunc;
using ::truncf;
} // namespace std
// NB: Order matters for this macro; it is relied upon in
// _promoteTypesLookup and the serialization format.
// Note, some types have ctype as void because we don't support them in codegen
_(uint8_t, Byte) /* 0 */ \
_(int8_t, Char) /* 1 */ \
_(int16_t, Short) /* 2 */ \
_(int, Int) /* 3 */ \
_(int64_t, Long) /* 4 */ \
_(at::Half, Half) /* 5 */ \
_(float, Float) /* 6 */ \
_(double, Double) /* 7 */ \
_(std::complex<at::Half>, ComplexHalf) /* 8 */ \
_(std::complex<float>, ComplexFloat) /* 9 */ \
_(std::complex<double>, ComplexDouble) /* 10 */ \
_(bool, Bool) /* 11 */ \
_(void, QInt8) /* 12 */ \
_(void, QUInt8) /* 13 */ \
_(void, QInt32) /* 14 */ \
_(at::BFloat16, BFloat16) /* 15 */ \
_(uint8_t, Byte) \
_(int8_t, Char) \
_(int16_t, Short) \
_(int, Int) \
_(int64_t, Long) \
_(at::Half, Half) \
_(float, Float) \
_(double, Double) \
_(std::complex<at::Half>, ComplexHalf) \
_(std::complex<float>, ComplexFloat) \
_(std::complex<double>, ComplexDouble) \
_(bool, Bool) \
_(at::BFloat16, BFloat16)
enum class ScalarType : int8_t {
#define DEFINE_ENUM(_1, n) n,
template <typename T, int size>
struct Array {
T data[size];
__device__ T operator[](int i) const {
return data[i];
__device__ T& operator[](int i) {
return data[i];
Array() = default;
Array(const Array&) = default;
Array& operator=(const Array&) = default;
__device__ Array(T x) {
for (int i = 0; i < size; i++) {
data[i] = x;
template <typename T>
struct DivMod {
T div;
T mod;
__device__ DivMod(T _div, T _mod) {
div = _div;
mod = _mod;
//<unsigned int>
struct IntDivider {
IntDivider() = default;
__device__ inline unsigned int div(unsigned int n) const {
unsigned int t = __umulhi(n, m1);
return (t + n) >> shift;
__device__ inline unsigned int mod(unsigned int n) const {
return n - div(n) * divisor;
__device__ inline DivMod<unsigned int> divmod(unsigned int n) const {
unsigned int q = div(n);
return DivMod<unsigned int>(q, n - q * divisor);
unsigned int divisor; // d above.
unsigned int m1; // Magic number: m' above.
unsigned int shift; // Shift amounts.
template <int NARGS>
struct TrivialOffsetCalculator {
// The offset for each argument. Wrapper around fixed-size array.
// The offsets are in # of elements, not in bytes.
Array<unsigned int, NARGS> get(unsigned int linear_idx) const {
Array<unsigned int, NARGS> offsets;
#pragma unroll
for (int arg = 0; arg < NARGS; arg++) {
offsets[arg] = linear_idx;
return offsets;
template<int NARGS>
struct OffsetCalculator {
OffsetCalculator() = default;
__device__ __forceinline__ Array<unsigned int, NARGS> get(unsigned int linear_idx) const {
Array<unsigned int, NARGS> offsets;
#pragma unroll
for (int arg = 0; arg < NARGS; ++arg) {
offsets[arg] = 0;
#pragma unroll
for (int dim = 0; dim < 25; ++dim) {
if (dim == dims) {
auto divmod = sizes_[dim].divmod(linear_idx);
linear_idx = divmod.div;
#pragma unroll
for (int arg = 0; arg < NARGS; ++arg) {
offsets[arg] += divmod.mod * strides_[dim][arg];
//printf("offset calc thread dim size stride offset %d %d %d %d %d %d %d %d\n",
//threadIdx.x, dim, sizes_[dim].divisor, strides_[dim][0], offsets[0], linear_idx, divmod.div, divmod.mod);
return offsets;
int dims;
IntDivider sizes_[25];
// NOTE: this approach will not support nInputs == 0
unsigned int strides_[25][NARGS];
#define C10_HOST_DEVICE __host__ __device__
#define C10_DEVICE __device__
template <typename T>
__device__ __forceinline__ T WARP_SHFL_DOWN(T value, unsigned int delta, int width = warpSize, unsigned int mask = 0xffffffff)
return __shfl_down_sync(mask, value, delta, width);
#if 0
template <typename T>
__device__ __forceinline__ std::complex<T> WARP_SHFL_DOWN(std::complex<T> value, unsigned int delta, int width = warpSize, unsigned int mask = 0xffffffff)
return std::complex<T>(
__shfl_down_sync(mask, value.real(), delta, width),
__shfl_down_sync(mask, value.imag(), delta, width));
// aligned vector generates vectorized load/store on CUDA
template<typename scalar_t, int vec_size>
struct alignas(sizeof(scalar_t) * vec_size) aligned_vector {
scalar_t val[vec_size];
C10_HOST_DEVICE static void reduce_fraction(size_t &numerator, size_t &denominator) {
// get GCD of num and denom using Euclid's algorithm.
// Can replace this with std::gcd if we ever support c++17.
size_t a = denominator;
size_t b = numerator;
while (b != 0) {
a %= b;
// swap(a,b)
size_t tmp = a;
a = b;
b = tmp;
// a is now the GCD
numerator /= a;
denominator /= a;
struct ReduceConfig {
//has to match host-side ReduceConfig in the eager code
static constexpr int BLOCK_X = 0;
static constexpr int BLOCK_Y = 1;
static constexpr int CTA = 2;
static constexpr int input_vec_size = 4;
int element_size_bytes;
int num_inputs;
int num_outputs;
int step_input = 1;
int step_output = 1;
int ctas_per_output = 1;
int input_mult[3] = {0, 0, 0};
int output_mult[2] = {0, 0};
int block_width;
int block_height;
int num_threads;
bool vectorize_input = false;
int output_vec_size = 1;
C10_HOST_DEVICE bool should_block_x_reduce() const {
return input_mult[BLOCK_X] != 0;
C10_HOST_DEVICE bool should_block_y_reduce() const {
return input_mult[BLOCK_Y] != 0;
C10_HOST_DEVICE bool should_global_reduce() const {
return input_mult[CTA] != 0;
C10_DEVICE bool should_store(int output_idx) const {
return output_idx < num_outputs &&
(!should_block_x_reduce() || threadIdx.x == 0) &&
(!should_block_y_reduce() || threadIdx.y == 0);
C10_DEVICE bool should_reduce_tail() const {
return (!should_block_y_reduce() || threadIdx.y == 0) &&
(!should_global_reduce() || blockIdx.y == 0);
C10_HOST_DEVICE int input_idx() const {
int lane = threadIdx.x;
int warp = threadIdx.y;
int cta2 = blockIdx.y;
return (lane * input_mult[BLOCK_X] +
warp * input_mult[BLOCK_Y] +
cta2 * input_mult[CTA]);
template <int output_vec_size>
C10_HOST_DEVICE int output_idx() const {
int lane = threadIdx.x;
int warp = threadIdx.y;
int cta1 = blockIdx.x;
return (lane * output_mult[BLOCK_X] +
warp * output_mult[BLOCK_Y] +
cta1 * step_output) * output_vec_size;
C10_DEVICE int shared_memory_offset(int offset) const {
return threadIdx.x + (threadIdx.y + offset) * blockDim.x;
C10_DEVICE int staging_memory_offset(int cta2) const {
int offset = cta2 + blockIdx.x * gridDim.y;
if (!should_block_x_reduce()) {
offset = threadIdx.x + offset * blockDim.x;
return offset;
//TODO this will need to be different for more generic reduction functions
namespace reducer {
using scalar_t = int64_t;
using arg_t = int64_t;
using out_scalar_t = int64_t;
inline __device__ arg_t combine(arg_t a, arg_t b) { return a * b; }
inline __device__ out_scalar_t project(arg_t arg) {
return (out_scalar_t) arg;
inline __device__ arg_t warp_shfl_down(arg_t arg, int offset) {
return WARP_SHFL_DOWN(arg, offset);
inline __device__ arg_t translate_idx(arg_t acc, int64_t /*idx*/) {
return acc;
// wrap a normal reduction that ignores the index
inline __device__ arg_t reduce(arg_t acc, arg_t val, int64_t idx) {
return combine(acc, val);
struct ReduceJitOp {
using scalar_t = int64_t;
using arg_t = int64_t;
using out_scalar_t = int64_t;
using InputCalculator = OffsetCalculator<1>;
using OutputCalculator = OffsetCalculator<2>;
// static constexpr bool can_accumulate_in_output =
// std::is_convertible<arg_t, out_scalar_t>::value
// && std::is_convertible<out_scalar_t, arg_t>::value;
static constexpr int input_vec_size = ReduceConfig::input_vec_size;
arg_t ident;
ReduceConfig config;
InputCalculator input_calc;
OutputCalculator output_calc;
const void* src;
const char* dst[2]; //it accepts at most two destinations
// acc_buf used for accumulation among sub Tensor Iterator when accumulation on
// output is not permissible
void* acc_buf;
// cta_buf used for accumulation between blocks during global reduction
void* cta_buf;
int* semaphores;
int64_t base_idx;
bool accumulate;
bool final_output;
int noutputs;
C10_DEVICE void run() const {
extern __shared__ char shared_memory[];
uint32_t output_idx = config.output_idx<1>();
uint32_t input_idx = config.input_idx();
auto base_offsets1 = output_calc.get(output_idx)[1];
using arg_vec_t = Array<arg_t, 1>;
arg_vec_t value;
if (output_idx < config.num_outputs && input_idx < config.num_inputs) {
const scalar_t* input_slice = (const scalar_t*)((const char*)src + base_offsets1);
value = thread_reduce<1>(input_slice);
if (config.should_block_y_reduce()) {
value = block_y_reduce<1>(value, shared_memory);
if (config.should_block_x_reduce()) {
value = block_x_reduce<1>(value, shared_memory);
using out_ptr_vec_t = Array<out_scalar_t*, 1>;
using offset_vec_t = Array<uint32_t, 1>;
offset_vec_t base_offsets;
out_ptr_vec_t out;
#pragma unroll
for (int i = 0; i < 1; i++) {
base_offsets[i] = output_calc.get(output_idx + i)[0];
out[i] = (out_scalar_t*)((char*)dst[0] + base_offsets[i]);
arg_vec_t* acc = nullptr;
if (acc_buf != nullptr) {
size_t numerator = sizeof(arg_t);
size_t denominator = sizeof(out_scalar_t);
reduce_fraction(numerator, denominator);
acc = (arg_vec_t*)((char*)acc_buf + (base_offsets[0] * numerator / denominator));
if (config.should_global_reduce()) {
value = global_reduce<1>(value, acc, shared_memory);
} else if (config.should_store(output_idx)) {
if (accumulate) {
#pragma unroll
for (int i = 0; i < 1; i++) {
value[i] = reducer::translate_idx(value[i], base_idx);
if (acc == nullptr) {
if (accumulate) {
value = accumulate_in_output<1>(out, value);
if (final_output) {
set_results_to_output<1>(value, base_offsets);
} else {
#pragma unroll
for (int i = 0; i < 1; i++) {
*(out[i]) = get_accumulated_output(out[i], value[i]);
} else {
if (accumulate) {
#pragma unroll
for (int i = 0; i < 1; i++) {
value[i] = reducer::combine((*acc)[i], value[i]);
if (final_output) {
set_results_to_output<1>(value, base_offsets);
} else {
*acc = value;
template <int output_vec_size>
C10_DEVICE Array<arg_t, output_vec_size> thread_reduce(const scalar_t* data) const {
if (config.vectorize_input) {
assert(output_vec_size == 1);
// reduce at the header of input_slice where memory is not aligned,
// so that thread_reduce will have an aligned memory to work on.
return {input_vectorized_thread_reduce_impl(data)};
} else {
uint32_t element_stride = input_calc.strides_[0][0] / sizeof(scalar_t);
bool is_contiguous = (input_calc.dims == 1 && element_stride == 1);
if (is_contiguous) {
return thread_reduce_impl<output_vec_size>(data, [](uint32_t idx) { return idx; });
} else if (input_calc.dims == 1) {
return thread_reduce_impl<output_vec_size>(data, [&](uint32_t idx) { return idx * element_stride; });
} else {
return thread_reduce_impl<output_vec_size>(data, [&](uint32_t idx) { return input_calc.get(idx)[0] / sizeof(scalar_t); });
C10_DEVICE arg_t input_vectorized_thread_reduce_impl(const scalar_t* data) const {
uint32_t end = config.num_inputs;
// Handle the head of input slice where data is not aligned
arg_t value = ident;
constexpr int align_bytes = alignof(aligned_vector<scalar_t, input_vec_size>);
constexpr int align_elements = align_bytes / sizeof(scalar_t);
int shift = ((int64_t)data) % align_bytes / sizeof(scalar_t);
if (shift > 0) {
data -= shift;
end += shift;
if(threadIdx.x >= shift && threadIdx.x < align_elements && config.should_reduce_tail()){
value = reducer::reduce(value, data[threadIdx.x], threadIdx.x - shift);
end -= align_elements;
data += align_elements;
shift = align_elements - shift;
// Do the vectorized reduction
using load_t = aligned_vector<scalar_t, input_vec_size>;
uint32_t idx = config.input_idx();
const uint32_t stride = config.step_input;
// Multiple accumulators to remove dependency between unrolled loops.
arg_t value_list[input_vec_size];
value_list[0] = value;
#pragma unroll
for (int i = 1; i < input_vec_size; i++) {
value_list[i] = ident;
scalar_t values[input_vec_size];
load_t *values_vector = reinterpret_cast<load_t*>(&values[0]);
while (idx * input_vec_size + input_vec_size - 1 < end) {
*values_vector = reinterpret_cast<const load_t*>(data)[idx];
#pragma unroll
for (uint32_t i = 0; i < input_vec_size; i++) {
value_list[i] = reducer::reduce(value_list[i], values[i], shift + idx * input_vec_size + i);
idx += stride;
// tail
uint32_t tail_start = end - end % input_vec_size;
if (config.should_reduce_tail()) {
int idx = tail_start + threadIdx.x;
if (idx < end) {
value_list[0] = reducer::reduce(value_list[0], data[idx], idx + shift);
// combine accumulators
#pragma unroll
for (int i = 1; i < input_vec_size; i++) {
value_list[0] = reducer::combine(value_list[0], value_list[i]);
return value_list[0];
template <int output_vec_size, typename offset_calc_t>
C10_DEVICE Array<arg_t, output_vec_size> thread_reduce_impl(const scalar_t* data_, offset_calc_t calc) const {
uint32_t idx = config.input_idx();
const uint32_t end = config.num_inputs;
const uint32_t stride = config.step_input;
const int vt0=4;
using arg_vec_t = Array<arg_t, output_vec_size>;
using load_t = aligned_vector<scalar_t, output_vec_size>;
const load_t* data = reinterpret_cast<const load_t*>(data_);
// Multiple accumulators to remove dependency between unrolled loops.
arg_vec_t value_list[vt0];
#pragma unroll
for (int i = 0; i < vt0; i++) {
#pragma unroll
for (int j = 0; j < output_vec_size; j++) {
value_list[i][j] = ident;
load_t values[vt0];
while (idx + (vt0 - 1) * stride < end) {
#pragma unroll
for (uint32_t i = 0; i < vt0; i++) {
values[i] = data[calc(idx + i * stride) / output_vec_size];
#pragma unroll
for (uint32_t i = 0; i < vt0; i++) {
#pragma unroll
for (uint32_t j = 0; j < output_vec_size; j++) {
value_list[i][j] = reducer::reduce(value_list[i][j], values[i].val[j], idx + i * stride);
idx += stride * vt0;
// tail
int idx_ = idx;
#pragma unroll
for (uint32_t i = 0; i < vt0; i++) {
if (idx >= end) {
values[i] = data[calc(idx) / output_vec_size];
idx += stride;
idx = idx_;
#pragma unroll
for (uint32_t i = 0; i < vt0; i++) {
if (idx >= end) {
#pragma unroll
for (uint32_t j = 0; j < output_vec_size; j++) {
value_list[i][j] = reducer::reduce(value_list[i][j], values[i].val[j], idx);
idx += stride;
// combine accumulators
#pragma unroll
for (int i = 1; i < vt0; i++) {
#pragma unroll
for (uint32_t j = 0; j < output_vec_size; j++) {
value_list[0][j] = reducer::combine(value_list[0][j], value_list[i][j]);
return value_list[0];
template <int output_vec_size>
C10_DEVICE Array<arg_t, output_vec_size> block_x_reduce(Array<arg_t, output_vec_size> value, char* shared_memory) const {
using args_vec_t = Array<arg_t, output_vec_size>;
int dim_x = blockDim.x;
args_vec_t* shared = (args_vec_t*)shared_memory;
if (dim_x > warpSize) {
int address_base = threadIdx.x + threadIdx.y*blockDim.x;
shared[address_base] = value;
for (int offset = dim_x/2; offset >= warpSize; offset >>= 1) {
if (threadIdx.x < offset && threadIdx.x + offset < blockDim.x) {
args_vec_t other = shared[address_base + offset];
#pragma unroll
for (int i = 0; i < output_vec_size; i++) {
value[i] = reducer::combine(value[i], other[i]);
shared[address_base] = value;
dim_x = warpSize;
for (int offset = 1; offset < dim_x; offset <<= 1) {
#pragma unroll
for (int i = 0; i < output_vec_size; i++) {
arg_t other = reducer::warp_shfl_down(value[i], offset);
value[i] = reducer::combine(value[i], other);
return value;
template <int output_vec_size>
C10_DEVICE Array<arg_t, output_vec_size> block_y_reduce(Array<arg_t, output_vec_size> value, char* shared_memory) const {
using args_vec_t = Array<arg_t, output_vec_size>;
args_vec_t* shared = (args_vec_t*)shared_memory;
shared[config.shared_memory_offset(0)] = value;
for (int offset = blockDim.y / 2; offset > 0; offset >>= 1) {
if (threadIdx.y < offset && threadIdx.y + offset < blockDim.y) {
args_vec_t other = shared[config.shared_memory_offset(offset)];
#pragma unroll
for (int i = 0; i < output_vec_size; i++) {
value[i] = reducer::combine(value[i], other[i]);
shared[config.shared_memory_offset(0)] = value;
return value;
C10_DEVICE bool mark_block_finished() const {
__shared__ bool is_last_block_done_shared;
if (threadIdx.x == 0 && threadIdx.y == 0) {
int prev_blocks_finished = atomicAdd(&semaphores[blockIdx.x], 1);
is_last_block_done_shared = (prev_blocks_finished == gridDim.y - 1);
return is_last_block_done_shared;
template <int output_vec_size>
C10_DEVICE Array<arg_t, output_vec_size> accumulate_in_output(
Array<out_scalar_t*, output_vec_size> out,
Array<arg_t, output_vec_size> value
) const {
Array<arg_t, output_vec_size> ret;
#pragma unroll
for (int i = 0; i < output_vec_size; i++) {
ret[i] = reducer::combine(*(out[i]), value[i]);
return ret;
C10_DEVICE out_scalar_t get_accumulated_output(
out_scalar_t* out, arg_t value
) const {
return (out_scalar_t)value;
template<class T>
C10_DEVICE void set_results(const T x, const uint32_t base_offset) const {
assert(noutputs == 1);
auto res = (out_scalar_t*)((char*)dst[0] + base_offset);
*res = x;
//TODO - multi-output reduction - we won't be able to use thrust::pair
//just explicitly specify typed output reads/writes
//Currently implemented for max of two outputs
// template<class T1, class T2>
// C10_DEVICE void set_results(const thrust::pair<T1, T2> x, const index_t base_offset) const {
// if (noutputs >= 1) {
// auto res0 = (T1*)((char*)dst[0] + base_offset);
// *res0 = x.first;
// }
// if (noutputs >= 2) {
// // base offset is computed assuming element size being sizeof(T1), so we need to make a
// // correction to obtain the correct base offset
// auto res1 = (T2*) ((char *) dst[1] + base_offset / sizeof(T1) * sizeof(T2));
// *res1 = x.second;
// }
// }
template <int output_vec_size>
C10_DEVICE void set_results_to_output(Array<arg_t, output_vec_size> value, Array<uint32_t, output_vec_size> base_offset) const {
#pragma unroll
for (int i = 0; i < output_vec_size; i++) {
set_results(reducer::project(value[i]), base_offset[i]);
template <int output_vec_size>
C10_DEVICE Array<arg_t, output_vec_size> global_reduce(Array<arg_t, output_vec_size> value, Array<arg_t, output_vec_size> *acc, char* shared_memory) const {
using arg_vec_t = Array<arg_t, output_vec_size>;
using out_ptr_vec_t = Array<out_scalar_t*, output_vec_size>;
using offset_vec_t = Array<uint32_t, output_vec_size>;
arg_vec_t* reduce_buffer = (arg_vec_t*)cta_buf;
uint32_t output_idx = config.output_idx<output_vec_size>();
offset_vec_t base_offsets;
out_ptr_vec_t out;
#pragma unroll
for (int i = 0; i < output_vec_size; i++) {
base_offsets[i] = output_calc.get(output_idx + i)[0];
out[i] = (out_scalar_t*)((char*)dst[0] + base_offsets[i]);
bool should_store = config.should_store(output_idx);
if (should_store) {
uint32_t offset = config.staging_memory_offset(blockIdx.y);
reduce_buffer[offset] = value;
__threadfence(); // make sure writes are globally visible
__syncthreads(); // if multiple warps in this block wrote to staging, make sure they're all done
bool is_last_block_done = mark_block_finished();
if (is_last_block_done) {
value = ident;
if (config.should_block_x_reduce()) {
uint32_t input_offset = threadIdx.x + threadIdx.y * blockDim.x;
uint32_t step = blockDim.x * blockDim.y;
for (; input_offset < config.ctas_per_output; input_offset += step) {
uint32_t idx = config.staging_memory_offset(input_offset);
arg_vec_t next = reduce_buffer[idx];
#pragma unroll
for (int i = 0; i < output_vec_size; i++) {
value[i] = reducer::combine(value[i], next[i]);
} else {
uint32_t input_offset = threadIdx.y;
uint32_t step = blockDim.y;
for (; input_offset < config.ctas_per_output; input_offset += step) {
uint32_t idx = config.staging_memory_offset(input_offset);
arg_vec_t next = reduce_buffer[idx];
#pragma unroll
for (int i = 0; i < output_vec_size; i++) {
value[i] = reducer::combine(value[i], next[i]);
value = block_y_reduce(value, shared_memory);
if (config.should_block_x_reduce()) {
value = block_x_reduce<output_vec_size>(value, shared_memory);
if (should_store) {
if (accumulate) {
#pragma unroll
for (int i = 0; i < output_vec_size; i++) {
value[i] = reducer::translate_idx(value[i], base_idx);
if (acc == nullptr) {
if (accumulate) {
value = accumulate_in_output<output_vec_size>(out, value);
if (final_output) {
set_results_to_output<output_vec_size>(value, base_offsets);
} else {
#pragma unroll
for (int i = 0; i < output_vec_size; i++) {
*(out[i]) = get_accumulated_output(out[i], value[i]);
} else {
if (accumulate) {
#pragma unroll
for (int i = 0; i < output_vec_size; i++) {
value[i] = reducer::combine((*acc)[i], value[i]);
if (final_output) {
set_results_to_output<output_vec_size>(value, base_offsets);
} else {
*acc = value;
return value;
extern "C"
__launch_bounds__(512, 4)
__global__ void reduction_prod_kernel(ReduceJitOp r){;
nvrtc: error: invalid value for --gpu-architecture (-arch)
My GPU is an NVIDIA RTX 4090. How can I get rid of this problem? TIA
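A note on the `nvrtc` error above, offered as a possible cause rather than a confirmed diagnosis: the environment listing later in this thread shows pytorch 1.12.0 built against cudatoolkit 11.3.1, while an RTX 4090 has compute capability 8.9, which CUDA toolkits only recognize from release 11.8 onward. A runtime compiler older than that would plausibly reject the architecture flag. The sketch below checks this kind of mismatch. The table and the helper names (`MIN_CUDA_FOR_CAPABILITY`, `parse_version`, `toolkit_supports`) are made up for illustration and cover only a few architectures; the `torch` calls are used only if PyTorch is installed.

```python
# Minimum CUDA toolkit release that recognizes a given compute capability.
# Illustrative subset only; not an exhaustive or official table.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere (A100)
    (8, 6): (11, 1),  # Ampere (RTX 30 series)
    (8, 9): (11, 8),  # Ada (e.g. RTX 4090)
    (9, 0): (11, 8),  # Hopper
}


def parse_version(text):
    """Turn a version string such as '11.3' or '11.3.1' into (11, 3)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def toolkit_supports(capability, cuda_version):
    """Return True/False, or None if the capability is not in the table."""
    required = MIN_CUDA_FOR_CAPABILITY.get(tuple(capability))
    if required is None:
        return None
    return parse_version(cuda_version) >= required


if __name__ == "__main__":
    # Values taken from this thread: RTX 4090 with cudatoolkit 11.3.1.
    print("sm_89 with CUDA 11.3:", toolkit_supports((8, 9), "11.3.1"))

    # Optional: query the local installation when PyTorch is present.
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.version.cuda and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        print("device capability:", cap)
        print("PyTorch CUDA build:", torch.version.cuda)
        print("compiled arch list:", torch.cuda.get_arch_list())
        print("supported:", toolkit_supports(cap, torch.version.cuda))
```

If the check reports a mismatch, installing a PyTorch build compiled against CUDA 11.8 or newer is one thing to try; whether that resolves this particular issue is not confirmed in the thread.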
The warning means that the detector did not detect anything. Can you try re-installing Grounding-DINO/SAM? It looks like a problem with one of these dependencies.
And what about the log from demo_with_text? Okay, I am re-installing Grounding-DINO/SAM.
I believe Grounding-DINO/SAM is installed properly.
# packages in environment at /home/soicroot/miniconda3/envs/dev:
# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
absl-py 2.1.0 pypi_0 pypi
accelerate 0.30.1 pypi_0 pypi
addict 2.4.0 pypi_0 pypi
aiofiles 23.2.1 pypi_0 pypi
aliyun-python-sdk-core 2.15.1 pypi_0 pypi
aliyun-python-sdk-kms 2.16.3 pypi_0 pypi
altair 5.3.0 pypi_0 pypi
annotated-types 0.6.0 pypi_0 pypi
anyio 4.3.0 pypi_0 pypi
asttokens 2.4.1 pypi_0 pypi
attrs 23.2.0 pypi_0 pypi
beautifulsoup4 4.12.3 pypi_0 pypi
blas 1.0 mkl
boto3 1.34.105 pypi_0 pypi
botocore 1.34.105 pypi_0 pypi
brotli-python 1.0.9 py39h6a678d5_8
bzip2 1.0.8 h5eee18b_6
ca-certificates 2024.3.11 h06a4308_0
certifi 2024.2.2 py39h06a4308_0
cffi 1.16.0 pypi_0 pypi
charset-normalizer 2.0.4 pyhd3eb1b0_0
chumpy 0.70 pypi_0 pypi
click 8.1.7 pypi_0 pypi
colorama 0.4.6 pypi_0 pypi
coloredlogs 15.0.1 pypi_0 pypi
comm 0.2.2 pypi_0 pypi
contourpy 1.2.1 pypi_0 pypi
crcmod 1.7 pypi_0 pypi
cryptography 42.0.7 pypi_0 pypi
cudatoolkit 11.3.1 h2bc3f7f_2
cycler 0.12.1 pypi_0 pypi
cython 3.0.10 pypi_0 pypi
debugpy 1.8.1 pypi_0 pypi
decorator 5.1.1 pypi_0 pypi
defusedxml 0.7.1 pypi_0 pypi
deva 1.0.0 pypi_0 pypi
diffusers 0.27.2 pypi_0 pypi
dnspython 2.6.1 pypi_0 pypi
easydict 1.13 pypi_0 pypi
einops 0.8.0 pypi_0 pypi
email-validator 2.1.1 pypi_0 pypi
exceptiongroup 1.2.1 pypi_0 pypi
executing 2.0.1 pypi_0 pypi
fairscale 0.4.4 pypi_0 pypi
fastapi 0.111.0 pypi_0 pypi
fastapi-cli 0.0.3 pypi_0 pypi
ffmpeg 4.3 hf484d3e_0 pytorch
ffmpy 0.3.2 pypi_0 pypi
filelock 3.14.0 pypi_0 pypi
filterpy 1.4.5 pypi_0 pypi
flatbuffers 24.3.25 pypi_0 pypi
fonttools 4.51.0 pypi_0 pypi
freetype 2.12.1 h4a9f257_0
freetype-py 2.4.0 pypi_0 pypi
fsspec 2024.5.0 pypi_0 pypi
gdown 5.2.0 pypi_0 pypi
gitdb 4.0.11 pypi_0 pypi
gitpython 3.1.43 pypi_0 pypi
gmp 6.2.1 h295c915_3
gnutls 3.6.15 he1e5248_0
gradio 4.31.2 pypi_0 pypi
gradio-client 0.16.3 pypi_0 pypi
groundingdino 0.1.0 dev_0 <develop>
grpcio 1.63.0 pypi_0 pypi
gurobipy 11.0.2 pypi_0 pypi
h11 0.14.0 pypi_0 pypi
h5py 3.11.0 pypi_0 pypi
hickle 5.0.3 pypi_0 pypi
httpcore 1.0.5 pypi_0 pypi
httptools 0.6.1 pypi_0 pypi
httpx 0.27.0 pypi_0 pypi
huggingface-hub 0.23.0 pypi_0 pypi
humanfriendly 10.0 pypi_0 pypi
idna 3.7 py39h06a4308_0
imageio 2.34.1 pypi_0 pypi
importlib-metadata 7.1.0 pypi_0 pypi
importlib-resources 6.4.0 pypi_0 pypi
intel-openmp 2023.1.0 hdb19cb5_46306
ipykernel 6.29.4 pypi_0 pypi
ipython 8.18.1 pypi_0 pypi
jedi 0.19.1 pypi_0 pypi
jinja2 3.1.4 pypi_0 pypi
jmespath 0.10.0 pypi_0 pypi
joblib 1.4.2 pypi_0 pypi
jpeg 9e h5eee18b_1
json-tricks 3.17.3 pypi_0 pypi
jsonschema 4.22.0 pypi_0 pypi
jsonschema-specifications 2023.12.1 pypi_0 pypi
jupyter-client 8.6.1 pypi_0 pypi
jupyter-core 5.7.2 pypi_0 pypi
kiwisolver 1.4.5 pypi_0 pypi
lame 3.100 h7b6447c_0
lazy-loader 0.4 pypi_0 pypi
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.38 h1181459_1
lerc 3.0 h295c915_0
libdeflate 1.17 h5eee18b_1
libffi 3.4.4 h6a678d5_1
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libiconv 1.16 h5eee18b_3
libidn2 2.3.4 h5eee18b_0
libpng 1.6.39 h5eee18b_0
libstdcxx-ng 11.2.0 h1234567_1
libtasn1 4.19.0 h5eee18b_0
libtiff 4.5.1 h6a678d5_0
libunistring 0.9.10 h27cfd23_0
libwebp-base 1.3.2 h5eee18b_0
llvmlite 0.42.0 pypi_0 pypi
lz4-c 1.9.4 h6a678d5_1
markdown 3.6 pypi_0 pypi
markdown-it-py 3.0.0 pypi_0 pypi
markupsafe 2.1.5 pypi_0 pypi
matplotlib 3.8.4 pypi_0 pypi
matplotlib-inline 0.1.7 pypi_0 pypi
mdurl 0.1.2 pypi_0 pypi
mkl 2023.1.0 h213fc3f_46344
mkl-service 2.4.0 py39h5eee18b_1
mkl_fft 1.3.8 py39h5eee18b_0
mkl_random 1.2.4 py39hdb19cb5_0
mmcv-full 1.7.1 pypi_0 pypi
mmpose 0.28.0 pypi_0 pypi
model-index 0.1.11 pypi_0 pypi
mpmath 1.3.0 pypi_0 pypi
munkres 1.1.4 pypi_0 pypi
ncurses 6.4 h6a678d5_0
nest-asyncio 1.6.0 pypi_0 pypi
nettle 3.7.3 hbbd107a_1
networkx 3.2.1 pypi_0 pypi
numba 0.59.1 pypi_0 pypi
numpy 1.26.4 py39h5f9d8c6_0
numpy-base 1.26.4 py39hb5e798b_0
onnx 1.16.0 pypi_0 pypi
onnxruntime 1.17.3 pypi_0 pypi
opencv-python pypi_0 pypi
opencv-python-headless pypi_0 pypi
opendatalab 0.0.10 pypi_0 pypi
openh264 2.1.1 h4ff587b_0
openjpeg 2.4.0 h3ad879b_0
openmim 0.3.9 pypi_0 pypi
openssl 3.0.13 h7f8727e_1
openxlab 0.0.38 pypi_0 pypi
ordered-set 4.1.0 pypi_0 pypi
orjson 3.10.3 pypi_0 pypi
oss2 2.17.0 pypi_0 pypi
packaging 24.0 pypi_0 pypi
pandas 2.2.2 pypi_0 pypi
parso 0.8.4 pypi_0 pypi
pexpect 4.9.0 pypi_0 pypi
pillow 10.3.0 py39h5eee18b_0
pip 24.0 py39h06a4308_0
platformdirs 4.2.2 pypi_0 pypi
plyfile 1.0.3 pypi_0 pypi
prompt-toolkit 3.0.43 pypi_0 pypi
protobuf 5.26.1 pypi_0 pypi
psutil 5.9.8 pypi_0 pypi
ptyprocess 0.7.0 pypi_0 pypi
pulp 2.8.0 pypi_0 pypi
pure-eval 0.2.2 pypi_0 pypi
pycocoevalcap 1.2 pypi_0 pypi
pycocotools 2.0.7 pypi_0 pypi
pycparser 2.22 pypi_0 pypi
pycryptodome 3.20.0 pypi_0 pypi
pydantic 2.7.1 pypi_0 pypi
pydantic-core 2.18.2 pypi_0 pypi
pydub 0.25.1 pypi_0 pypi
pyglet 1.5.27 pypi_0 pypi
pygments 2.18.0 pypi_0 pypi
pyopengl 3.1.0 pypi_0 pypi
pyparsing 3.1.2 pypi_0 pypi
pyrender 0.1.45 pypi_0 pypi
pysocks 1.7.1 py39h06a4308_0
python 3.9.19 h955ad1f_1
python-dateutil 2.9.0.post0 pypi_0 pypi
python-dotenv 1.0.1 pypi_0 pypi
python-multipart 0.0.9 pypi_0 pypi
pytorch 1.12.0 py3.9_cuda11.3_cudnn8.3.2_0 pytorch
pytorch-mutex 1.0 cuda pytorch
pytz 2023.4 pypi_0 pypi
pyyaml 6.0.1 pypi_0 pypi
pyzmq 26.0.3 pypi_0 pypi
readline 8.2 h5eee18b_0
referencing 0.35.1 pypi_0 pypi
regex 2024.5.15 pypi_0 pypi
requests 2.28.2 pypi_0 pypi
rich 13.4.2 pypi_0 pypi
rpds-py 0.18.1 pypi_0 pypi
ruff 0.4.4 pypi_0 pypi
s3transfer 0.10.1 pypi_0 pypi
sacremoses 0.1.1 pypi_0 pypi
safetensors 0.4.3 pypi_0 pypi
scikit-image 0.22.0 pypi_0 pypi
scikit-learn 1.4.2 pypi_0 pypi
scipy 1.13.0 pypi_0 pypi
segment-anything 1.0 dev_0 <develop>
semantic-version 2.10.0 pypi_0 pypi
setuptools 60.2.0 pypi_0 pypi
shellingham 1.5.4 pypi_0 pypi
six 1.16.0 pypi_0 pypi
smmap 5.0.1 pypi_0 pypi
smplx 0.1.28 pypi_0 pypi
sniffio 1.3.1 pypi_0 pypi
soupsieve 2.5 pypi_0 pypi
sqlite 3.45.3 h5eee18b_0
stack-data 0.6.3 pypi_0 pypi
starlette 0.37.2 pypi_0 pypi
supervision 0.20.0 pypi_0 pypi
sympy 1.12 pypi_0 pypi
tabulate 0.9.0 pypi_0 pypi
tbb 2021.8.0 hdb19cb5_0
tensorboard 2.16.2 pypi_0 pypi
tensorboard-data-server 0.7.2 pypi_0 pypi
tensorboardx pypi_0 pypi
thinplate 1.0.0 pypi_0 pypi
threadpoolctl 3.5.0 pypi_0 pypi
tifffile 2024.5.10 pypi_0 pypi
timm 0.4.12 pypi_0 pypi
tk 8.6.14 h39e8969_0
tokenizers 0.10.3 pypi_0 pypi
tomli 2.0.1 pypi_0 pypi
tomlkit 0.12.0 pypi_0 pypi
toolz 0.12.1 pypi_0 pypi
torchgeometry 0.1.2 pypi_0 pypi
torchvision 0.13.0 py39_cu113 pytorch
tornado 6.4 pypi_0 pypi
tqdm 4.65.2 pypi_0 pypi
traitlets 5.14.3 pypi_0 pypi
transformers 4.15.0 pypi_0 pypi
trimesh 4.3.2 pypi_0 pypi
typer 0.12.3 pypi_0 pypi
typing_extensions 4.11.0 py39h06a4308_0
tzdata 2024.1 pypi_0 pypi
ujson 5.10.0 pypi_0 pypi
urllib3 1.26.18 pypi_0 pypi
uvicorn 0.29.0 pypi_0 pypi
uvloop 0.19.0 pypi_0 pypi
watchfiles 0.21.0 pypi_0 pypi
wcwidth 0.2.13 pypi_0 pypi
websockets 11.0.3 pypi_0 pypi
werkzeug 3.0.3 pypi_0 pypi
wheel 0.43.0 py39h06a4308_0
xtcocotools 1.14.3 pypi_0 pypi
xz 5.4.6 h5eee18b_1
yacs 0.1.8 pypi_0 pypi
yapf 0.40.2 pypi_0 pypi
zipp 3.18.1 pypi_0 pypi
zlib 1.2.13 h5eee18b_1
zstd 1.5.5 hc292b87_2
These are the packages installed in my conda env. I still get the same error.
I have observed that this error seems to be tied to the RTX 4090 and its sm_89 architecture; on an RTX 2060 I did not get this error.
Could it be a compilation issue? If the code was compiled for the 2060, it might not work on a GPU with a different SM architecture.
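One way to test the compilation hypothesis is to compare the GPU's compute capability with the architectures the installed PyTorch build was compiled for. The sketch below is only a rough diagnostic, not something taken from the DEVA code base: the helper name `arch_supported` is hypothetical, and it assumes the usual CUDA rules that a binary built for `sm_XY` runs on GPUs with the same major version and an equal or higher minor version, and that embedded PTX (`compute_XY`) can be JIT-compiled for newer GPUs. It only inspects PyTorch itself; separately compiled extensions (for example a locally built GroundingDINO op) may target a different set of architectures.

```python
import re


def arch_supported(capability, arch_list):
    """Return True if a build listing `arch_list` should run on `capability`.

    capability: (major, minor) tuple, e.g. (8, 9) for an RTX 4090.
    arch_list:  strings such as "sm_86" or "compute_75".
    Entries with suffixes (e.g. "sm_90a") are ignored in this sketch.
    """
    major, minor = capability
    target = major * 10 + minor
    for arch in arch_list:
        match = re.fullmatch(r"(sm|compute)_(\d+)", arch)
        if match is None:
            continue
        kind, value = match.group(1), int(match.group(2))
        if kind == "sm" and value // 10 == major and value <= target:
            return True  # binary code for the same major version
        if kind == "compute" and value <= target:
            return True  # PTX can be JIT-compiled for newer GPUs
    return False


try:
    import torch
except ImportError:  # keep the sketch importable without PyTorch
    torch = None

if torch is not None and torch.cuda.is_available():
    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    print("GPU compute capability:", capability)
    print("Architectures in this PyTorch build:", arch_list)
    print("Looks supported:", arch_supported(capability, arch_list))
```

If this reports the GPU as supported and the warning persists, the cause is more likely elsewhere, such as an extension built against an older toolkit or a library bug in that PyTorch/CUDA combination.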
Please feel free to re-open if there are any updates.
Was anyone able to solve this? I am having the same error on a 4090.
Can you please check if any other process is running on your GPU?
Here is the output of nvidia-smi:
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:01:00.0 Off | Off |
| 0% 43C P8 27W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
| 1 NVIDIA GeForce RTX 4090 Off | 00000000:03:00.0 Off | Off |
| 0% 42C P8 38W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
| No running processes found |
OK, it looks like everything got fixed after some googling. I was able to solve it by updating PyTorch to 2.1.2 and torchvision to 0.16.2 (CUDA 11.8 builds):
pip install torch==2.1.2+cu118 torchvision==0.16.2+cu118 --extra-index-url
Hope this helps someone else too.
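After an upgrade like the one above, it can be worth confirming that the environment really picked up the new wheels, since a conda-installed `pytorch` package and a pip-installed `torch` wheel can coexist in confusing ways. The following is a minimal sketch using only the standard library; the helper `split_version` is a hypothetical name introduced here for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def split_version(v):
    """Split '2.1.2+cu118' into ('2.1.2', 'cu118'); the tag is '' if absent."""
    release, _, local = v.partition("+")
    return release, local


for pkg in ("torch", "torchvision"):
    try:
        release, local = split_version(version(pkg))
        print(f"{pkg}: {release} (build tag: {local or 'none'})")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```

With the versions from the command above, the build tag would be expected to read `cu118` for both packages; a different tag would suggest an older install is still being picked up.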