Nvidia Data Center GPU Manager (DCGM)

Plugin: go.d.plugin
Module: dcgm

Overview

This collector gathers NVIDIA GPU telemetry from a dcgm-exporter endpoint. It supports all numeric fields exposed by the exporter and maps them into Netdata-native contexts.

It collects metrics by periodically scraping the exporter Prometheus endpoint over HTTP.
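As a rough illustration of what one scrape involves, the sketch below parses a Prometheus text payload into `(name, labels, value)` tuples. It is not the collector's code: `parse_samples`, `scrape`, and the sample payload are made up for this example, and the regular expressions cover only the common `name{label="value"} number` form, not the full exposition format. Timestamps after the value are ignored.

```python
import re
from urllib.request import urlopen

# One sample line: metric_name{label="value",...} number
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces; allows backslash-escaped characters.
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples for the numeric samples in `text`."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        try:
            value = float(raw_value)
        except ValueError:
            continue  # keep numeric samples only
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, value))
    return samples


def scrape(url, timeout=5):
    """Fetch an exporter endpoint once and parse it (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_samples(resp.read().decode("utf-8"))


# Made-up payload for illustration only; label values are placeholders.
PAYLOAD = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",uuid="GPU-example-0"} 41
DCGM_FI_DEV_GPU_TEMP{gpu="1",uuid="GPU-example-1"} 43
"""

samples = parse_samples(PAYLOAD)
```

Calling `scrape()` in a loop at a fixed interval would approximate the periodic collection described above. dcgm-exporter is commonly reachable at a URL such as `http://localhost:9400/metrics`, but the actual address depends on the deployment.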

This collector is supported on all platforms.

This collector supports collecting metrics from multiple instances of this integration, including remote instances.

Default Behavior

Auto-Detection

This integration does not support auto-detection in v1.

Limits

The collector applies global and per-metric time series limits to prevent excessive cardinality.
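The idea of a global cap combined with a per-metric cap can be sketched as follows. This is only an assumption about how such limits might work: the function name, the parameter names, and the choice to drop excess samples (rather than, say, skipping the whole metric) are hypothetical and not taken from the collector.

```python
from collections import defaultdict


def apply_series_limits(samples, max_total, max_per_metric):
    """Keep samples in order until a global or per-metric series cap is hit.

    `samples` is an iterable of (metric_name, labels, value) tuples.
    Returns (kept_samples, dropped_count).
    """
    kept = []
    per_metric = defaultdict(int)  # series accepted so far, per metric name
    dropped = 0
    for name, labels, value in samples:
        if len(kept) >= max_total or per_metric[name] >= max_per_metric:
            dropped += 1  # assumption: excess series are simply dropped
            continue
        per_metric[name] += 1
        kept.append((name, labels, value))
    return kept, dropped


# Hypothetical input: three series for one field, one for another.
demo = [
    ("field_a", {"gpu": "0"}, 1.0),
    ("field_a", {"gpu": "1"}, 2.0),
    ("field_a", {"gpu": "2"}, 3.0),
    ("field_b", {"gpu": "0"}, 4.0),
]
kept, dropped = apply_series_limits(demo, max_total=3, max_per_metric=2)
```

Here the third `field_a` series exceeds the per-metric cap and is dropped, while `field_b` still fits under the global cap.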

Performance Impact

The impact depends on dcgm-exporter field selection and resulting series cardinality.

Metrics

Metrics grouped by scope.

The scope defines the instance that the metric belongs to. An instance is uniquely identified by a set of labels.

Metrics are grouped into static Netdata contexts. Contexts are created only when matching DCGM fields are present in the exporter output.
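To make "an instance is uniquely identified by a set of labels" concrete, here is a minimal sketch that groups samples into instances by scope. The label sets mirror the per-scope label tables in this section; the function names, the placeholder field names, and the rule of skipping samples with a missing label are assumptions for illustration, not the collector's actual logic.

```python
# Identifying labels per scope, as listed in the label tables of this section.
SCOPE_LABELS = {
    "gpu": ("gpu", "uuid"),
    "mig": ("gpu", "gpu_i_id", "gpu_i_profile"),
    "nvlink": ("gpu", "gpu_uuid", "nvlink"),
    "nvswitch": ("nvswitch",),
    "cpu": ("cpu",),
    "cpu_core": ("cpu", "cpucore"),
    "exporter": ("job",),
}


def instance_key(scope, labels):
    """Return the identifying tuple for one instance, or None if a label is missing."""
    names = SCOPE_LABELS[scope]
    if any(n not in labels for n in names):
        return None
    return (scope,) + tuple(labels[n] for n in names)


def group_by_instance(scope, samples):
    """Collect {field: value} per instance from (field, labels, value) samples."""
    groups = {}
    for field, labels, value in samples:
        key = instance_key(scope, labels)
        if key is not None:
            groups.setdefault(key, {})[field] = value
    return groups


# Placeholder field names and label values, for illustration only.
demo = [
    ("temperature", {"gpu": "0", "uuid": "GPU-example-0"}, 41.0),
    ("power", {"gpu": "0", "uuid": "GPU-example-0"}, 120.0),
    ("temperature", {"gpu": "1", "uuid": "GPU-example-1"}, 43.0),
    ("temperature", {"gpu": "2"}, 40.0),  # no uuid label: skipped in this sketch
]
instances = group_by_instance("gpu", demo)
```

Two samples that share the same `gpu` and `uuid` values land in the same instance, so the demo yields two GPU instances.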

Per gpu

These metrics refer to GPU device instances.

Labels:

| Label | Description |
|---|---|
| gpu | gpu label from exporter metrics. |
| uuid | uuid label from exporter metrics. |

Metrics:

| Metric | Dimensions | Unit |
|---|---|---|
| dcgm.gpu.capability.support | cc_mode, cuda_compute_capability, gpm_support, mig_attributes, mig_ci_info, mig_gi_info, mig_max_slices, supported_clocks, supported_type_info | state |
| dcgm.gpu.clock.frequency | app_mem_clock, app_sm_clock, max_mem_clock, max_sm_clock, max_video_clock, memory, sm, video_clock | MHz |
| dcgm.gpu.compute.activity | dram, fp16, fp32, fp64, graphics_engine_active, integer, sm_active, sm_occupancy, tensor | % |
| dcgm.gpu.compute.tensor.activity | tensor_dfma, tensor_hmma, tensor_imma | % |
| dcgm.gpu.compute.media.activity | nvdec0_active, nvdec1_active, nvdec2_active, nvdec3_active, nvdec4_active, nvdec5_active, nvdec6_active, nvdec7_active, nvjpg0_active, nvjpg1_active, nvjpg2_active, nvjpg3_active, nvjpg4_active, nvjpg5_active, nvjpg6_active, nvjpg7_active, nvofa0_active, nvofa1_active | % |
| dcgm.gpu.compute.cache.activity | hostmem_cache_hit, hostmem_cache_miss, peermem_cache_hit, peermem_cache_miss | events/s |
| dcgm.gpu.compute.utilization | decoder, encoder, gpu, memory_copy | % |
| dcgm.gpu.cpu.power | module_power_util_current, sysio_power_util_current | Watts |
| dcgm.gpu.cpu.info | cpu_model, cpu_vendor | value |
| dcgm.gpu.diagnostics.results | diag_diagnostic_result, diag_eud_result, diag_memory_bandwidth_result, diag_memory_result, diag_memtest_result, diag_nccl_tests_result, diag_nvbandwidth_result, diag_pulse_test_result, diag_software_result, diag_targeted_power_result, diag_targeted_stress_result | state |
| dcgm.gpu.diagnostics.status | diag_status | state |
| dcgm.gpu.health.status | imex_daemon_status, imex_domain_status | state |
| dcgm.gpu.interconnect.connectx.error_status | connectx_correctable_err_mask, connectx_correctable_err_status, connectx_uncorrectable_err_mask, connectx_uncorrectable_err_severity, connectx_uncorrectable_err_status | state |
| dcgm.gpu.interconnect.connectx.errors | connectx_correctable_err_mask, connectx_correctable_err_status, connectx_uncorrectable_err_mask, connectx_uncorrectable_err_severity, connectx_uncorrectable_err_status | errors/s |
| dcgm.gpu.interconnect.connectx.link | connectx_active_pcie_link_speed, connectx_expect_pcie_link_speed | value |
| dcgm.gpu.interconnect.connectx.status | connectx_health | state |
| dcgm.gpu.interconnect.error_rate | c2c_link_error_intr, c2c_link_error_replay, c2c_link_error_replay_b2b | errors/s |
| dcgm.gpu.interconnect.fabric | fabric_clique_id, fabric_cluster_uuid, fabric_health_mask, fabric_manager_error_code, fabric_manager_status | state |
| dcgm.gpu.interconnect.nvlink.error_rate | gpu_nvlink_errors | errors/s |
| dcgm.gpu.interconnect.pcie.error_rate | pcie_count_correctable_errors, pcie_replay | errors/s |
| dcgm.gpu.interconnect.pcie.link.generation | link_gen, max_link_gen | generation |
| dcgm.gpu.interconnect.pcie.link.width | connectx_active_pcie_link_width, connectx_expect_pcie_link_width, link_width, max_link_width | lanes |
| dcgm.gpu.interconnect.state | c2c_link, c2c_link_power_state, c2c_link_status | state |
| dcgm.gpu.interconnect.pcie.state | diag_pcie_result | state |
| dcgm.gpu.interconnect.throughput | c2c_max_bandwidth, c2c_rx_all_bytes, c2c_rx_data_bytes, c2c_tx_all_bytes, c2c_tx_data_bytes | B/s |
| dcgm.gpu.interconnect.pcie.throughput | pcie_rx, pcie_rx_throughput, pcie_tx, pcie_tx_throughput | B/s |
| dcgm.gpu.interconnect.nvlink.throughput | nvlink_rx, nvlink_tx | B/s |
| dcgm.gpu.interconnect.total.throughput | pcie, nvlink | B/s |
| dcgm.gpu.internal.boundary | first_connectx_field_id, first_vgpu_field_id, internal_fields_0_end, internal_fields_0_start, last_connectx_field_id, last_vgpu_field_id | state |
| dcgm.gpu.inventory.identity | brand, count, cuda_visible_devices_str, minor_number, name, nvml_index, serial, uuid | value |
| dcgm.gpu.inventory.platform | platform_chassis_serial_number, platform_chassis_slot_number, platform_host_id, platform_infiniband_guid, platform_module_id, platform_peer_type, platform_tray_index | value |
| dcgm.gpu.inventory.software | inforom_config_check, inforom_config_valid, inforom_image_ver, oem_inforom_ver, power_inforom_ver, process_name, vbios_version | value |
| dcgm.gpu.memory.bar1_usage | free, used | B |
| dcgm.gpu.memory.bar1_capacity | total | B |
| dcgm.gpu.memory.ecc_error_rate | ecc_current, ecc_dbe_agg, ecc_dbe_agg_cbu, ecc_dbe_agg_dev, ecc_dbe_agg_l1, ecc_dbe_agg_l2, ecc_dbe_agg_reg, ecc_dbe_agg_shm, ecc_dbe_agg_srm, ecc_dbe_agg_tex, ecc_dbe_vol, ecc_dbe_vol_cbu, ecc_dbe_vol_dev, ecc_dbe_vol_l1, ecc_dbe_vol_l2, ecc_dbe_vol_reg, ecc_dbe_vol_shm, ecc_dbe_vol_srm, ecc_dbe_vol_tex, ecc_pending, ecc_sbe_agg, ecc_sbe_agg_cbu, ecc_sbe_agg_dev, ecc_sbe_agg_l1, ecc_sbe_agg_l2, ecc_sbe_agg_reg, ecc_sbe_agg_shm, ecc_sbe_agg_srm, ecc_sbe_agg_tex, ecc_sbe_vol, ecc_sbe_vol_cbu, ecc_sbe_vol_dev, ecc_sbe_vol_l1, ecc_sbe_vol_l2, ecc_sbe_vol_reg, ecc_sbe_vol_shm, ecc_sbe_vol_srm, ecc_sbe_vol_tex | errors/s |
| dcgm.gpu.memory.ecc_errors | ecc_current, ecc_dbe_agg_cbu, ecc_dbe_agg_dev, ecc_dbe_agg_l1, ecc_dbe_agg_l2, ecc_dbe_agg_reg, ecc_dbe_agg_shm, ecc_dbe_agg_srm, ecc_dbe_agg_tex, ecc_dbe_vol_cbu, ecc_dbe_vol_dev, ecc_dbe_vol_l1, ecc_dbe_vol_l2, ecc_dbe_vol_reg, ecc_dbe_vol_shm, ecc_dbe_vol_srm, ecc_dbe_vol_tex, ecc_inforom_ver, ecc_pending, ecc_sbe_agg_cbu, ecc_sbe_agg_dev, ecc_sbe_agg_l1, ecc_sbe_agg_l2, ecc_sbe_agg_reg, ecc_sbe_agg_shm, ecc_sbe_agg_srm, ecc_sbe_agg_tex, ecc_sbe_vol_cbu, ecc_sbe_vol_dev, ecc_sbe_vol_l1, ecc_sbe_vol_l2, ecc_sbe_vol_reg, ecc_sbe_vol_shm, ecc_sbe_vol_srm, ecc_sbe_vol_tex | errors |
| dcgm.gpu.memory.page_retirements | retired_dbe, retired_pending, retired_sbe | pages/s |
| dcgm.gpu.memory.usage | free, reserved, used | B |
| dcgm.gpu.memory.capacity | total | B |
| dcgm.gpu.memory.utilization | used_percent | % |
| dcgm.gpu.power.energy | total | mJ/s |
| dcgm.gpu.power.profiles | enforced_power_profile_mask, requested_power_profile_mask, valid_power_profile_mask | state |
| dcgm.gpu.power.smoothing | pwr_smoothing_active_preset_profile, pwr_smoothing_admin_override_percent_tmp_floor, pwr_smoothing_admin_override_ramp_down_hyst_val, pwr_smoothing_admin_override_ramp_down_rate, pwr_smoothing_admin_override_ramp_up_rate, pwr_smoothing_applied_tmp_ceil, pwr_smoothing_applied_tmp_floor, pwr_smoothing_enabled, pwr_smoothing_hw_circuitry_percent_lifetime_remaining, pwr_smoothing_imm_ramp_down_enabled, pwr_smoothing_max_num_preset_profiles, pwr_smoothing_max_percent_tmp_floor_setting, pwr_smoothing_min_percent_tmp_floor_setting, pwr_smoothing_priv_lvl, pwr_smoothing_profile_percent_tmp_floor, pwr_smoothing_profile_ramp_down_hyst_val, pwr_smoothing_profile_ramp_down_rate, pwr_smoothing_profile_ramp_up_rate | value |
| dcgm.gpu.power.usage | draw, enforced_limit, power_mgmt_limit, power_mgmt_limit_def, power_mgmt_limit_max, power_mgmt_limit_min, power_usage_instant | Watts |
| dcgm.gpu.reliability.memory_health | banks_remap_rows_avail_high, banks_remap_rows_avail_low, banks_remap_rows_avail_max, banks_remap_rows_avail_none, banks_remap_rows_avail_partial, memory_unrepairable_flag, threshold_srm | state |
| dcgm.gpu.reliability.recovery_action | get_gpu_recovery_action | state |
| dcgm.gpu.reliability.row_remap_events | correctable_remapped_rows, uncorrectable_remapped_rows | rows/s |
| dcgm.gpu.reliability.row_remap_status | row_remap_failure, row_remap_pending | state |
| dcgm.gpu.reliability.xid | xid | code |
| dcgm.gpu.state.configuration | autoboost, compute_mode, persistence_mode, sync_boost, sync_boost_violation | state |
| dcgm.gpu.state.performance | pstate | state |
| dcgm.gpu.state.virtualization | mig_mode, virtual_mode | state |
| dcgm.gpu.thermal.fan_speed | fan_speed | % |
| dcgm.gpu.thermal.temperature | connectx_device_temperature, gpu, gpu_max_op_temp, gpu_temp_limit, mem_max_op_temp, memory, shutdown_temp, slowdown_temp | Celsius |
| dcgm.gpu.throttle.reasons | clocks_event_reasons | bitmask |
| dcgm.gpu.throttle.violations | board_limit_violation, hw_power_brake_slowdown, hw_therm_slowdown, low_utilization_violation, power_violation, reliability_violation, sw_power_cap, sw_therm_slowdown, sync_boost, thermal_violation, total_app_clocks_violation, total_base_clocks_violation | milliseconds/s |
| dcgm.gpu.topology.affinity | cpu_affinity_0, cpu_affinity_1, cpu_affinity_2, cpu_affinity_3, gpu_topology_affinity, gpu_topology_pci, mem_affinity_0, mem_affinity_1, mem_affinity_2, mem_affinity_3, pci_busid, pci_combined_id, pci_subsys_id | value |
| dcgm.gpu.virtualization.vgpu.frame_rate | vgpu_frame_rate_limit | fps |
| dcgm.gpu.virtualization.vgpu.instance | vgpu_instance_ids, vgpu_pci_id, vgpu_uuid | value |
| dcgm.gpu.virtualization.vgpu.license | vgpu_instance_license_state, vgpu_license_status, vgpu_type_license | state |
| dcgm.gpu.virtualization.vgpu.memory | vgpu_memory_usage | B |
| dcgm.gpu.virtualization.vgpu.sessions | vgpu_enc_sessions_info, vgpu_enc_stats, vgpu_fbc_sessions_info, vgpu_fbc_stats | value |
| dcgm.gpu.virtualization.vgpu.software | vgpu_driver_version | value |
| dcgm.gpu.virtualization.vgpu.type | creatable_vgpu_type_ids, supported_vgpu_type_ids, vgpu_type, vgpu_type_class, vgpu_type_info, vgpu_type_name | value |
| dcgm.gpu.virtualization.vgpu.utilization | vgpu_per_process_utilization | % |
| dcgm.gpu.virtualization.vgpu.vm | vgpu_vm_gpu_instance_id, vgpu_vm_id, vgpu_vm_name | value |
| dcgm.gpu.workload.sessions | accounting_data, enc_stats, fbc_sessions_info, fbc_stats | value |

Per mig

These metrics refer to MIG instances.

Labels:

| Label | Description |
|---|---|
| gpu | gpu label from exporter metrics. |
| gpu_i_id | gpu_i_id label from exporter metrics. |
| gpu_i_profile | gpu_i_profile label from exporter metrics. |

Metrics:

| Metric | Dimensions | Unit |
|---|---|---|
| dcgm.mig.clock.frequency | app_mem_clock, app_sm_clock, max_mem_clock, max_sm_clock, max_video_clock, memory, sm, video_clock | MHz |
| dcgm.mig.compute.activity | dram, fp16, fp32, fp64, graphics_engine_active, integer, sm_active, sm_occupancy, tensor | % |
| dcgm.mig.compute.tensor.activity | tensor_dfma, tensor_hmma, tensor_imma | % |
| dcgm.mig.compute.media.activity | nvdec0_active, nvdec1_active, nvdec2_active, nvdec3_active, nvdec4_active, nvdec5_active, nvdec6_active, nvdec7_active, nvjpg0_active, nvjpg1_active, nvjpg2_active, nvjpg3_active, nvjpg4_active, nvjpg5_active, nvjpg6_active, nvjpg7_active, nvofa0_active, nvofa1_active | % |
| dcgm.mig.compute.cache.activity | hostmem_cache_hit, hostmem_cache_miss, peermem_cache_hit, peermem_cache_miss | events/s |
| dcgm.mig.compute.utilization | decoder, encoder, gpu, memory_copy | % |
| dcgm.mig.interconnect.nvlink.ber | nvlink_count_effective_ber, nvlink_count_effective_ber_float, nvlink_count_symbol_ber, nvlink_count_symbol_ber_float | ratio |
| dcgm.mig.interconnect.nvlink.congestion | nvlink_ppcnt_ibpc_port_xmit_wait | events/s |
| dcgm.mig.interconnect.error_rate | c2c_link_error_intr, c2c_link_error_replay, c2c_link_error_replay_b2b | errors/s |
| dcgm.mig.interconnect.nvlink.error_rate | gpu_nvlink_errors, nvlink_count_effective_errors, nvlink_count_fec_history_0, nvlink_count_fec_history_1, nvlink_count_fec_history_10, nvlink_count_fec_history_11, nvlink_count_fec_history_12, nvlink_count_fec_history_13, nvlink_count_fec_history_14, nvlink_count_fec_history_15, nvlink_count_fec_history_2, nvlink_count_fec_history_3, nvlink_count_fec_history_4, nvlink_count_fec_history_5, nvlink_count_fec_history_6, nvlink_count_fec_history_7, nvlink_count_fec_history_8, nvlink_count_fec_history_9, nvlink_count_link_recovery_events, nvlink_count_link_recovery_failed_events, nvlink_count_link_recovery_successful_events, nvlink_count_local_link_integrity_errors, nvlink_count_rx_buffer_overrun_errors, nvlink_count_rx_errors, nvlink_count_rx_general_errors, nvlink_count_rx_malformed_packet_errors, nvlink_count_rx_remote_errors, nvlink_count_rx_symbol_errors, nvlink_count_tx_discards, nvlink_crc_data_error, nvlink_crc_data_error_count_l0, nvlink_crc_data_error_count_l1, nvlink_crc_data_error_count_l10, nvlink_crc_data_error_count_l11, nvlink_crc_data_error_count_l12, nvlink_crc_data_error_count_l13, nvlink_crc_data_error_count_l14, nvlink_crc_data_error_count_l15, nvlink_crc_data_error_count_l16, nvlink_crc_data_error_count_l17, nvlink_crc_data_error_count_l2, nvlink_crc_data_error_count_l3, nvlink_crc_data_error_count_l4, nvlink_crc_data_error_count_l5, nvlink_crc_data_error_count_l6, nvlink_crc_data_error_count_l7, nvlink_crc_data_error_count_l8, nvlink_crc_data_error_count_l9, nvlink_crc_flit_error, nvlink_crc_flit_error_count_l0, nvlink_crc_flit_error_count_l1, nvlink_crc_flit_error_count_l10, nvlink_crc_flit_error_count_l11, nvlink_crc_flit_error_count_l12, nvlink_crc_flit_error_count_l13, nvlink_crc_flit_error_count_l14, nvlink_crc_flit_error_count_l15, nvlink_crc_flit_error_count_l16, nvlink_crc_flit_error_count_l17, nvlink_crc_flit_error_count_l2, nvlink_crc_flit_error_count_l3, nvlink_crc_flit_error_count_l4, nvlink_crc_flit_error_count_l5, nvlink_crc_flit_error_count_l6, nvlink_crc_flit_error_count_l7, nvlink_crc_flit_error_count_l8, nvlink_crc_flit_error_count_l9, nvlink_error_dl_crc, nvlink_error_dl_recovery, nvlink_error_dl_replay, nvlink_ppcnt_physical_successful_recovery_events, nvlink_ppcnt_plr_rcv_uncorrectable_code, nvlink_ppcnt_recovery_time_since_last, nvlink_ppcnt_recovery_total_successful_events, nvlink_pprm_oper_recovery, nvlink_recovery_error, nvlink_recovery_error_count_l0, nvlink_recovery_error_count_l1, nvlink_recovery_error_count_l10, nvlink_recovery_error_count_l11, nvlink_recovery_error_count_l12, nvlink_recovery_error_count_l13, nvlink_recovery_error_count_l14, nvlink_recovery_error_count_l15, nvlink_recovery_error_count_l16, nvlink_recovery_error_count_l17, nvlink_recovery_error_count_l2, nvlink_recovery_error_count_l3, nvlink_recovery_error_count_l4, nvlink_recovery_error_count_l5, nvlink_recovery_error_count_l6, nvlink_recovery_error_count_l7, nvlink_recovery_error_count_l8, nvlink_recovery_error_count_l9, nvlink_replay_error, nvlink_replay_error_count_l0, nvlink_replay_error_count_l1, nvlink_replay_error_count_l10, nvlink_replay_error_count_l11, nvlink_replay_error_count_l12, nvlink_replay_error_count_l13, nvlink_replay_error_count_l14, nvlink_replay_error_count_l15, nvlink_replay_error_count_l16, nvlink_replay_error_count_l17, nvlink_replay_error_count_l2, nvlink_replay_error_count_l3, nvlink_replay_error_count_l4, nvlink_replay_error_count_l5, nvlink_replay_error_count_l6, nvlink_replay_error_count_l7, nvlink_replay_error_count_l8, nvlink_replay_error_count_l9 | errors/s |
| dcgm.mig.interconnect.pcie.error_rate | pcie_count_correctable_errors, pcie_replay | errors/s |
| dcgm.mig.interconnect.nvlink.errors | nvlink_ppcnt_plr_rcv_uncorrectable_code | errors |
| dcgm.mig.interconnect.fabric | fabric_clique_id, fabric_cluster_uuid, fabric_health_mask, fabric_manager_error_code, fabric_manager_status | state |
| dcgm.mig.interconnect.pcie.link.generation | link_gen, max_link_gen | generation |
| dcgm.mig.interconnect.pcie.link.width | link_width, max_link_width | lanes |
| dcgm.mig.interconnect.state | c2c_link, c2c_link_power_state, c2c_link_status | state |
| dcgm.mig.interconnect.pcie.state | diag_pcie_result | state |
| dcgm.mig.interconnect.nvlink.state | gpu_topology_nvlink, nvlink_get_state, nvlink_ppcnt_physical_link_down_counter, nvlink_ppcnt_plr_rcv_code_err, nvlink_ppcnt_plr_sync_events, nvlink_ppcnt_plr_xmit_retry_events, p2p_nvlink_status | state |
| dcgm.mig.interconnect.throughput | c2c_max_bandwidth, c2c_rx_all_bytes, c2c_rx_data_bytes, c2c_tx_all_bytes, c2c_tx_data_bytes | B/s |
| dcgm.mig.interconnect.nvlink.throughput | nvlink_bandwidth_l0, nvlink_bandwidth_l1, nvlink_bandwidth_l10, nvlink_bandwidth_l11, nvlink_bandwidth_l12, nvlink_bandwidth_l13, nvlink_bandwidth_l14, nvlink_bandwidth_l15, nvlink_bandwidth_l16, nvlink_bandwidth_l17, nvlink_bandwidth_l2, nvlink_bandwidth_l3, nvlink_bandwidth_l4, nvlink_bandwidth_l5, nvlink_bandwidth_l6, nvlink_bandwidth_l7, nvlink_bandwidth_l8, nvlink_bandwidth_l9, nvlink_count_rx, nvlink_count_tx, nvlink_l0_rx, nvlink_l0_tx, nvlink_l10_rx, nvlink_l10_tx, nvlink_l11_rx, nvlink_l11_tx, nvlink_l12_rx, nvlink_l12_tx, nvlink_l13_rx, nvlink_l13_tx, nvlink_l14_rx, nvlink_l14_tx, nvlink_l15_rx, nvlink_l15_tx, nvlink_l16_rx, nvlink_l16_tx, nvlink_l17_rx, nvlink_l17_tx, nvlink_l1_rx, nvlink_l1_tx, nvlink_l2_rx, nvlink_l2_tx, nvlink_l3_rx, nvlink_l3_tx, nvlink_l4_rx, nvlink_l4_tx, nvlink_l5_rx, nvlink_l5_tx, nvlink_l6_rx, nvlink_l6_tx, nvlink_l7_rx, nvlink_l7_tx, nvlink_l8_rx, nvlink_l8_tx, nvlink_l9_rx, nvlink_l9_tx, nvlink_rx_bandwidth, nvlink_rx_bandwidth_l0, nvlink_rx_bandwidth_l1, nvlink_rx_bandwidth_l10, nvlink_rx_bandwidth_l11, nvlink_rx_bandwidth_l12, nvlink_rx_bandwidth_l13, nvlink_rx_bandwidth_l14, nvlink_rx_bandwidth_l15, nvlink_rx_bandwidth_l16, nvlink_rx_bandwidth_l17, nvlink_rx_bandwidth_l2, nvlink_rx_bandwidth_l3, nvlink_rx_bandwidth_l4, nvlink_rx_bandwidth_l5, nvlink_rx_bandwidth_l6, nvlink_rx_bandwidth_l7, nvlink_rx_bandwidth_l8, nvlink_rx_bandwidth_l9, nvlink_rx, nvlink_tx_bandwidth, nvlink_tx_bandwidth_l0, nvlink_tx_bandwidth_l1, nvlink_tx_bandwidth_l10, nvlink_tx_bandwidth_l11, nvlink_tx_bandwidth_l12, nvlink_tx_bandwidth_l13, nvlink_tx_bandwidth_l14, nvlink_tx_bandwidth_l15, nvlink_tx_bandwidth_l16, nvlink_tx_bandwidth_l17, nvlink_tx_bandwidth_l2, nvlink_tx_bandwidth_l3, nvlink_tx_bandwidth_l4, nvlink_tx_bandwidth_l5, nvlink_tx_bandwidth_l6, nvlink_tx_bandwidth_l7, nvlink_tx_bandwidth_l8, nvlink_tx_bandwidth_l9, nvlink_tx | B/s |
| dcgm.mig.interconnect.pcie.throughput | pcie_rx, pcie_rx_throughput, pcie_tx, pcie_tx_throughput | B/s |
| dcgm.mig.interconnect.total.throughput | pcie, nvlink | B/s |
| dcgm.mig.interconnect.nvlink.traffic | nvlink_count_rx_packets, nvlink_count_tx_packets, nvlink_ppcnt_plr_rcv_codes, nvlink_ppcnt_plr_xmit_codes, nvlink_ppcnt_plr_xmit_retry_codes | events/s |
| dcgm.mig.memory.bar1_usage | free, used | B |
| dcgm.mig.memory.bar1_capacity | total | B |
| dcgm.mig.memory.ecc_error_rate | ecc_current, ecc_dbe_agg, ecc_dbe_agg_cbu, ecc_dbe_agg_dev, ecc_dbe_agg_l1, ecc_dbe_agg_l2, ecc_dbe_agg_reg, ecc_dbe_agg_shm, ecc_dbe_agg_srm, ecc_dbe_agg_tex, ecc_dbe_vol, ecc_dbe_vol_cbu, ecc_dbe_vol_dev, ecc_dbe_vol_l1, ecc_dbe_vol_l2, ecc_dbe_vol_reg, ecc_dbe_vol_shm, ecc_dbe_vol_srm, ecc_dbe_vol_tex, ecc_pending, ecc_sbe_agg, ecc_sbe_agg_cbu, ecc_sbe_agg_dev, ecc_sbe_agg_l1, ecc_sbe_agg_l2, ecc_sbe_agg_reg, ecc_sbe_agg_shm, ecc_sbe_agg_srm, ecc_sbe_agg_tex, ecc_sbe_vol, ecc_sbe_vol_cbu, ecc_sbe_vol_dev, ecc_sbe_vol_l1, ecc_sbe_vol_l2, ecc_sbe_vol_reg, ecc_sbe_vol_shm, ecc_sbe_vol_srm, ecc_sbe_vol_tex, nvlink_ecc_data_error | errors/s |
| dcgm.mig.memory.ecc_errors | ecc_current, ecc_dbe_agg_cbu, ecc_dbe_agg_dev, ecc_dbe_agg_l1, ecc_dbe_agg_l2, ecc_dbe_agg_reg, ecc_dbe_agg_shm, ecc_dbe_agg_srm, ecc_dbe_agg_tex, ecc_dbe_vol_cbu, ecc_dbe_vol_dev, ecc_dbe_vol_l1, ecc_dbe_vol_l2, ecc_dbe_vol_reg, ecc_dbe_vol_shm, ecc_dbe_vol_srm, ecc_dbe_vol_tex, ecc_inforom_ver, ecc_pending, ecc_sbe_agg_cbu, ecc_sbe_agg_dev, ecc_sbe_agg_l1, ecc_sbe_agg_l2, ecc_sbe_agg_reg, ecc_sbe_agg_shm, ecc_sbe_agg_srm, ecc_sbe_agg_tex, ecc_sbe_vol_cbu, ecc_sbe_vol_dev, ecc_sbe_vol_l1, ecc_sbe_vol_l2, ecc_sbe_vol_reg, ecc_sbe_vol_shm, ecc_sbe_vol_srm, ecc_sbe_vol_tex | errors |
| dcgm.mig.memory.page_retirements | retired_dbe, retired_pending, retired_sbe | pages/s |
| dcgm.mig.memory.usage | free, reserved, used | B |
| dcgm.mig.memory.capacity | total | B |
| dcgm.mig.memory.utilization | used_percent | % |
| dcgm.mig.power.energy | total | mJ/s |
| dcgm.mig.power.profiles | enforced_power_profile_mask, requested_power_profile_mask, valid_power_profile_mask | state |
| dcgm.mig.power.smoothing | pwr_smoothing_active_preset_profile, pwr_smoothing_admin_override_percent_tmp_floor, pwr_smoothing_admin_override_ramp_down_hyst_val, pwr_smoothing_admin_override_ramp_down_rate, pwr_smoothing_admin_override_ramp_up_rate, pwr_smoothing_applied_tmp_ceil, pwr_smoothing_applied_tmp_floor, pwr_smoothing_enabled, pwr_smoothing_hw_circuitry_percent_lifetime_remaining, pwr_smoothing_imm_ramp_down_enabled, pwr_smoothing_max_num_preset_profiles, pwr_smoothing_max_percent_tmp_floor_setting, pwr_smoothing_min_percent_tmp_floor_setting, pwr_smoothing_priv_lvl, pwr_smoothing_profile_percent_tmp_floor, pwr_smoothing_profile_ramp_down_hyst_val, pwr_smoothing_profile_ramp_down_rate, pwr_smoothing_profile_ramp_up_rate | value |
| dcgm.mig.power.usage | draw, enforced_limit, power_mgmt_limit, power_mgmt_limit_def, power_mgmt_limit_max, power_mgmt_limit_min, power_usage_instant | Watts |
| dcgm.mig.reliability.memory_health | banks_remap_rows_avail_high, banks_remap_rows_avail_low, banks_remap_rows_avail_max, banks_remap_rows_avail_none, banks_remap_rows_avail_partial, memory_unrepairable_flag, threshold_srm | state |
| dcgm.mig.reliability.recovery_action | get_gpu_recovery_action | state |
| dcgm.mig.reliability.row_remap_events | correctable_remapped_rows, uncorrectable_remapped_rows | rows/s |
| dcgm.mig.reliability.row_remap_status | row_remap_failure, row_remap_pending | state |
| dcgm.mig.reliability.xid | xid | code |
| dcgm.mig.state.configuration | autoboost, compute_mode, persistence_mode, sync_boost, sync_boost_violation | state |
| dcgm.mig.state.performance | pstate | state |
| dcgm.mig.state.virtualization | mig_mode, virtual_mode | state |
| dcgm.mig.thermal.fan_speed | fan_speed | % |
| dcgm.mig.thermal.temperature | gpu, gpu_max_op_temp, gpu_temp_limit, mem_max_op_temp, memory, shutdown_temp, slowdown_temp | Celsius |
| dcgm.mig.throttle.reasons | clocks_event_reasons | bitmask |
| dcgm.mig.throttle.violations | board_limit_violation, hw_power_brake_slowdown, hw_therm_slowdown, low_utilization_violation, power_violation, reliability_violation, sw_power_cap, sw_therm_slowdown, sync_boost, thermal_violation, total_app_clocks_violation, total_base_clocks_violation | milliseconds/s |

Per nvlink

These metrics refer to NVLink link instances.

Labels:

| Label | Description |
|---|---|
| gpu | gpu label from exporter metrics. |
| gpu_uuid | gpu_uuid label from exporter metrics. |
| nvlink | nvlink label from exporter metrics. |

Metrics:

| Metric | Dimensions | Unit |
|---|---|---|
| dcgm.nvlink.interconnect.ber | nvlink_count_effective_ber, nvlink_count_effective_ber_float, nvlink_count_symbol_ber, nvlink_count_symbol_ber_float | ratio |
| dcgm.nvlink.interconnect.congestion | nvlink_ppcnt_ibpc_port_xmit_wait | events/s |
| dcgm.nvlink.interconnect.error_rate | gpu_nvlink_errors, nvlink_count_effective_errors, nvlink_count_fec_history_0, nvlink_count_fec_history_1, nvlink_count_fec_history_10, nvlink_count_fec_history_11, nvlink_count_fec_history_12, nvlink_count_fec_history_13, nvlink_count_fec_history_14, nvlink_count_fec_history_15, nvlink_count_fec_history_2, nvlink_count_fec_history_3, nvlink_count_fec_history_4, nvlink_count_fec_history_5, nvlink_count_fec_history_6, nvlink_count_fec_history_7, nvlink_count_fec_history_8, nvlink_count_fec_history_9, nvlink_count_link_recovery_events, nvlink_count_link_recovery_failed_events, nvlink_count_link_recovery_successful_events, nvlink_count_local_link_integrity_errors, nvlink_count_rx_buffer_overrun_errors, nvlink_count_rx_errors, nvlink_count_rx_general_errors, nvlink_count_rx_malformed_packet_errors, nvlink_count_rx_remote_errors, nvlink_count_rx_symbol_errors, nvlink_count_tx_discards, nvlink_crc_data_error, nvlink_crc_data_error_count_l0, nvlink_crc_data_error_count_l1, nvlink_crc_data_error_count_l10, nvlink_crc_data_error_count_l11, nvlink_crc_data_error_count_l12, nvlink_crc_data_error_count_l13, nvlink_crc_data_error_count_l14, nvlink_crc_data_error_count_l15, nvlink_crc_data_error_count_l16, nvlink_crc_data_error_count_l17, nvlink_crc_data_error_count_l2, nvlink_crc_data_error_count_l3, nvlink_crc_data_error_count_l4, nvlink_crc_data_error_count_l5, nvlink_crc_data_error_count_l6, nvlink_crc_data_error_count_l7, nvlink_crc_data_error_count_l8, nvlink_crc_data_error_count_l9, nvlink_crc_flit_error, nvlink_crc_flit_error_count_l0, nvlink_crc_flit_error_count_l1, nvlink_crc_flit_error_count_l10, nvlink_crc_flit_error_count_l11, nvlink_crc_flit_error_count_l12, nvlink_crc_flit_error_count_l13, nvlink_crc_flit_error_count_l14, nvlink_crc_flit_error_count_l15, nvlink_crc_flit_error_count_l16, nvlink_crc_flit_error_count_l17, nvlink_crc_flit_error_count_l2, nvlink_crc_flit_error_count_l3, nvlink_crc_flit_error_count_l4, nvlink_crc_flit_error_count_l5, nvlink_crc_flit_error_count_l6, nvlink_crc_flit_error_count_l7, nvlink_crc_flit_error_count_l8, nvlink_crc_flit_error_count_l9, nvlink_error_dl_crc, nvlink_error_dl_recovery, nvlink_error_dl_replay, nvlink_ppcnt_physical_successful_recovery_events, nvlink_ppcnt_plr_rcv_uncorrectable_code, nvlink_ppcnt_recovery_time_since_last, nvlink_ppcnt_recovery_total_successful_events, nvlink_pprm_oper_recovery, nvlink_recovery_error, nvlink_recovery_error_count_l0, nvlink_recovery_error_count_l1, nvlink_recovery_error_count_l10, nvlink_recovery_error_count_l11, nvlink_recovery_error_count_l12, nvlink_recovery_error_count_l13, nvlink_recovery_error_count_l14, nvlink_recovery_error_count_l15, nvlink_recovery_error_count_l16, nvlink_recovery_error_count_l17, nvlink_recovery_error_count_l2, nvlink_recovery_error_count_l3, nvlink_recovery_error_count_l4, nvlink_recovery_error_count_l5, nvlink_recovery_error_count_l6, nvlink_recovery_error_count_l7, nvlink_recovery_error_count_l8, nvlink_recovery_error_count_l9, nvlink_replay_error, nvlink_replay_error_count_l0, nvlink_replay_error_count_l1, nvlink_replay_error_count_l10, nvlink_replay_error_count_l11, nvlink_replay_error_count_l12, nvlink_replay_error_count_l13, nvlink_replay_error_count_l14, nvlink_replay_error_count_l15, nvlink_replay_error_count_l16, nvlink_replay_error_count_l17, nvlink_replay_error_count_l2, nvlink_replay_error_count_l3, nvlink_replay_error_count_l4, nvlink_replay_error_count_l5, nvlink_replay_error_count_l6, nvlink_replay_error_count_l7, nvlink_replay_error_count_l8, nvlink_replay_error_count_l9 | errors/s |
| dcgm.nvlink.interconnect.errors | nvlink_ppcnt_plr_rcv_uncorrectable_code | errors |
| dcgm.nvlink.interconnect.state | gpu_topology_nvlink, nvlink_get_state, nvlink_ppcnt_physical_link_down_counter, nvlink_ppcnt_plr_rcv_code_err, nvlink_ppcnt_plr_sync_events, nvlink_ppcnt_plr_xmit_retry_events, p2p_nvlink_status | state |
| dcgm.nvlink.interconnect.throughput | nvlink_bandwidth, nvlink_bandwidth_l0, nvlink_bandwidth_l1, nvlink_bandwidth_l10, nvlink_bandwidth_l11, nvlink_bandwidth_l12, nvlink_bandwidth_l13, nvlink_bandwidth_l14, nvlink_bandwidth_l15, nvlink_bandwidth_l16, nvlink_bandwidth_l17, nvlink_bandwidth_l2, nvlink_bandwidth_l3, nvlink_bandwidth_l4, nvlink_bandwidth_l5, nvlink_bandwidth_l6, nvlink_bandwidth_l7, nvlink_bandwidth_l8, nvlink_bandwidth_l9, nvlink_count_rx, nvlink_count_tx, nvlink_l0_rx, nvlink_l0_tx, nvlink_l10_rx, nvlink_l10_tx, nvlink_l11_rx, nvlink_l11_tx, nvlink_l12_rx, nvlink_l12_tx, nvlink_l13_rx, nvlink_l13_tx, nvlink_l14_rx, nvlink_l14_tx, nvlink_l15_rx, nvlink_l15_tx, nvlink_l16_rx, nvlink_l16_tx, nvlink_l17_rx, nvlink_l17_tx, nvlink_l1_rx, nvlink_l1_tx, nvlink_l2_rx, nvlink_l2_tx, nvlink_l3_rx, nvlink_l3_tx, nvlink_l4_rx, nvlink_l4_tx, nvlink_l5_rx, nvlink_l5_tx, nvlink_l6_rx, nvlink_l6_tx, nvlink_l7_rx, nvlink_l7_tx, nvlink_l8_rx, nvlink_l8_tx, nvlink_l9_rx, nvlink_l9_tx, nvlink_rx_bandwidth, nvlink_rx_bandwidth_l0, nvlink_rx_bandwidth_l1, nvlink_rx_bandwidth_l10, nvlink_rx_bandwidth_l11, nvlink_rx_bandwidth_l12, nvlink_rx_bandwidth_l13, nvlink_rx_bandwidth_l14, nvlink_rx_bandwidth_l15, nvlink_rx_bandwidth_l16, nvlink_rx_bandwidth_l17, nvlink_rx_bandwidth_l2, nvlink_rx_bandwidth_l3, nvlink_rx_bandwidth_l4, nvlink_rx_bandwidth_l5, nvlink_rx_bandwidth_l6, nvlink_rx_bandwidth_l7, nvlink_rx_bandwidth_l8, nvlink_rx_bandwidth_l9, nvlink_rx, nvlink_tx_bandwidth, nvlink_tx_bandwidth_l0, nvlink_tx_bandwidth_l1, nvlink_tx_bandwidth_l10, nvlink_tx_bandwidth_l11, nvlink_tx_bandwidth_l12, nvlink_tx_bandwidth_l13, nvlink_tx_bandwidth_l14, nvlink_tx_bandwidth_l15, nvlink_tx_bandwidth_l16, nvlink_tx_bandwidth_l17, nvlink_tx_bandwidth_l2, nvlink_tx_bandwidth_l3, nvlink_tx_bandwidth_l4, nvlink_tx_bandwidth_l5, nvlink_tx_bandwidth_l6, nvlink_tx_bandwidth_l7, nvlink_tx_bandwidth_l8, nvlink_tx_bandwidth_l9, nvlink_tx | B/s |
| dcgm.nvlink.interconnect.traffic | nvlink_count_rx_packets, nvlink_count_tx_packets, nvlink_ppcnt_plr_rcv_codes, nvlink_ppcnt_plr_xmit_codes, nvlink_ppcnt_plr_xmit_retry_codes | events/s |
| dcgm.nvlink.internal.boundary | nvlink_ppcnt_recovery_time_between_last_two | state |
| dcgm.nvlink.memory.ecc_error_rate | nvlink_ecc_data_error | errors/s |

Per nvswitch

These metrics refer to NVSwitch instances.

Labels:

| Label | Description |
|---|---|
| nvswitch | nvswitch label from exporter metrics. |

Metrics:

| Metric | Dimensions | Unit |
|---|---|---|
| dcgm.nvswitch.interconnect.nvswitch.current | nvswitch_current_iddq, nvswitch_current_iddq_dvdd, nvswitch_current_iddq_rev | value |
| dcgm.nvswitch.interconnect.nvswitch.errors | nvswitch_fatal_errors, nvswitch_link_crc_errors, nvswitch_link_crc_errors_lane0, nvswitch_link_crc_errors_lane1, nvswitch_link_crc_errors_lane2, nvswitch_link_crc_errors_lane3, nvswitch_link_crc_errors_lane4, nvswitch_link_crc_errors_lane5, nvswitch_link_crc_errors_lane6, nvswitch_link_crc_errors_lane7, nvswitch_link_fatal_errors, nvswitch_link_flit_errors, nvswitch_link_non_fatal_errors, nvswitch_link_recovery_errors, nvswitch_link_replay_errors, nvswitch_non_fatal_errors | errors/s |
| dcgm.nvswitch.interconnect.nvswitch.latency | nvswitch_link_latency_count_vc0, nvswitch_link_latency_count_vc1, nvswitch_link_latency_count_vc2, nvswitch_link_latency_count_vc3, nvswitch_link_latency_high_vc0, nvswitch_link_latency_high_vc1, nvswitch_link_latency_high_vc2, nvswitch_link_latency_high_vc3, nvswitch_link_latency_low_vc0, nvswitch_link_latency_low_vc1, nvswitch_link_latency_low_vc2, nvswitch_link_latency_low_vc3, nvswitch_link_latency_medium_vc0, nvswitch_link_latency_medium_vc1, nvswitch_link_latency_medium_vc2, nvswitch_link_latency_medium_vc3, nvswitch_link_latency_panic_vc0, nvswitch_link_latency_panic_vc1, nvswitch_link_latency_panic_vc2, nvswitch_link_latency_panic_vc3 | events/s |
| dcgm.nvswitch.interconnect.nvswitch.power | nvswitch_power_dvdd, nvswitch_power_hvdd, nvswitch_power_vdd | Watts |
| dcgm.nvswitch.interconnect.nvswitch.status | nvswitch_link_status, nvswitch_link_type, nvswitch_reset_required | state |
| dcgm.nvswitch.interconnect.nvswitch.throughput | nvswitch_link_throughput_rx, nvswitch_link_throughput_tx, nvswitch_throughput_rx, nvswitch_throughput_tx | B/s |
| dcgm.nvswitch.interconnect.nvswitch.topology | nvswitch_device_uuid, nvswitch_link_device_link_id, nvswitch_link_device_link_sid, nvswitch_link_id, nvswitch_link_remote_pcie_bus, nvswitch_link_remote_pcie_device, nvswitch_link_remote_pcie_domain, nvswitch_link_remote_pcie_function, nvswitch_pcie_bus, nvswitch_pcie_device, nvswitch_pcie_domain, nvswitch_pcie_function, nvswitch_phys_id | value |
| dcgm.nvswitch.interconnect.nvswitch.voltage | nvswitch_voltage_mvolt | mV |
| dcgm.nvswitch.internal.boundary | first_nvswitch_field_id, last_nvswitch_field_id | state |
| dcgm.nvswitch.memory.ecc_error_rate | nvswitch_link_ecc_errors, nvswitch_link_ecc_errors_lane0, nvswitch_link_ecc_errors_lane1, nvswitch_link_ecc_errors_lane2, nvswitch_link_ecc_errors_lane3, nvswitch_link_ecc_errors_lane4, nvswitch_link_ecc_errors_lane5, nvswitch_link_ecc_errors_lane6, nvswitch_link_ecc_errors_lane7 | errors/s |
| dcgm.nvswitch.thermal.temperature | nvswitch_temperature_current, nvswitch_temperature_limit_shutdown, nvswitch_temperature_limit_slowdown | Celsius |

Per cpu

These metrics refer to host CPU instances.

Labels:

| Label | Description |
|---|---|
| cpu | cpu label from exporter metrics. |

Metrics:

| Metric | Dimensions | Unit |
|---|---|---|
| dcgm.cpu.clock.frequency | cpu_clock_current | MHz |
| dcgm.cpu.cpu.info | cpu_model, cpu_vendor | value |
| dcgm.cpu.cpu.power | cpu_power_limit, cpu_power_util_current | Watts |
| dcgm.cpu.cpu.temperature | cpu_temp_critical, cpu_temp_current, cpu_temp_warning | Celsius |
| dcgm.cpu.cpu.utilization | cpu_util, cpu_util_irq, cpu_util_nice, cpu_util_sys, cpu_util_user | % |
| dcgm.cpu.diagnostics.results | diag_cpu_eud_result | state |

Per cpu_core

These metrics refer to host CPU core instances.

Labels:

| Label | Description |
|-------|-------------|
| cpu | cpu label from exporter metrics. |
| cpucore | cpucore label from exporter metrics. |

Metrics:

| Metric | Dimensions | Unit |
|--------|------------|------|
| dcgm.cpu_core.clock.frequency | cpu_clock_current | MHz |
| dcgm.cpu_core.cpu.info | cpu_model, cpu_vendor | value |
| dcgm.cpu_core.cpu.power | cpu_power_limit, cpu_power_util_current | Watts |
| dcgm.cpu_core.cpu.temperature | cpu_temp_critical, cpu_temp_current, cpu_temp_warning | Celsius |
| dcgm.cpu_core.cpu.utilization | cpu_util, cpu_util_irq, cpu_util_nice, cpu_util_sys, cpu_util_user | % |
| dcgm.cpu_core.diagnostics.results | diag_cpu_eud_result | state |

Per exporter

These metrics refer to exporter/global instances.

Labels:

| Label | Description |
|-------|-------------|
| job | job label from exporter metrics. |

Metrics:

| Metric | Dimensions | Unit |
|--------|------------|------|
| dcgm.exporter.health.status | bind_unbind_event | state |
| dcgm.exporter.inventory.software | cuda_driver_version, driver_version, nvml_version | value |

Alerts

The following alerts are available:

| Alert name | On metric | Description |
|------------|-----------|-------------|
| dcgm_gpu_xid_errors | dcgm.gpu.reliability.xid | NVIDIA driver reported GPU XID error on GPU ${label:gpu} |
| dcgm_gpu_row_remap_failure | dcgm.gpu.reliability.row_remap_status | GPU row remapping failed on GPU ${label:gpu} |
| dcgm_gpu_uncorrectable_remapped_rows | dcgm.gpu.reliability.row_remap_events | Uncorrectable remapped rows increased on GPU ${label:gpu} |
| dcgm_gpu_power_violation | dcgm.gpu.throttle.violations | Power throttling detected on GPU ${label:gpu} |
| dcgm_gpu_thermal_violation | dcgm.gpu.throttle.violations | Thermal throttling detected on GPU ${label:gpu} |

Setup

You can configure the dcgm collector in two ways:

| Method | Best for | How to |
|--------|----------|--------|
| UI | Fast setup without editing files | Go to Nodes → Configure this node → Collectors → Jobs, search for dcgm, then click + to add a job. |
| File | If you prefer configuring via file, or need to automate deployments (e.g., with Ansible) | Edit go.d/dcgm.conf and add a job. |

Important: UI configuration requires a paid Netdata Cloud plan.

Prerequisites

Run dcgm-exporter

Install DCGM and run dcgm-exporter so that a Prometheus endpoint is available (default :9400/metrics).
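
Before adding a Netdata job, it can help to confirm that the endpoint actually serves DCGM fields. The following is a minimal sketch, not part of Netdata or dcgm-exporter: the function names `dcgm_metric_names` and `fetch_metrics` are hypothetical, and the `DCGM_` name prefix is an assumption about how the exporter names its metrics.

```python
import urllib.request


def dcgm_metric_names(exposition_text):
    """Return the sorted, de-duplicated metric names starting with DCGM_."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith("DCGM_"):  # assumed exporter naming convention
            names.add(name)
    return sorted(names)


def fetch_metrics(url="http://127.0.0.1:9400/metrics", timeout=10):
    """Fetch the raw exporter output from the default endpoint on this page."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `dcgm_metric_names(fetch_metrics())` on the exporter host should return a non-empty list once dcgm-exporter is running. An empty list suggests the exporter is up but exposes no DCGM fields.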

Configure exporter field list

The default exporter profile exposes a small subset of fields. Use the Netdata recommended profile: dcgm-exporter-netdata.csv (raw download: https://raw.githubusercontent.com/netdata/netdata/master/src/go/plugin/go.d/collector/dcgm/dcgm-exporter-netdata.csv).

The Netdata profile enables 127 fields by default and documents all remaining known DCGM fields as commented entries. To customize beyond the baseline, uncomment the field you need and comment out one currently enabled field.
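
When swapping fields in and out of the profile, a quick count of enabled versus commented entries helps confirm the edit did what you intended. This is a rough sketch under stated assumptions: it assumes each CSV entry has the form `FIELD_NAME, type, help text`, that a leading `#` marks a commented entry, and that field names start with `DCGM_`. The helper name `split_profile` is hypothetical.

```python
def split_profile(csv_text):
    """Split a field-list CSV into (enabled, commented) DCGM field names."""
    enabled, commented = [], []
    for raw in csv_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        is_comment = line.startswith("#")
        body = line.lstrip("#").strip()
        field = body.split(",", 1)[0].strip()
        # Ignore free-text comment lines that do not name a DCGM field.
        if not field.startswith("DCGM_"):
            continue
        (commented if is_comment else enabled).append(field)
    return enabled, commented
```

Comparing `len(enabled)` before and after an edit shows whether the number of enabled fields stayed the same.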

Runtime validation artifacts: src/go/plugin/go.d/collector/dcgm/runtime-validation-driver-590.48.01-dcgm-exporter-4.4.1-4.5.2.md and src/go/plugin/go.d/collector/dcgm/runtime-validation-driver-590.48.01-dcgm-exporter-4.4.1-4.5.2.json

Validation is primarily version-scoped (NVIDIA driver + DCGM/DCGM-exporter versions), so treat it as a strong baseline rather than a guarantee of universal compatibility.

Example: dcgm-exporter -f /path/to/dcgm-exporter-netdata.csv

Keep collection intervals aligned

Set Netdata update_every to the same value as the dcgm-exporter collection interval (default 30 seconds). The exporter's -c flag takes milliseconds, so dcgm-exporter -c 30000 pairs with Netdata update_every: 30.
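
The unit conversion between the two settings is easy to get wrong, so here is a small illustrative helper. The function name `update_every_for` is hypothetical, and rounding up for intervals that are not whole seconds is a design choice made for this sketch, not documented collector behavior.

```python
def update_every_for(collect_interval_ms):
    """Convert a dcgm-exporter -c value (milliseconds) to update_every (seconds)."""
    if collect_interval_ms <= 0:
        raise ValueError("collection interval must be positive")
    # Ceiling division: round up so Netdata does not scrape more often
    # than the exporter refreshes its values.
    return max(1, -(-collect_interval_ms // 1000))
```

For the default exporter interval, `update_every_for(30000)` returns `30`, matching the example above.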

Enable profiling capabilities (optional)

Profiling fields may require additional privileges or capabilities in your runtime environment (for example, the SYS_ADMIN capability when dcgm-exporter runs in a container).

Configuration

Options

The following options can be defined globally: update_every, autodetection_retry.

Config options
| Group | Option | Description | Default | Required |
|-------|--------|-------------|---------|----------|
| Collection | update_every | Data collection interval (seconds). Keep this aligned with dcgm-exporter collection interval. | 30 | no |
| Collection | autodetection_retry | Autodetection retry interval (seconds). Set 0 to disable. | 0 | no |
| Target | url | DCGM exporter metrics endpoint URL. | http://127.0.0.1:9400/metrics | yes |
| Target | timeout | HTTP request timeout (seconds). | 10 | no |
| Limits | max_time_series | Global time series limit. If exceeded, collection is skipped for this cycle. | 2000 | no |
| Limits | max_time_series_per_metric | Per-metric time series limit. Metrics above this limit are skipped. | 200 | no |
| HTTP Auth | username | Username for Basic HTTP authentication. | | no |
| HTTP Auth | password | Password for Basic HTTP authentication. | | no |
| HTTP Auth | bearer_token_file | Path to a file containing a bearer token. | | no |
| TLS | tls_skip_verify | Skip TLS certificate and hostname verification (insecure). | no | no |
| TLS | tls_ca | Path to CA bundle used to validate the server certificate. | | no |
| TLS | tls_cert | Path to client TLS certificate (for mTLS). | | no |
| TLS | tls_key | Path to client TLS private key (for mTLS). | | no |
| Proxy | proxy_url | HTTP proxy URL. | | no |
| Proxy | proxy_username | Username for proxy authentication. | | no |
| Proxy | proxy_password | Password for proxy authentication. | | no |
| Request | headers | Additional HTTP headers to include in the request. | | no |
| Request | method | HTTP method. | GET | no |
| Request | body | HTTP request body. | | no |
| Request | not_follow_redirects | Do not follow HTTP redirects. | no | no |
| Request | force_http2 | Force HTTP/2 (including h2c over TCP). | no | no |
| Virtual Node | vnode | Associate this job with a Virtual Node. | | no |

via UI

Configure the dcgm collector from the Netdata web interface:

  1. Go to Nodes.
  2. Select the node where you want the dcgm data-collection job to run and click the gear icon (Configure this node). That node will run the data collection.
  3. The Collectors → Jobs view opens by default.
  4. In the Search box, type dcgm (or scroll the list) to locate the dcgm collector.
  5. Click the + next to the dcgm collector to add a new job.
  6. Fill in the job fields, then click Test to verify the configuration and Submit to save.
    • Test runs the job with the provided settings and shows whether data can be collected.
    • If it fails, an error message appears with details (for example, connection refused, timeout, or command execution errors), so you can adjust and retest.

via File

The configuration file name for this integration is go.d/dcgm.conf.

The file format is YAML. Generally, the structure is:

update_every: 1
autodetection_retry: 0
jobs:
  - name: some_name1
  - name: some_name2

You can edit the configuration file using the edit-config script from the Netdata config directory.

cd /etc/netdata 2>/dev/null || cd /opt/netdata/etc/netdata
sudo ./edit-config go.d/dcgm.conf

Examples

Local exporter

Collect metrics from a local dcgm-exporter endpoint.

Config
jobs:
  - name: local
    url: http://127.0.0.1:9400/metrics
    update_every: 30

TLS endpoint

Collect metrics over HTTPS with a custom CA certificate.

Config
jobs:
  - name: secure
    url: https://dcgm-exporter.example.com:9400/metrics
    update_every: 30
    tls_ca: /etc/netdata/certs/dcgm-ca.crt
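
To check outside Netdata that the CA bundle validates the exporter's certificate, a short probe can be useful. The sketch below is an illustration only. The names `make_tls_context` and `probe` are hypothetical, and the `ca_path` argument merely plays the role that `tls_ca` plays in the job configuration.

```python
import ssl
import urllib.request


def make_tls_context(ca_path=None):
    """Build a verifying TLS context; ca_path stands in for the tls_ca option."""
    # With ca_path=None the system trust store is used instead.
    return ssl.create_default_context(cafile=ca_path)


def probe(url, ca_path=None, timeout=10):
    """Return the HTTP status of a GET request against the exporter endpoint."""
    ctx = make_tls_context(ca_path)
    with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
        return resp.status
```

With the values from the example above, `probe("https://dcgm-exporter.example.com:9400/metrics", "/etc/netdata/certs/dcgm-ca.crt")` raises an error if the certificate chain or hostname does not validate.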

Increased cardinality limits

Increase limits when collecting large field sets and multiple entities.

Config
jobs:
  - name: dcgm_large
    url: http://127.0.0.1:9400/metrics
    update_every: 30
    max_time_series: 10000
    max_time_series_per_metric: 2000
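
To pick sensible limits, it helps to estimate how many series a scrape produces. The sketch below illustrates the two limits as the options table describes them; it is not the collector's actual implementation. The order of the checks (global limit first, then per-metric) is an assumption, and `apply_limits` is a hypothetical name.

```python
from collections import Counter


def apply_limits(exposition_text, max_time_series=2000,
                 max_time_series_per_metric=200):
    """Return (kept, skipped), or (None, None) when the global limit is exceeded.

    kept maps metric name -> series count; skipped lists metrics over the
    per-metric limit.
    """
    per_metric = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name = line.split("{", 1)[0].split(" ", 1)[0]
        per_metric[name] += 1  # one sample line counts as one time series
    # Assumed order: the global check runs first and skips the whole cycle.
    if sum(per_metric.values()) > max_time_series:
        return None, None
    kept = {m: n for m, n in per_metric.items()
            if n <= max_time_series_per_metric}
    skipped = sorted(m for m, n in per_metric.items()
                     if n > max_time_series_per_metric)
    return kept, skipped
```

Running this against a saved copy of the exporter output shows which metrics would exceed the defaults (2000 and 200) and roughly how far the limits would need to be raised.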

Troubleshooting

Debug Mode

Important: Debug mode is not supported for data collection jobs created via the UI using the Dyncfg feature.

To troubleshoot issues with the dcgm collector, run the go.d.plugin with the debug option enabled. The output should give you clues as to why the collector isn't working.

  • Navigate to the plugins.d directory, usually at /usr/libexec/netdata/plugins.d/. If that's not the case on your system, open netdata.conf and look for the plugins setting under [directories].

    cd /usr/libexec/netdata/plugins.d/
  • Switch to the netdata user.

    sudo -u netdata -s
  • Run the go.d.plugin to debug the collector:

    ./go.d.plugin -d -m dcgm

    To debug a specific job:

    ./go.d.plugin -d -m dcgm -j jobName

Getting Logs

If you're encountering problems with the dcgm collector, follow these steps to retrieve logs and identify potential issues:

  • Run the command specific to your system (systemd, non-systemd, or Docker container).
  • Examine the output for any warnings or error messages that might indicate issues. These messages should provide clues about the root cause of the problem.

System with systemd

Use the following command to view logs generated since the last Netdata service restart:

journalctl _SYSTEMD_INVOCATION_ID="$(systemctl show --value --property=InvocationID netdata)" --namespace=netdata --grep dcgm

System without systemd

Locate the collector log file, typically at /var/log/netdata/collector.log, and use grep to filter for the collector's name:

grep dcgm /var/log/netdata/collector.log

Note: This method shows logs from all restarts. Focus on the latest entries for troubleshooting current issues.

Docker Container

If your Netdata runs in a Docker container named "netdata" (replace if different), use this command:

docker logs netdata 2>&1 | grep dcgm
