sst-elements
SST memory statistics producing only 0's in output
Working off of the sst-core/sst-elements devel branch from the end of June.
Compiler and platform:
gcc --version
gcc (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
sst-core configured with:
./configure --prefix=/home/local/UFAD/aravindneela/local/sstcore --with-boost=/home/local/UFAD/aravindneela/local/packages/boost-1.56
sst-elements configured with:
./configure --prefix=/home/local/UFAD/aravindneela/local/sstelements --with-sst-core=/home/local/UFAD/aravindneela/local/sstcore --with-dramsim= --with-nvdimmsim= --with-hybridsim=/home/local/UFAD/aravindneela/local/packages/HybridSim --with-pin=/home/local/UFAD/aravindneela/local/packages/pin-2.14-71313-gcc.4.4.7-linux
Output of Ariel is:
Ariel Memory Management Statistics:
---------------------------------------------------------------------
Page Table Sizes:
- Map entries 0
Page Table Coverages:
- Bytes 0
PERFORMING SAVE OF CACHE TABLE!!!
got to save state in nvdimm
save file was state/nvdimm_restore.txt
NVDIMM is saving the used table, dirty table and address map
TLB Misses: 0
TLB Hits: 0
Total prefetches: 0
Unused prefetches in cache: 0
Unused prefetch victims: 0
Prefetch hit NOPs: 0
Prefetch cheat count: 0
Unique one misses: 0
Unique stream buffers: 0
Stream buffers hits: 0
PERFORMING SAVE OF CACHE TABLE!!!
got to save state in nvdimm
save file was state/nvdimm_restore.txt
NVDIMM is saving the used table, dirty table and address map
TLB Misses: 0
TLB Hits: 0
Total prefetches: 0
Unused prefetches in cache: 0
Unused prefetch victims: 0
Prefetch hit NOPs: 0
Prefetch cheat count: 0
Unique one misses: 0
Unique stream buffers: 0
Stream buffers hits: 0
PERFORMING SAVE OF CACHE TABLE!!!
got to save state in nvdimm
save file was state/nvdimm_restore.txt
NVDIMM is saving the used table, dirty table and address map
TLB Misses: 0
TLB Hits: 0
Total prefetches: 0
Unused prefetches in cache: 0
Unused prefetch victims: 0
Prefetch hit NOPs: 0
Prefetch cheat count: 0
Unique one misses: 0
Unique stream buffers: 0
Stream buffers hits: 0
The input Python script is:
import sst
import os
next_core_id = 0
next_network_id = 0
next_memory_ctrl_id = 0
next_l3_cache_id = 0
clock = "2660MHz"
memory_clock = "200MHz"
coherence_protocol = "MESI"
cores_per_group = 2
memory_controllers_per_group = 1
groups = 4
os.environ['OMP_NUM_THREADS'] = str(cores_per_group * groups)
l3cache_blocks_per_group = 2
l3cache_block_size = "2MB"
l3_cache_per_core = int(l3cache_blocks_per_group / cores_per_group)
l3_cache_remainder = l3cache_blocks_per_group - (l3_cache_per_core * cores_per_group)
ring_latency = "300ps" # 2.66 GHz time period plus slack for ringstop latency
ring_bandwidth = "96GiB/s" # 2.66GHz clock, moves 64-bytes per cycle, plus overhead = 36B/c
ring_flit_size = "8B"
memory_network_bandwidth = "96GiB/s"
mem_interleave_size = 64 # Do 64B cache-line level interleaving
memory_capacity = 16384 # Size of memory in MBs
page_size = 4 # In KB
num_pages = memory_capacity * 1024 / page_size
streamN = 1000000
l1_prefetch_params = {
    "prefetcher": "cassini.StridePrefetcher",
    "prefetcher.reach": 4,
    "prefetcher.detect_range" : 1
}
l2_prefetch_params = {
    "prefetcher": "cassini.StridePrefetcher",
    "prefetcher.reach": 16,
    "prefetcher.detect_range" : 1
}
ringstop_params = {
    "torus:shape" : groups * (cores_per_group + memory_controllers_per_group + l3cache_blocks_per_group),
    "output_latency" : "25ps",
    "xbar_bw" : ring_bandwidth,
    "input_buf_size" : "2KiB",
    "input_latency" : "25ps",
    "num_ports" : "3",
    "torus:local_ports" : "1",
    "flit_size" : ring_flit_size,
    "output_buf_size" : "2KiB",
    "link_bw" : ring_bandwidth,
    "torus:width" : "1",
    "topology" : "merlin.torus"
}
# ariel cpu
ariel = sst.Component("a0", "ariel.ariel")
ariel.addParams({
    "verbose" : 1,
    "clock" : clock,
    "maxcorequeue" : 256,
    "maxissuepercycle" : 3,
    "pipetimeout" : 0,
    "corecount" : groups * cores_per_group,
    "arielmode" : 0, # IMPORTANT: Assumes your application has an "ariel_enable()" in it, otherwise change to 1 (no enable - start simulation immediately) or 2 (auto-detect)
    "executable" : "/home/local/UFAD/aravindneela/cmtbone2/nek5/examples/3dbox/./nek5000", # CHANGE THIS
    #"executable" : "./hello", # CHANGE THIS
    # New: memory manager is different for single & multi-level memory
    # Single (default):
    "memmgr" : "ariel.MemoryManagerSimple",
    "memmgr.pagemappolicy" : "LINEAR", # or RANDOMIZED
    "memmgr.pagesize0" : page_size * 1024,
    "memmgr.pagecount0" : num_pages,
    # Multi-level:
    # Two pools. By default allocations occur to pool 0.
    # Pool 0: size0 = level_0_pagesize * level_0_pagecount; Physical addresses will be 0 to size0-1;
    # Pool 1: size1 = level_1_pagesize * level_1_pagecount; Physical addresses will be size0 to size0+size1-1;
    # "memmgr" : "ariel.MemoryManagerMalloc",
    #"memmgr.pagemappolicy" : "LINEAR",
    #"memmgr.memorylevels" : 2, # Number of memory pools
    #"memmgr.defaultlevel" : 0, # Default pool to allocate into
    #"memmgr.pagesize0" : level_0_pagesize,
    #"memmgr.pagecount0" : level_0_pagecount,
    #"memmgr.pagesize1" : level_1_pagesize,
    #"memmgr.pagecount1" : level_1_pagecount,
})
l1_params = {
    "cache_frequency": clock,
    "cache_size": "32KiB",
    "associativity": 8,
    "access_latency_cycles": 4,
    "L1": 1,
    # Default params
    # "cache_line_size": 64,
    # "coherence_protocol": coherence_protocol,
    # "replacement_policy": "lru",
    "maxRequestDelay" : "1000000",
}
l2_params = {
    "cache_frequency": clock,
    "cache_size": "256KiB",
    "associativity": 8,
    "access_latency_cycles": 6,
    "mshr_num_entries" : 16,
    "network_bw": ring_bandwidth,
    # Default params
    #"cache_line_size": 64,
    #"coherence_protocol": coherence_protocol,
    #"replacement_policy": "lru",
}
l3_params = {
    "access_latency_cycles" : "12",
    "cache_frequency" : clock,
    "associativity" : "16",
    "cache_size" : l3cache_block_size,
    "mshr_num_entries" : "4096",
    "network_bw": ring_bandwidth,
    # Distributed caches
    "num_cache_slices" : str(groups * l3cache_blocks_per_group),
    "slice_allocation_policy" : "rr",
    # Default params
    # "replacement_policy" : "lru",
    # "cache_line_size" : "64",
    # "coherence_protocol" : coherence_protocol,
}
mem_params = {
    "backend.mem_size" : str(memory_capacity / (groups * memory_controllers_per_group)) + "MiB",
    "backend" : "memHierarchy.hybridsim",
    "clock" : memory_clock,
    "network_bw": ring_bandwidth,
    "max_requests_per_cycle" : 1,
    "do_not_back" : 1,
    "backend.system_ini" : "/home/local/UFAD/aravindneela/scratch/src/foraravind/hybridsim.ini",
    #"backend.device_ini" : "/home/taniabanerjee/scratch/src/DRAMSim2/ini/DDR3_micron_32M_8B_x4_sg125.ini",
}
dc_params = {
    "interleave_size": str(mem_interleave_size) + "B",
    "interleave_step": str((groups * memory_controllers_per_group) * (mem_interleave_size)) + "B",
    "entry_cache_size": 256*1024*1024, #Entry cache size of mem/blocksize
    "clock": memory_clock,
    "network_bw": ring_bandwidth,
    # Default params
    # "coherence_protocol": coherence_protocol,
}
router_map = {}
print "Configuring Ring Network-on-Chip..."
for next_ring_stop in range((cores_per_group + memory_controllers_per_group + l3cache_blocks_per_group) * groups):
    ring_rtr = sst.Component("rtr." + str(next_ring_stop), "merlin.hr_router")
    ring_rtr.addParams(ringstop_params)
    ring_rtr.addParams({
        "id" : next_ring_stop
    })
    router_map["rtr." + str(next_ring_stop)] = ring_rtr
for next_ring_stop in range((cores_per_group + memory_controllers_per_group + l3cache_blocks_per_group) * groups):
    if next_ring_stop == 0:
        rtr_link_positive = sst.Link("rtr_pos_" + str(next_ring_stop))
        rtr_link_positive.connect( (router_map["rtr.0"], "port0", ring_latency), (router_map["rtr.1"], "port1", ring_latency) )
        rtr_link_negative = sst.Link("rtr_neg_" + str(next_ring_stop))
        rtr_link_negative.connect( (router_map["rtr.0"], "port1", ring_latency), (router_map["rtr." + str(((cores_per_group + memory_controllers_per_group + l3cache_blocks_per_group) * groups) - 1)], "port0", ring_latency) )
    elif next_ring_stop == ((cores_per_group + memory_controllers_per_group + l3cache_blocks_per_group) * groups) - 1:
        rtr_link_positive = sst.Link("rtr_pos_" + str(next_ring_stop))
        rtr_link_positive.connect( (router_map["rtr." + str(next_ring_stop)], "port0", ring_latency), (router_map["rtr.0"], "port1", ring_latency) )
        rtr_link_negative = sst.Link("rtr_neg_" + str(next_ring_stop))
        rtr_link_negative.connect( (router_map["rtr." + str(next_ring_stop)], "port1", ring_latency), (router_map["rtr." + str(next_ring_stop-1)], "port0", ring_latency) )
    else:
        rtr_link_positive = sst.Link("rtr_pos_" + str(next_ring_stop))
        rtr_link_positive.connect( (router_map["rtr." + str(next_ring_stop)], "port0", ring_latency), (router_map["rtr." + str(next_ring_stop+1)], "port1", ring_latency) )
        rtr_link_negative = sst.Link("rtr_neg_" + str(next_ring_stop))
        rtr_link_negative.connect( (router_map["rtr." + str(next_ring_stop)], "port1", ring_latency), (router_map["rtr." + str(next_ring_stop-1)], "port0", ring_latency) )
for next_group in range(groups):
    print "Configuring core and memory controller group " + str(next_group) + "..."
    for next_active_core in range(cores_per_group):
        for next_l3_cache_block in range(l3_cache_per_core):
            print "Creating L3 cache block " + str(next_l3_cache_id) + "..."
            l3cache = sst.Component("l3cache_" + str(next_l3_cache_id), "memHierarchy.Cache")
            l3cache.addParams(l3_params)
            l3cache.addParams({
                "network_address" : next_network_id,
                "slice_id" : str(next_l3_cache_id)
            })
            l3_ring_link = sst.Link("l3_" + str(next_l3_cache_id) + "_link")
            l3_ring_link.connect( (l3cache, "directory", ring_latency), (router_map["rtr." + str(next_network_id)], "port2", ring_latency) )
            next_l3_cache_id = next_l3_cache_id + 1
            next_network_id = next_network_id + 1
        print "Creating Core " + str(next_active_core) + " in Group " + str(next_group)
        l1 = sst.Component("l1cache_" + str(next_core_id), "memHierarchy.Cache")
        l1.addParams(l1_params)
        #l1.addParams(l1_prefetch_params)
        l2 = sst.Component("l2cache_" + str(next_core_id), "memHierarchy.Cache")
        l2.addParams({
            "network_address" : next_network_id
        })
        l2.addParams(l2_params)
        #l2.addParams(l2_prefetch_params)
        arielL1Link = sst.Link("cpu_cache_link_" + str(next_core_id))
        arielL1Link.connect((ariel, "cache_link_%d"%next_core_id, ring_latency), (l1, "high_network_0", ring_latency))
        arielL1Link.setNoCut()
        l2_core_link = sst.Link("l2cache_" + str(next_core_id) + "_link")
        l2_core_link.connect((l1, "low_network_0", ring_latency), (l2, "high_network_0", ring_latency))
        l2_core_link.setNoCut()
        l2_ring_link = sst.Link("l2_ring_link_" + str(next_core_id))
        l2_ring_link.connect((l2, "cache", ring_latency), (router_map["rtr." + str(next_network_id)], "port2", ring_latency))
        next_network_id = next_network_id + 1
        next_core_id = next_core_id + 1
    for next_l3_cache_block in range(l3_cache_remainder):
        print "Creating L3 cache block: " + str(next_l3_cache_id) + "..."
        l3cache = sst.Component("l3cache_" + str(next_l3_cache_id), "memHierarchy.Cache")
        l3cache.addParams(l3_params)
        l3cache.addParams({
            "network_address" : next_network_id,
            "slice_id" : str(next_l3_cache_id)
        })
        l3_ring_link = sst.Link("l3_" + str(next_l3_cache_id) + "_link")
        l3_ring_link.connect( (l3cache, "directory", ring_latency), (router_map["rtr." + str(next_network_id)], "port2", ring_latency) )
        next_l3_cache_id = next_l3_cache_id + 1
        next_network_id = next_network_id + 1
    for next_mem_ctrl in range(memory_controllers_per_group):
        local_size = memory_capacity / (groups * memory_controllers_per_group)
        mem = sst.Component("memory_" + str(next_memory_ctrl_id), "memHierarchy.MemController")
        mem.addParams(mem_params)
        dc = sst.Component("dc_" + str(next_memory_ctrl_id), "memHierarchy.DirectoryController")
        dc.addParams({
            "network_address" : next_network_id,
            "addr_range_start" : next_memory_ctrl_id * mem_interleave_size,
            "addr_range_end" : (memory_capacity * 1024 * 1024) - (groups * memory_controllers_per_group * mem_interleave_size) + (next_memory_ctrl_id * mem_interleave_size)
        })
        dc.addParams(dc_params)
        memLink = sst.Link("mem_link_" + str(next_memory_ctrl_id))
        memLink.connect((mem, "direct_link", ring_latency), (dc, "memory", ring_latency))
        netLink = sst.Link("dc_link_" + str(next_memory_ctrl_id))
        netLink.connect((dc, "network", ring_latency), (router_map["rtr." + str(next_network_id)], "port2", ring_latency))
        next_network_id = next_network_id + 1
        next_memory_ctrl_id = next_memory_ctrl_id + 1
# ===============================================================================
# Enable SST Statistics Outputs for this simulation
sst.setStatisticLoadLevel(16)
sst.enableAllStatisticsForAllComponents({"type":"sst.AccumulatorStatistic"})
sst.setStatisticOutput("sst.statOutputCSV")
sst.setStatisticOutputOptions( {
    "filepath" : "./stats-snb-ariel-dram.csv",
    "separator" : ", "
} )
print "Completed configuring the SST Sandy Bridge model"
Can you post the Ariel output before the "Ariel Memory Management Statistics:"? Is there an 'ariel_enable()' call in the application you are running?
@gvoskuilen - the deck is assuming ariel_enable() is called, so that would be my guess too.
Hi @nmhamster - The 'arielmode' in the Python script is set to 0, which I believe means it is enabled. Please correct me if I am wrong.
@gvoskuilen - The output before Ariel management is:
Configuring Ring Network-on-Chip...
Configuring core and memory controller group 0...
Creating L3 cache block 0...
Creating Core 0 in Group 0
Creating L3 cache block 1...
Creating Core 1 in Group 0
Configuring core and memory controller group 1...
Creating L3 cache block 2...
Creating Core 0 in Group 1
Creating L3 cache block 3...
Creating Core 1 in Group 1
Configuring core and memory controller group 2...
Creating L3 cache block 4...
Creating Core 0 in Group 2
Creating L3 cache block 5...
Creating Core 1 in Group 2
Configuring core and memory controller group 3...
Creating L3 cache block 6...
Creating Core 0 in Group 3
Creating L3 cache block 7...
Creating Core 1 in Group 3
Completed configuring the SST Sandy Bridge model
ArielComponent[arielcpu.cc:52:ArielCPU] Creating Ariel component...
ArielComponent[arielcpu.cc:55:ArielCPU] Configuring for 8 cores...
ArielComponent[arielcpu.cc:58:ArielCPU] Configuring for check addresses = no
ArielComponent[arielcpu.cc:140:ArielCPU] Loading memory manger: ariel.MemoryManagerSimple
ArielComponent[arielcpu.cc:148:ArielCPU] Memory manager construction is completed.
ArielComponent[arielcpu.cc:179:ArielCPU] Model specifies that there are 0 application arguments
ArielComponent[arielcpu.cc:186:ArielCPU] Interception and re-instrumentation of multi-level memory calls is DISABLED.
ArielComponent[arielcpu.cc:194:ArielCPU] Tracking the stack and dumping on malloc calls is DISABLED.
ArielComponent[arielcpu.cc:199:ArielCPU] Base pipe name: /sst_shmem_24476-0-1681692777
ArielComponent[arielcpu.cc:210:ArielCPU] Processing application arguments...
ArielComponent[arielcpu.cc:313:ArielCPU] Completed processing application arguments.
ArielComponent[arielcpu.cc:320:ArielCPU] Creating core to cache links...
ArielComponent[arielcpu.cc:330:ArielCPU] Creating processor cores and cache links...
ArielComponent[arielcpu.cc:333:ArielCPU] Configuring cores and cache links...
ArielComponent[arielcpu.cc:357:ArielCPU] Registering ArielCPU clock at 2660MHz
ArielComponent[arielcpu.cc:361:ArielCPU] Clocks registered.
ArielComponent[arielcpu.cc:369:ArielCPU] Completed initialization of the Ariel CPU.
l3cache_0: No MSHR lookup latency provided (mshr_latency_cycles)...intrapolated to 3 cycles.
l2cache_0: No MSHR lookup latency provided (mshr_latency_cycles)...intrapolated to 2 cycles.
l3cache_1: No MSHR lookup latency provided (mshr_latency_cycles)...intrapolated to 3 cycles.
l2cache_1: No MSHR lookup latency provided (mshr_latency_cycles)...intrapolated to 2 cycles.
Creating DRAM with /home/local/UFAD/aravindneela/scratch/src/DRAMSim2/ini/DDR3_micron_32M_8B_x4_sg125.ini
== Loading device model file '/home/local/UFAD/aravindneela/scratch/src/DRAMSim2/ini/DDR3_micron_32M_8B_x4_sg125.ini' ==
== Loading system model file '/home/local/UFAD/aravindneela/scratch/src/sst-elements/src/sst/elements/memHierarchy/tests/system.ini' ==
WARNING: UNKNOWN KEY 'DEBUG_TRANS_FLOW' IN INI FILE
===== MemorySystem 0 =====
CH. 0 TOTAL_STORAGE : 2048MB | 1 Ranks | 16 Devices per rank
===== MemorySystem 1 =====
CH. 1 TOTAL_STORAGE : 2048MB | 1 Ranks | 16 Devices per rank
Creating Flash with /home/local/UFAD/aravindneela/scratch/src/NVDIMMSim/src/ini/samsung_K9XXG08UXM_gc_test.ini
Done with creating memories
l3cache_2: No MSHR lookup latency provided (mshr_latency_cycles)...intrapolated to 3 cycles.
l2cache_2: No MSHR lookup latency provided (mshr_latency_cycles)...intrapolated to 2 cycles.
l3cache_3: No MSHR lookup latency provided (mshr_latency_cycles)...intrapolated to 3 cycles.
l2cache_3: No MSHR lookup latency provided (mshr_latency_cycles)...intrapolated to 2 cycles.
Creating DRAM with /home/local/UFAD/aravindneela/scratch/src/DRAMSim2/ini/DDR3_micron_32M_8B_x4_sg125.ini
== Loading device model file '/home/local/UFAD/aravindneela/scratch/src/DRAMSim2/ini/DDR3_micron_32M_8B_x4_sg125.ini' ==
== Loading system model file '/home/local/UFAD/aravindneela/scratch/src/sst-elements/src/sst/elements/memHierarchy/tests/system.ini' ==
WARNING: UNKNOWN KEY 'DEBUG_TRANS_FLOW' IN INI FILE
===== MemorySystem 0 =====
CH. 0 TOTAL_STORAGE : 2048MB | 1 Ranks | 16 Devices per rank
===== MemorySystem 1 =====
CH. 1 TOTAL_STORAGE : 2048MB | 1 Ranks | 16 Devices per rank
Creating Flash with /home/local/UFAD/aravindneela/scratch/src/NVDIMMSim/src/ini/samsung_K9XXG08UXM_gc_test.ini
Done with creating memories
l3cache_4: No MSHR lookup latency provided (mshr_latency_cycles)...intrapolated to 3 cycles.
l2cache_4: No MSHR lookup latency provided (mshr_latency_cycles)...intrapolated to 2 cycles.
l3cache_5: No MSHR lookup latency provided (mshr_latency_cycles)...intrapolated to 3 cycles.
l2cache_5: No MSHR lookup latency provided (mshr_latency_cycles)...intrapolated to 2 cycles.
Creating DRAM with /home/local/UFAD/aravindneela/scratch/src/DRAMSim2/ini/DDR3_micron_32M_8B_x4_sg125.ini
== Loading device model file '/home/local/UFAD/aravindneela/scratch/src/DRAMSim2/ini/DDR3_micron_32M_8B_x4_sg125.ini' ==
== Loading system model file '/home/local/UFAD/aravindneela/scratch/src/sst-elements/src/sst/elements/memHierarchy/tests/system.ini' ==
WARNING: UNKNOWN KEY 'DEBUG_TRANS_FLOW' IN INI FILE
===== MemorySystem 0 =====
CH. 0 TOTAL_STORAGE : 2048MB | 1 Ranks | 16 Devices per rank
===== MemorySystem 1 =====
CH. 1 TOTAL_STORAGE : 2048MB | 1 Ranks | 16 Devices per rank
Creating Flash with /home/local/UFAD/aravindneela/scratch/src/NVDIMMSim/src/ini/samsung_K9XXG08UXM_gc_test.ini
Done with creating memories
l3cache_6: No MSHR lookup latency provided (mshr_latency_cycles)...intrapolated to 3 cycles.
l2cache_6: No MSHR lookup latency provided (mshr_latency_cycles)...intrapolated to 2 cycles.
l3cache_7: No MSHR lookup latency provided (mshr_latency_cycles)...intrapolated to 3 cycles.
l2cache_7: No MSHR lookup latency provided (mshr_latency_cycles)...intrapolated to 2 cycles.
Creating DRAM with /home/local/UFAD/aravindneela/scratch/src/DRAMSim2/ini/DDR3_micron_32M_8B_x4_sg125.ini
== Loading device model file '/home/local/UFAD/aravindneela/scratch/src/DRAMSim2/ini/DDR3_micron_32M_8B_x4_sg125.ini' ==
== Loading system model file '/home/local/UFAD/aravindneela/scratch/src/sst-elements/src/sst/elements/memHierarchy/tests/system.ini' ==
WARNING: UNKNOWN KEY 'DEBUG_TRANS_FLOW' IN INI FILE
===== MemorySystem 0 =====
CH. 0 TOTAL_STORAGE : 2048MB | 1 Ranks | 16 Devices per rank
===== MemorySystem 1 =====
CH. 1 TOTAL_STORAGE : 2048MB | 1 Ranks | 16 Devices per rank
Creating Flash with /home/local/UFAD/aravindneela/scratch/src/NVDIMMSim/src/ini/samsung_K9XXG08UXM_gc_test.ini
Done with creating memories
ArielComponent[arielcpu.cc:377:init] Launching PIN...
ArielComponent[arielcpu.cc:488:forkPINChild] Launching executable: /home/local/UFAD/aravindneela/local/packages/pin-2.14-71313-gcc.4.4.7-linux/pin.sh...
SSTARIEL: Loading Ariel Tool to connect to SST on pipe: /sst_shmem_24476-0-1681692777 max instruction count: 1000000000 max core count: 8
SSTARIEL: Function profiling is disabled.
ARIEL-SST: Did not find ARIEL_OVERRIDE_POOL in the environment, no override applies.
ARIEL-SST PIN tool activating with 8 threads
ArielComponent[arielcpu.cc:382:init] Returned from launching PIN. Waiting for child to attach.
ArielComponent[arielcpu.cc:385:init] Child has attached!
DRAMSim2 Clock Frequency =1Hz, CPU Clock Frequency=1Hz
DRAMSim2 Clock Frequency =1Hz, CPU Clock Frequency=1Hz
DRAMSim2 Clock Frequency =1Hz, CPU Clock Frequency=1Hz
DRAMSim2 Clock Frequency =1Hz, CPU Clock Frequency=1Hz
ARIEL: Default memory pool set to 0
ARIEL: Tool is configured to suspend profiling until program control
ARIEL: Starting program.
Identified routine: clock_gettime, replacing with Ariel equivalent... Replacement complete.
Program output
SSTARIEL: Execution completed, shutting down.
CORE ID: 0 PROCESSED AN EXIT EVENT
PERFORMING SAVE OF CACHE TABLE!!!
got to save state in nvdimm
save file was state/nvdimm_restore.txt
NVDIMM is saving the used table, dirty table and address map
TLB Misses: 0
TLB Hits: 0
Total prefetches: 0
Unused prefetches in cache: 0
Unused prefetch victims: 0
Prefetch hit NOPs: 0
Prefetch cheat count: 0
Unique one misses: 0
Unique stream buffers: 0
Stream buffers hits: 0
ArielComponent[arielcpu.cc:394:finish] Ariel Processor Information:
ArielComponent[arielcpu.cc:395:finish] Completed at: 208928605 nanoseconds.
ArielComponent[arielcpu.cc:396:finish] Ariel Component Statistics (By Core)
Ariel Memory Management Statistics:
Page Table Sizes:
- Map entries 0
Page Table Coverages:
- Bytes 0
PERFORMING SAVE OF CACHE TABLE!!!
got to save state in nvdimm
save file was state/nvdimm_restore.txt
NVDIMM is saving the used table, dirty table and address map
TLB Misses: 0
TLB Hits: 0
Total prefetches: 0
Unused prefetches in cache: 0
Unused prefetch victims: 0
Prefetch hit NOPs: 0
Prefetch cheat count: 0
Unique one misses: 0
Unique stream buffers: 0
Stream buffers hits: 0
PERFORMING SAVE OF CACHE TABLE!!!
got to save state in nvdimm
save file was state/nvdimm_restore.txt
NVDIMM is saving the used table, dirty table and address map
TLB Misses: 0
TLB Hits: 0
Total prefetches: 0
Unused prefetches in cache: 0
Unused prefetch victims: 0
Prefetch hit NOPs: 0
Prefetch cheat count: 0
Unique one misses: 0
Unique stream buffers: 0
Stream buffers hits: 0
PERFORMING SAVE OF CACHE TABLE!!!
got to save state in nvdimm
save file was state/nvdimm_restore.txt
NVDIMM is saving the used table, dirty table and address map
TLB Misses: 0
TLB Hits: 0
Total prefetches: 0
Unused prefetches in cache: 0
Unused prefetch victims: 0
Prefetch hit NOPs: 0
Prefetch cheat count: 0
Unique one misses: 0
Unique stream buffers: 0
Stream buffers hits: 0
Simulation is complete, simulated time: 208.929 ms
I have replaced the actual output from the program with "Program output" since it was very long. The application executed perfectly, but the memory statistics are still giving trouble.
@aravindneela The above shows that ariel is configured to wait until it finds an 'ariel_enable()' call in the program - "ARIEL: Tool is configured to suspend profiling until program control". Ariel is not finding this call in your program so Ariel is never triggered to start simulating. There are two ways to fix this - set 'arielmode' to '1' in the config, or add a call to 'ariel_enable()' in your program. The ariel API, including 'ariel_enable()' can be found in the sst-tools repo in sst-tools/tools/ariel/api/ariel.h (use the devel branch from github).
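For reference, a minimal sketch of the first fix applied to the posted config. The `arielmode` values are as documented in the comments of the script above; only the single parameter changes:

```python
# In snb-ariel-dram.py, change the Ariel mode so simulation starts
# immediately instead of waiting for an ariel_enable() call:
ariel.addParams({
    "arielmode" : 1,   # 0 = wait for ariel_enable(), 1 = start immediately, 2 = auto-detect
})
```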
Thanks @gvoskuilen. That worked! I have another doubt. I am pasting part of the output from the terminal after running the script (snb-ariel-dram.py) using SST below:
PERFORMING SAVE OF CACHE TABLE!!!
TLB Misses: 0
TLB Hits: 0
Total prefetches: 0
Unused prefetches in cache: 0
Unused prefetch victims: 0
Prefetch hit NOPs: 0
Prefetch cheat count: 0
Unique one misses: 0
Unique stream buffers: 0
Stream buffers hits: 0
This set repeats 4 times and is always 0. Is this normal? Is there any documentation I can refer to understand these outputs?
@aravindneela That output is from HybridSim so I would take a look at their documentation. Four repetitions is probably one from each of the HybridSim instances.
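The count of four lines up with the posted script: one MemController (and hence one HybridSim backend instance) is created per memory controller per group. A quick sanity check, using the values from the config above:

```python
# Values taken from snb-ariel-dram.py
groups = 4
memory_controllers_per_group = 1

# The script creates one memHierarchy.MemController per group per
# controller, and each one instantiates its own HybridSim backend,
# so the HybridSim stats block should repeat this many times:
num_hybridsim_instances = groups * memory_controllers_per_group
print(num_hybridsim_instances)  # 4
```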
@gvoskuilen Another question on this same setup. How does one execute mpirun with Ariel? An earlier version of SST (perhaps 6.0) did allow it to be specified using the "executable" parameter to Ariel (e.g. "executable": "mpirun -n 4 /home/tania/nek5000"), but that does not work with the latest devel version.
@taniabanerjee Ariel doesn't support applications running MPI because it conflicts with SST's own use of MPI. Earlier versions of Ariel did not support it either, so your simulations would have been relying on unsupported behavior. Is simulating MPI applications essential to your work?
@gvoskuilen Our application CMT-bone is MPI based, so we are looking for ways to simulate MPI programs. In your paper "The Potentials and Perils of Multi-Level Memory", you used a number of parallel codes, such as miniFE, LULESH, MiniAero, and RSBench. Did you have OpenMP versions for these?
@taniabanerjee Yes we were running OpenMP versions of those.
@gvoskuilen Is there any other way to execute pure MPI programs (say 4 ranks per node) through SST configured for multi-level memories?
Hi @gvoskuilen, I have a doubt with respect to running the snb-ariel-dram.py script on SST. While running the command "sst -n 4 snb-ariel-dram.py", is the -n used for running SST in parallel with 4 cores, or is it used for simulating the application on 4 processors of the system being simulated?
@taniabanerjee No there is not unfortunately. If you had traces of each MPI rank you could run them using Prospero, but it's a simple memory-trace execution model and I'm not sure if it would be sufficient for what you want. Also, Prospero does not have any support for multi-level memory (e.g., the ability to map pages and allocations to different memories) so you may need to add that if you went that route.
@aravindneela The 'n' is used for running SST in parallel with 4 threads.
This is fixed.