Analyze Simulation Results

This page explains how to analyze simulation results.

Analyze SimpleSSD Log and Statistics

SimpleSSD prints a debug log that contains detailed information about I/O requests, together with statistics for all components.

SimpleSSD debug log

For SimpleSSD-FullSystem, you can find the debug log on the stdout of gem5. For SimpleSSD-Standalone, you can choose where to save the debug log with the DebugLogFile option in the [global] section of the simulation configuration.
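
For example, the relevant part of the Standalone configuration could look like this (a minimal sketch; the log file name is your choice):

[global]
DebugLogFile = debug.log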

Here is an example of the debug log when read-ahead is triggered.

 1  861004146876: HIL::NVMe: SQ 3    | Submission Queue Tail Doorbell | Item count in queue 0 -> 1 | head 432 | tail 432 -> 433
 2  861004362088: HIL::NVMe: NVM     | READ  | NSID 1     | 1B0 + 1
 3  861004362088: HIL: READ  | REQ     477 | LCA 432 + 1 | BYTE 0 + 4096
 4  861004362088: ICL::GenericCache: READ  | REQ     477-1    | LCA 432 | SIZE 4096
 5  861004362088: ICL::GenericCache: READ  | Cache hit at (0, 71) | 861004362088 - 861005056598 (694510)
 6  861004362088: ICL::GenericCache: READ  | Prefetch triggered
 7  861004362088: ICL::GenericCache: ----- | Begin eviction
 8  861004362088: ICL::GenericCache: ----- | End eviction | 861008457768 - 861008457768 (0)
 9  861004362088: FTL: READ  | LPN 5
10  861004362088: PAL::PALOLD: READ  | Block 2555 | Page 1
11  861004362088: PAL::PALOLD: READ  | C     0 | W     0 | D     0 | P     0 | B   511 | P     1
12  861004362088: FTL::PageMapping: READ  | LPN 5 | 861008457768 - 861126478092 (118020324)
13  861004362088: ICL::GenericCache: READ  | Cache miss at (0, 77) | 861008457768 - 861126775592 (118317824)
14
15  ...
16
17  861004362088: FTL: READ  | LPN 5
18  861004362088: PAL::PALOLD: READ  | Block 2555 | Page 1
19  861004362088: PAL::PALOLD: READ  | C    11 | W     0 | D     0 | P     0 | B   511 | P     1
20  861004362088: FTL::PageMapping: READ  | LPN 5 | 861008457768 - 861126478092 (118020324)
21  861004362088: ICL::GenericCache: READ  | Cache miss at (3, 77) | 861008457768 - 861126775592 (118317824)
22  861004362088: FTL: READ  | LPN 5
23  861004362088: PAL::PALOLD: READ  | Block 2555 | Page 1
24  861004362088: PAL::PALOLD: READ  | C     0 | W     0 | D     1 | P     0 | B   511 | P     1
25  861004362088: FTL::PageMapping: READ  | LPN 5 | 861008457768 - 861138228620 (129770852)
26  861004362088: ICL::GenericCache: READ  | Cache miss at (4, 77) | 861008457768 - 861138526120 (130068352)
27
28  ...
29
30  861004362088: FTL: READ  | LPN 5
31  861004362088: PAL::PALOLD: READ  | Block 2555 | Page 1
32  861004362088: PAL::PALOLD: READ  | C    11 | W     0 | D     7 | P     0 | B   511 | P     1
33  861004362088: FTL::PageMapping: READ  | LPN 5 | 861008457768 - 861208731788 (200274020)
34  861004362088: ICL::GenericCache: READ  | Cache miss at (7, 87) | 861008457768 - 861209029288 (200571520)
35  861004362088: ICL::GenericCache: READ  | Prefetch done
36  861004362088: ICL: READ  | LCA 432 + 1 | 861004362088 - 861006961598 (2599510)
37  861007304098: HIL::NVMe: NVM     | READ  | NSID 1     | 1B0 + 1 | 861004362088 - 861007304098 (2942010)
38  861007444114: HIL::NVMe: INTR    | MSI-X sent | vector 2
39  861008760480: HIL::NVMe: CQ 3    | Completion Queue Head Doorbell | Item count in queue 1 -> 0 | head 432 -> 433 | tail 433

As shown at line 6, a prefetch was triggered because the prefetch condition was satisfied. The GenericCache module of SimpleSSD then reads either one superpage or all pages from all planes (this behavior is configurable).

At lines 13 and 21, you can see that the latencies calculated by PAL have the same value (118,317,824 ps). These two requests target different channels (channel 0 and channel 11, respectively), so they can operate simultaneously. At line 26, the latency increases to 130,068,352 ps: channel 0 is still occupied by the previous request, so PAL detects a channel conflict, which adds about 11.75 us. At line 34, the request targeting die 7 of channel 11 is delayed by about 82 us (200,571,520 ps versus 118,317,824 ps) because of the previous I/Os targeting dies 0 to 6.

After the prefetch, GenericCache returns the request, as shown at line 36. The latency is quite small because the data for the request already exists in the cache (see the cache hit at line 5).

By analyzing the debug log, you can determine the characteristics of each I/O and which component consumes a large portion of the I/O latency.

SimpleSSD Statistics

For SimpleSSD-FullSystem, you can find the statistics in m5out/stats.txt. For SimpleSSD-Standalone, statistics are not yet supported (as of v2.0-rc3).

Here is an example of the statistics. This file was generated after performing I/O using fio with --direct=1 --ioengine=libaio --bs=4K --iodepth=1 --rw=read --time_based --runtime=5s.
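
For reference, the complete fio invocation looked roughly like this (a sketch; the target device and job name are taken from the script at the end of this page):

fio --direct=1 --ioengine=libaio --bs=4K --iodepth=1 --rw=read \
    --filename=/dev/nvme0n1 --name=test --time_based --runtime=5s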

 1  system.pci_nvme.command_count                  413480                       # Total number of NVMe command handled
 2  system.pci_nvme.read.request_count             413480                       # Read request count
 3  system.pci_nvme.read.bytes                 1693614080                       # Read data size in byte
 4  system.pci_nvme.read.busy                1653776508704                       # Device busy time when read
 5  system.pci_nvme.write.request_count                 0                       # Write request count
 6  system.pci_nvme.write.bytes                         0                       # Write data size in byte
 7  system.pci_nvme.write.busy                          0                       # Device busy time when write
 8  system.pci_nvme.request_count                  413480                       # Total request count
 9  system.pci_nvme.bytes                      1693614080                       # Total data size in byte
10  system.pci_nvme.busy                     1653776508704                       # Total device busy time
11  system.pci_nvme.icl.generic_cache.read.request_count       413480                       # Read request count
12  system.pci_nvme.icl.generic_cache.read.from_cache       413224                       # Read requests that served from cache
13  system.pci_nvme.icl.generic_cache.write.request_count            0                       # Write request count
14  system.pci_nvme.icl.generic_cache.write.to_cache            0                       # Write requests that served to cache
15  system.pci_nvme.dram.energy.act          351568356.208920                       # ACT command energy (pJ)
16  system.pci_nvme.dram.energy.pre          437367776.459918                       # PRE command energy (pJ)
17  system.pci_nvme.dram.energy.rd           368277927.504022                       # RD command energy (pJ)
18  system.pci_nvme.dram.energy.wr           198140164.999281                       # WR command energy (pJ)
19  system.pci_nvme.dram.energy.act.standby  1992764428.068482                       # ACT standby energy (pJ)
20  system.pci_nvme.dram.energy.pre.standby  146846141074.948761                       # PRE standby energy (pJ)
21  system.pci_nvme.dram.energy.ref          2589408.045731                       # Refresh energy (pJ)
22  system.pci_nvme.dram.read.request_count        826662                       # Read request count
23  system.pci_nvme.dram.read.bytes            1695873008                       # Read data size in byte
24  system.pci_nvme.dram.write.request_count       413438                       # Write request count
25  system.pci_nvme.dram.write.bytes           1693442048                       # Write data size in byte
26  system.pci_nvme.dram.request_count            1240100                       # Total request count
27  system.pci_nvme.dram.bytes                 3389315056                       # Total data size in byte
28  system.pci_nvme.ftl.page_mapping.gc_count            0                       # Total GC count
29  system.pci_nvme.ftl.page_mapping.reclaimed_blocks            0                       # Total reclaimed blocks in GC
30  system.pci_nvme.pal.energy.read          2892824.972693                       # NAND power used on read transaction (uJ)
31  system.pci_nvme.pal.energy.program                  0                       # NAND power used on program transaction (uJ)
32  system.pci_nvme.pal.energy.erase                    0                       # NAND power used on erase transaction (uJ)
33  system.pci_nvme.pal.energy.total         2892824.972693                       # NAND power used on all transaction (uJ)
34  system.pci_nvme.cpu.icl0.busy            929296300000                       # CPU for ICL core 0 busy ticks
35  system.pci_nvme.cpu.icl0.insts.branch        31837960                       # CPU for ICL core 0 executed branch instructions
36  system.pci_nvme.cpu.icl0.insts.load         193508640                       # CPU for ICL core 0 executed load instructions
37  system.pci_nvme.cpu.icl0.insts.store         26049240                       # CPU for ICL core 0 executed store instructions
38  system.pci_nvme.cpu.icl0.insts.arithmetic    118668760                       # CPU for ICL core 0 executed arithmetic instructions
39  system.pci_nvme.cpu.icl0.insts.fp                   0                       # CPU for ICL core 0 executed floating point instructions
40  system.pci_nvme.cpu.icl0.insts.others         1653920                       # CPU for ICL core 0 executed other instructions
41  system.pci_nvme.cpu.ftl0.busy            373127795000                       # CPU for FTL core 0 busy ticks
42  system.pci_nvme.cpu.ftl0.insts.branch        21085338                       # CPU for FTL core 0 executed branch instructions
43  system.pci_nvme.cpu.ftl0.insts.load          74418840                       # CPU for FTL core 0 executed load instructions
44  system.pci_nvme.cpu.ftl0.insts.store         12403140                       # CPU for FTL core 0 executed store instructions
45  system.pci_nvme.cpu.ftl0.insts.arithmetic     39276610                       # CPU for FTL core 0 executed arithmetic instructions
46  system.pci_nvme.cpu.ftl0.insts.fp                   0                       # CPU for FTL core 0 executed floating point instructions
47  system.pci_nvme.cpu.ftl0.insts.others         2067190                       # CPU for FTL core 0 executed other instructions

Each line contains one statistic of one component. You can find the DRAMPower results at lines 15-21. This SSD model has three cores, and each core is dedicated to one firmware component (the excerpt shows the ICL and FTL cores). You can find the CPU statistics at lines 34-47.

This stats.txt file is generated by gem5, and statistics collection can be controlled from user space through the m5 utility.
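
For example, from inside the guest you can bracket a workload with these commands so that the dumped statistics cover only that workload (the same pattern appears in the script at the end of this page; the workload binary name is a placeholder):

m5 resetstats                 # reset all statistics to zero
./your_workload               # placeholder for the program under test
m5 dumpstats                  # append a statistics snapshot to m5out/stats.txt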

Analyze Log from Linux Kernel

To analyze kernel behavior, you need to modify the kernel source code. In SimpleSSD-FullSystem, we added a new M5 pseudo-instruction (M5OP) called m5print. This instruction prints a string to the debug log of gem5 (specified by the --debug-flag and --debug-file options).

Prepare Linux Kernel

In this section, we explain how to add the M5OP to the Linux Kernel. We assume the ARMv8 (arm64) architecture.

Copy the m5ops.h and m5op_flags.h files from the include/gem5/asm/generic directory of SimpleSSD-FullSystem to arch/arm64/include/asm of the Linux Kernel. Modify m5ops.h to change the include path of m5op_flags.h:

// From
#include <gem5/asm/generic/m5op_flags.h>

// To
#include <asm/m5op_flags.h>

Copy the m5op_arm_A64.S file from the util/m5 directory of SimpleSSD-FullSystem to arch/arm64/kernel of the Linux Kernel. Modify m5op_arm_A64.S to change the include path of m5ops.h:

// From
#include <gem5/asm/generic/m5ops.h>

// To
#include <asm/m5ops.h>

Add the following line to the Makefile in arch/arm64/kernel (the line numbers may differ):

# Object file lists.
arm64-obj-y             := debug-monitors.o entry.o irq.o fpsimd.o              \
                           entry-fpsimd.o process.o ptrace.o setup.o signal.o   \
                           sys.o stacktrace.o time.o traps.o io.o vdso.o        \
                           hyp-stub.o psci.o cpu_ops.o insn.o   \
                           return_address.o cpuinfo.o cpu_errata.o              \
                           cpufeature.o alternative.o cacheinfo.o               \
                           smp.o smp_spin_table.o topology.o smccc-call.o       \
                           m5op_arm_A64.o

extra-$(CONFIG_EFI)                     := efi-entry.o

Copy the m5ops.h file from the include/gem5 directory of SimpleSSD to include/linux of the Linux Kernel. Do not confuse it with the m5ops.h copied to the arch/arm64/include/asm directory earlier.

Modify include/linux/m5ops.h.

// From
#include <stdint.h>
#include <time.h>

#include <gem5/asm/generic/m5op_flags.h>

// To
#include <linux/types.h>
#include <linux/time.h>

#include <asm/m5op_flags.h>

You also need to modify the following lines:

// From lines 49 and 66
uint64_t m5_rpns();
void m5_load_symbol();

// To
uint64_t m5_rpns(void);
void m5_load_symbol(void);

Now you are ready to use the M5OP inside the Linux Kernel.

Modify Kernel

Now, let’s add an m5_print call at the point you want to analyze.

Objective: measure the latency of the MSI-X interrupt.

As SimpleSSD sends an MSI-X interrupt on command completion (see the debug log in the section above), you can measure the latency until the IRQ handler of the NVMe driver is called. Modify drivers/nvme/host/pci.c (the line numbers may differ):

static irqreturn_t nvme_irq(int irq, void *data)
{
        irqreturn_t result;
        struct nvme_queue *nvmeq = data;
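        /* Added: print a marker to the gem5 debug log (requires --debug-flag=M5Print) */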
        m5_print("nvme_irq", 8);
        spin_lock(&nvmeq->q_lock);
        nvme_process_cq(nvmeq);
        result = nvmeq->cqe_seen ? IRQ_HANDLED : IRQ_NONE;
        nvmeq->cqe_seen = 0;
        spin_unlock(&nvmeq->q_lock);
        return result;
}

Don’t forget to add #include <linux/m5ops.h> to the beginning of the file.

Build the Linux Kernel and copy it to $M5_PATH/binaries. For kernel build instructions, see Build Linux Kernel for gem5.
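
For example, a typical arm64 cross-build looks like this (a sketch assuming an aarch64-linux-gnu- toolchain; see the linked page for the authoritative steps):

make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- -j$(nproc) vmlinux
cp vmlinux $M5_PATH/binaries/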

Analyze result

Boot SimpleSSD-FullSystem with the newly built kernel image. As we only want to check the MSI-X latency, terminate the simulation with m5 exit right after the kernel boots.

Here are excerpts from m5out/debug.txt (the upper code block) and m5out/simout (the lower code block).

5361549000: system.terminal: attach terminal 0
436771014500: global: Log from guest: nvme_irq
436777643000: global: Log from guest: nvme_irq
437013694000: global: Log from guest: nvme_irq
437021837500: global: Log from guest: nvme_irq

436732767000: HIL::NVMe: BAR0    | WRITE | Controller Configuration
436732767000: HIL::NVMe: DMAPORT | WRITE | DATA 00460001
436732786500: HIL::NVMe: BAR0    | READ  | Controller Status
436732786500: HIL::NVMe: DMAPORT | READ  | DATA 00000001
436763277000: HIL::NVMe: BAR0    | READ  | Version
436763277000: HIL::NVMe: DMAPORT | READ  | DATA 00010201
436763284000: HIL::NVMe: BAR0    | READ  | Controller Capabilities
436763284000: HIL::NVMe: DMAPORT | READ  | DATA 0020002028010FFF
436765154500: HIL::NVMe: SQ 0    | Submission Queue Tail Doorbell | Item count in queue 0 -> 1 | head 0 | tail 0 -> 1
436768910422: HIL::NVMe: ADMIN   | Identify | CNS 1 | CNTID 0 | NSID 0
436770652297: HIL::NVMe: INTR    | MSI-X sent | vector 0
436771375000: HIL::NVMe: CQ 0    | Completion Queue Head Doorbell | Item count in queue 1 -> 0 | head 0 -> 1 | tail 1
436776059000: HIL::NVMe: SQ 0    | Submission Queue Tail Doorbell | Item count in queue 0 -> 1 | head 1 | tail 1 -> 2
436777053844: HIL::NVMe: ADMIN   | Set Features | Feature 7 | NSID 0
436777179719: HIL::NVMe: INTR    | MSI-X sent | vector 0
436778003500: HIL::NVMe: CQ 0    | Completion Queue Head Doorbell | Item count in queue 1 -> 0 | head 1 -> 2 | tail 2

In simout (the lower code block), the MSI-X interrupt was sent at 436,770,652,297 ps (about 436 ms). In debug.txt (the upper code block), the nvme_irq function was called at 436,771,014,500 ps. The difference is 362,203 ps (about 362 ns).

The analysis method using M5OP has very low overhead (a few CPU cycles) because it is based on a single instruction. You can add any number of m5_print calls to the Linux Kernel, or even to user-level programs.
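
For example, a user-level probe could be built against the same m5ops.h header and m5op_arm_A64.S stub from util/m5 (a sketch; the build command is an assumption, and the (string, length) signature follows the nvme_irq example above):

/* probe.c - minimal sketch of a user-level m5_print marker.
 * Build (assumption): aarch64-linux-gnu-gcc probe.c m5op_arm_A64.S -o probe
 */
#include "m5ops.h"  /* from util/m5; declares m5_print() */

int main(void)
{
        /* Appears in gem5's debug log as: Log from guest: probe-start */
        m5_print("probe-start", 11);
        return 0;
}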

To generate the debug.txt file, you should specify --debug-flag=M5Print and --debug-file=debug.txt when executing SimpleSSD-FullSystem.
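
For example (a sketch; substitute your usual command line from Execution of SimpleSSD-FullSystem for the trailing placeholder):

./build/ARM/gem5.opt --debug-flag=M5Print --debug-file=debug.txt \
    <your fs.py command line>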

Analyze Debug Log and Statistics from Simulator

You can also analyze the debug log and the statistics of the simulator itself.

SimpleSSD-FullSystem

Similar to Analyze SimpleSSD Log and Statistics above, gem5 uses --debug-file and stats.txt to print a detailed debug log of the hardware components and their statistics. The stats.txt file can be analyzed with the same method as for SimpleSSD. Here we explain how to use the --debug-flag and --debug-file options of gem5.

--debug-file specifies where to save the hardware debug log (see Execution of SimpleSSD-FullSystem). --debug-flag selects the hardware modules whose debug log printout is enabled.

The M5Print flag used in the section above selects the debug log messages coming from the guest OS running on gem5. You can list all available debug flags with the ./build/ARM/gem5.opt --debug-help command.

SimpleSSD-Standalone

For the standalone version, the log printout is quite limited (as of v2.0-rc3).

SimpleSSD Standalone v2.0
********** Begin of simulation **********
*** Statistics of Request Generator ***
Tick: 5000439428548
Time (ps): 79297718 - 5000439428548 (5000360130830)
I/O (bytes): 1393033216 (278586577.676911 B/s)
I/O (counts): 340096 (Read: 170048, Write: 170048)
*** End of statistics ***
*** Statistics of Event Engine ***
Simulation Tick (ps): 5000439428548
Host time duration (sec): 27.220903
Event handled: 10758702 (395236.771106 ops)
*** End of statistics ***
End of simulation @ tick 5000439428548

As shown in the above code block, the statistics of the request generator contain the total I/O count and bandwidth information (here, 1,393,033,216 bytes over about 5.0004 s of simulated time, or roughly 278.6 MB/s).

Analyze on user-level

If you are using SimpleSSD-FullSystem, you can analyze SimpleSSD with user-level benchmark programs. Compile the benchmark program (see Build User-Level Program for gem5) and run it inside gem5.

Here is an example script (passed with the --script option).

 1  #!/bin/bash
 2  cd /home/root
 3  ./stat 100 &
 4  m5 resetstats
 5  ./fio --direct=1 --ioengine=libaio --iodepth=1 --bs=4096 \
 6  --rw=read --filename=/dev/nvme0n1 --numjobs=1 --name=test --write_lat_log=test \
 7  --write_iops_log=test --write_bw_log=test \
 8  --log_avg_msec=10 --time_based --runtime=5s \
 9  --randseed=13425
10  m5 dumpstats
11  m5 writefile test_clat.1.log
12  m5 writefile test_slat.1.log
13  m5 writefile test_lat.1.log
14  m5 writefile test_iops.1.log
15  m5 writefile test_bw.1.log
16  m5 exit

At line 3, the stat program collects kernel/user CPU and DRAM utilization (through the /proc filesystem). At line 4, all gem5 statistics are reset to zero. fio is then executed, and the final statistics are dumped at line 10. After the fio result files are written back to the host at lines 11-15 (they appear in the m5out directory), the simulation is terminated at line 16.

After the simulation, you will have six files (one for the fio output and five fio log files).