Sunday, 24 May 2020

How to test your system infrastructure - Part 2


In this second part, I document ways to test how fast disk IO is and how fast the network is between 2 nodes when using standard TCP. All tests assume a Linux-based infrastructure.

Disk IO:

There are multiple ways to measure disk IO performance, that is, how fast we can read from and write to disk.
To report on how many read and write operations are being executed, we use the iostat tool.
iostat with the -d option reports disk IO utilization per device, in terms of the number of transfers per second and the amount of data read and written.
More information can be displayed with the extended statistics option -x.
Further details can be found in the iostat manual page: https://linux.die.net/man/1/iostat.
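As a rough illustration of where such numbers come from, the sketch below derives transfers per second and kB/s from two snapshots of /proc/diskstats. It is not iostat's actual implementation; the function names (parse_diskstats, io_rates, sample) and the output keys are made up for this example, and it assumes the usual field layout of /proc/diskstats, where sectors are counted in 512-byte units.

```python
import time

SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_diskstats(text):
    """Map device name -> (reads, sectors_read, writes, sectors_written)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or truncated lines
        name = fields[2]
        stats[name] = (int(fields[3]), int(fields[5]),
                       int(fields[7]), int(fields[9]))
    return stats


def io_rates(before, after, seconds):
    """Per-device transfers/s and kB/s between two parsed snapshots."""
    rates = {}
    for name, new in after.items():
        if name not in before:
            continue  # device appeared between the two snapshots
        old = before[name]
        reads, rsect, writes, wsect = (n - o for n, o in zip(new, old))
        rates[name] = {
            "tps": (reads + writes) / seconds,
            "kB_read/s": rsect * SECTOR_BYTES / 1024 / seconds,
            "kB_wrtn/s": wsect * SECTOR_BYTES / 1024 / seconds,
        }
    return rates


def sample(interval=1.0, path="/proc/diskstats"):
    """Take two snapshots `interval` seconds apart (Linux only)."""
    with open(path) as f:
        before = parse_diskstats(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_diskstats(f.read())
    return io_rates(before, after, interval)
```

On a Linux machine, sample(5).get("sda") would give the rates for sda over a 5 second window, assuming a device with that name exists.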

One way to list the block devices and partitions known to your system is to use the proc filesystem as below:

[root@feanor ~]# cat /proc/partitions
major minor  #blocks  name
   8        0   94753088 sda
   8        1    1048576 sda1
   8        2   93703168 sda2
  11        0      58360 sr0
 253        0   52428800 dm-0
 253        1    4063232 dm-1
 253        2   37203968 dm-2
[root@feanor ~]#

Knowing the device names is useful when using the iostat command.
Another way to list them is to use the lsblk command:

[root@feanor ~]# lsblk
NAME            MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda               8:0    0 90.4G  0 disk
├─sda1            8:1    0    1G  0 part /boot
└─sda2            8:2    0 89.4G  0 part
  ├─centos-root 253:0    0   50G  0 lvm  /
  ├─centos-swap 253:1    0  3.9G  0 lvm  [SWAP]
  └─centos-home 253:2    0 35.5G  0 lvm  /home
sr0              11:0    1   57M  0 rom 
[root@feanor ~]#


Another way to monitor IO on the system is the iotop command. iotop interactively shows which processes are reading from and writing to disk, along with the percentage of time each one spends swapping in and waiting on IO.

To focus on the disk IO speed on a given disk or filesystem, we have multiple commands to help measure how fast we can read and write.

Using hdparm, we can measure how fast we can read from a disk: the -T option times cached reads and the -t option times buffered reads from the device itself:

[root@feanor man]# hdparm -tT /dev/sda

/dev/sda:
 Timing cached reads:   23670 MB in  1.99 seconds = 11893.43 MB/sec
 Timing buffered disk reads: 1866 MB in  3.00 seconds = 621.44 MB/sec
[root@feanor man]#

Another way is to use the ioping command, which measures disk latency and can also print raw statistics:

[root@feanor man]# ioping /dev/sda1
4 KiB <<< /dev/sda1 (block device 1 GiB): request=1 time=45.8 ms (warmup)
4 KiB <<< /dev/sda1 (block device 1 GiB): request=2 time=46.8 ms
4 KiB <<< /dev/sda1 (block device 1 GiB): request=3 time=810.5 us
4 KiB <<< /dev/sda1 (block device 1 GiB): request=4 time=948.4 us
4 KiB <<< /dev/sda1 (block device 1 GiB): request=5 time=780.7 us
4 KiB <<< /dev/sda1 (block device 1 GiB): request=6 time=713.7 us
4 KiB <<< /dev/sda1 (block device 1 GiB): request=7 time=47.6 ms (slow)
^C
--- /dev/sda1 (block device 1 GiB) ioping statistics ---
6 requests completed in 97.6 ms, 24 KiB read, 61 iops, 245.8 KiB/s
generated 7 requests in 6.08 s, 28 KiB, 1 iops, 4.60 KiB/s
min/avg/max/mdev = 713.7 us / 16.3 ms / 47.6 ms / 21.9 ms
[root@feanor man]#
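To make the summary lines easier to read, the sketch below recomputes them from the individual request times above. It assumes that mdev is the population standard deviation of the samples, which reproduces the reported value for this run; the function name summarize and the dictionary keys are only illustrative.

```python
import math

# Latencies of requests 2-7 from the run above, in milliseconds.
# Request 1 is the warmup and is not part of the "6 requests completed" line.
latencies_ms = [46.8, 0.8105, 0.9484, 0.7807, 0.7137, 47.6]


def summarize(samples_ms, request_kib=4):
    """Recompute iops, KiB/s and min/avg/max/mdev from per-request latencies."""
    n = len(samples_ms)
    total_ms = sum(samples_ms)
    avg = total_ms / n
    # Assumption: mdev = sqrt(mean(x^2) - mean(x)^2), the population std deviation
    mdev = math.sqrt(sum(x * x for x in samples_ms) / n - avg * avg)
    seconds = total_ms / 1000
    return {
        "iops": n / seconds,
        "KiB/s": n * request_kib / seconds,
        "min": min(samples_ms),
        "avg": avg,
        "max": max(samples_ms),
        "mdev": mdev,
    }


summary = summarize(latencies_ms)
print(summary)
```

The two slow requests of about 47 ms dominate both the average and the deviation, which is why the average of 16.3 ms says little about the typical sub-millisecond request.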

A poor man's way to test disk performance is the Linux dd command.
dd prints the average throughput (bytes per second) it saw while executing the request.
You might want to run it multiple times to get a more useful average.

The below is a write test:

[root@feanor ~]# dd if=/dev/zero of=./tempfile bs=10K count=409600 status=progress conv=fdatasync
3872133120 bytes (3.9 GB) copied, 6.004911 s, 645 MB/s
409600+0 records in
409600+0 records out
4194304000 bytes (4.2 GB) copied, 8.79669 s, 477 MB/s
[root@feanor ~]# dd if=/dev/zero of=./tempfile bs=10K count=409600 status=progress conv=fdatasync
3910645760 bytes (3.9 GB) copied, 7.023097 s, 557 MB/s
409600+0 records in
409600+0 records out
4194304000 bytes (4.2 GB) copied, 9.18262 s, 457 MB/s
[root@feanor ~]#

and this one is a read test:

[root@feanor ~]# dd if=./tempfile of=/dev/null  bs=10K count=409600  status=progress
4128276480 bytes (4.1 GB) copied, 3.000355 s, 1.4 GB/s
409600+0 records in
409600+0 records out
4194304000 bytes (4.2 GB) copied, 3.0407 s, 1.4 GB/s
[root@feanor ~]# dd if=./tempfile of=/dev/null  bs=10K count=409600  status=progress
3743528960 bytes (3.7 GB) copied, 3.002333 s, 1.2 GB/s
409600+0 records in
409600+0 records out
4194304000 bytes (4.2 GB) copied, 3.307 s, 1.3 GB/s
[root@feanor ~]#
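Averaging several dd runs by hand gets tedious, so here is a small sketch that pulls the final throughput out of captured dd output. It assumes the output format shown in these transcripts and only trusts the summary line that follows "records out", because the status=progress line looks the same but reports a partial transfer. The names dd_rates and mean_rate are hypothetical.

```python
import re

# Matches e.g. "4194304000 bytes (4.2 GB) copied, 8.79669 s, 477 MB/s"
SUMMARY = re.compile(r"^(\d+) bytes .*copied, ([\d.]+) s,")


def dd_rates(output):
    """Return bytes/s for each completed dd run found in the captured text."""
    rates = []
    previous = ""
    for line in output.splitlines():
        match = SUMMARY.match(line.strip())
        # Only the line right after "N+M records out" is the final summary
        if match and previous.strip().endswith("records out"):
            rates.append(int(match.group(1)) / float(match.group(2)))
        previous = line
    return rates


def mean_rate(output):
    """Average throughput in bytes/s over all runs in the captured text."""
    rates = dd_rates(output)
    return sum(rates) / len(rates)
```

Note that dd reports MB as 10^6 bytes, so dividing the result by 1e6 gives numbers comparable to the ones dd prints.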

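The above dd read tests were affected by the Linux disk cache. If we drop the cache before reading, we get results closer to the true disk performance. Note that only the first read after dropping the cache starts cold; later reads can be served from the cache again: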

[root@feanor ~]# echo 3 > /proc/sys/vm/drop_caches
[root@feanor ~]# dd if=./tempfile of=/dev/null  bs=10K count=409600  status=progress
3573360640 bytes (3.6 GB) copied, 3.003885 s, 1.2 GB/s
409600+0 records in
409600+0 records out
4194304000 bytes (4.2 GB) copied, 3.47044 s, 1.2 GB/s
[root@feanor ~]# dd if=./tempfile of=/dev/null  bs=10K count=409600  status=progress
3053260800 bytes (3.1 GB) copied, 3.004402 s, 1.0 GB/s
409600+0 records in
409600+0 records out
4194304000 bytes (4.2 GB) copied, 3.95025 s, 1.1 GB/s
[root@feanor ~]# dd if=./tempfile of=/dev/null  bs=10K count=409600  status=progress
3589294080 bytes (3.6 GB) copied, 3.000806 s, 1.2 GB/s
409600+0 records in
409600+0 records out
4194304000 bytes (4.2 GB) copied, 3.46599 s, 1.2 GB/s
[root@feanor ~]#

In my case, dropping the kernel disk caches didn't seem to make much difference.
One last test is to use the bonnie++ package.
First we need to install it:

[root@feanor ~]# yum search bonnie
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
 * epel: mirror.i3d.net
==================================================================================== N/S matched: bonnie =====================================================================================
bonnie++.x86_64 : Filesystem and disk benchmark & burn-in suite

  Name and summary matches only, use "search all" for everything.
[root@feanor ~]# yum install bonnie++.x86_64

Once bonnie++ is installed, we run the test as a non-root user and then check the output:

[sherif@feanor ~]$ bonnie++ -f -n 0 |tee bonnie.out
Writing intelligently...done
Rewriting...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
feanor           8G           864053  70 513975  69           1430897  86  7303 271
Latency                         469ms     549ms               500ms   47897us

1.97,1.97,feanor,1,1590328006,8G,,,,864053,70,513975,69,,,1430897,86,7303,271,,,,,,,,,,,,,,,,,,,469ms,549ms,,500ms,47897us,,,,,,
[sherif@feanor ~]$

The text output is not the prettiest, so bonnie++ comes with a nice tool to convert the output to HTML:

[sherif@feanor ~]$ cat bonnie.out |bon_csv2html >/tmp/bonnie.out.html 2>/dev/null
[sherif@feanor sherif]# firefox /tmp/bonnie.out.html

The HTML report is more readable than the text output.

Thus, it does seem that our disk is quite fast :)


Network:

To test network throughput and latency, we have a couple of tools to use.
For testing throughput, we can use the iperf tool.
To do the test, iperf needs to be installed on both nodes involved; it runs as a server on one node and as a client on the other.

To run iperf as a server we use the -s option:

sherif@fingon:~$ iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------


Then on the other node we run iperf as a client with the -c option. We need to provide the name of the server to connect to, and can optionally set the number of bytes to transmit with the -n option:

[root@feanor ~]#  iperf -n 10240000 -c fingon
------------------------------------------------------------
Client connecting to fingon, TCP port 5001
TCP window size:  280 KByte (default)
------------------------------------------------------------
[  3] local 192.168.56.104 port 48608 connected with 192.168.56.106 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 0.1 sec  9.77 MBytes  1.25 Gbits/sec
[root@feanor ~]#
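The idea behind such a throughput test can be sketched in a few lines: one side accepts a TCP connection and counts bytes, the other pushes a fixed amount of data and times it. The sketch below is not how iperf is implemented. For convenience it runs both ends in one process over the loopback interface, which is an assumption made for the demo; a real test between two nodes would run the sink on one machine and the sender on the other. The names run_sink and measure_throughput are hypothetical.

```python
import socket
import threading
import time


def run_sink(server_sock, result):
    """Accept one connection and count the bytes received until EOF."""
    conn, _ = server_sock.accept()
    total = 0
    with conn:
        while True:
            chunk = conn.recv(65536)
            if not chunk:
                break
            total += len(chunk)
    result["received"] = total


def measure_throughput(host="127.0.0.1", nbytes=10_240_000):
    """Send nbytes over TCP; return (bytes_received, megabits_per_second)."""
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.bind((host, 0))  # port 0: let the OS pick a free port
    server_sock.listen(1)
    port = server_sock.getsockname()[1]

    result = {}
    sink = threading.Thread(target=run_sink, args=(server_sock, result))
    sink.start()

    payload = b"\0" * 65536
    sent = 0
    start = time.perf_counter()
    with socket.create_connection((host, port)) as client:
        while sent < nbytes:
            chunk = payload[: nbytes - sent]
            client.sendall(chunk)
            sent += len(chunk)
    sink.join()  # wait until the receiver has seen EOF
    elapsed = time.perf_counter() - start
    server_sock.close()
    return result["received"], sent * 8 / elapsed / 1e6


if __name__ == "__main__":
    received, mbps = measure_throughput()
    print(f"{received} bytes received, {mbps:.1f} Mbits/sec")
```

The default of 10240000 bytes mirrors the -n value used in the iperf run above. Loopback numbers say nothing about the real network; they only show the measuring technique.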



Another way to test network throughput is the nuttcp tool. It is very similar to iperf and also works in a client-server model.

On the server side we use the -S option:

[root@feanor ~]# nuttcp -S --nofork

Then on the client side, we run nuttcp with the server hostname or IP address:

sherif@fingon:~$ nuttcp -i1 feanor
  166.2500 MB /   1.00 sec = 1394.2422 Mbps     0 retrans
  204.1250 MB /   1.00 sec = 1712.4804 Mbps     0 retrans
  215.9375 MB /   1.00 sec = 1811.5092 Mbps     0 retrans
  190.5000 MB /   1.00 sec = 1597.7822 Mbps     0 retrans
   91.1875 MB /   1.00 sec =  764.9232 Mbps     0 retrans
  180.5625 MB /   1.00 sec = 1514.8680 Mbps     0 retrans
  209.0625 MB /   1.00 sec = 1753.7416 Mbps     0 retrans
  204.3750 MB /   1.00 sec = 1713.3612 Mbps     0 retrans
  206.1250 MB /   1.00 sec = 1730.2680 Mbps     0 retrans
  176.6875 MB /   1.00 sec = 1481.2068 Mbps     0 retrans

 1844.8750 MB /  10.43 sec = 1483.6188 Mbps 11 %TX 43 %RX 0 retrans 1.10 msRTT
sherif@fingon:~$


For testing network latency, we use old-fashioned ping, which reports round-trip time statistics at the end of its run:

[root@feanor ~]# ping fingon
PING fingon (192.168.56.106) 56(84) bytes of data.
64 bytes from fingon (192.168.56.106): icmp_seq=1 ttl=64 time=0.628 ms
64 bytes from fingon (192.168.56.106): icmp_seq=2 ttl=64 time=1.26 ms
64 bytes from fingon (192.168.56.106): icmp_seq=3 ttl=64 time=1.39 ms
64 bytes from fingon (192.168.56.106): icmp_seq=4 ttl=64 time=1.02 ms
64 bytes from fingon (192.168.56.106): icmp_seq=5 ttl=64 time=1.12 ms
64 bytes from fingon (192.168.56.106): icmp_seq=6 ttl=64 time=1.16 ms
64 bytes from fingon (192.168.56.106): icmp_seq=7 ttl=64 time=1.15 ms
64 bytes from fingon (192.168.56.106): icmp_seq=8 ttl=64 time=1.21 ms
^C
--- fingon ping statistics ---
8 packets transmitted, 8 received, 0% packet loss, time 7029ms
rtt min/avg/max/mdev = 0.628/1.120/1.393/0.215 ms
[root@feanor ~]#


Using sar:

The last part of this post is dedicated to the good old sar system reporting tool.
sar offers a comprehensive set of statistics about the system's CPU, memory, network and IO activity.
sar reports data collected at various points in time, which can reveal useful patterns in the usage of system resources.
Below are a couple of examples:

[root@feanor ~]# sar -n DEV
Linux 3.10.0-862.el7.x86_64 (feanor)    05/24/2020      _x86_64_        (4 CPU)

08:15:02 AM       LINUX RESTART

08:20:02 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
08:30:01 AM    enp0s3      0.08      0.09      0.01      0.01      0.00      0.00      0.00
08:30:01 AM    enp0s8      0.00      0.00      0.00      0.00      0.00      0.00      0.00
08:30:01 AM    enp0s9      0.01      0.01      0.00      0.00      0.00      0.00      0.00
08:30:01 AM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
08:30:01 AM virbr0-nic      0.00      0.00      0.00      0.00      0.00      0.00      0.00
08:30:01 AM    virbr0      0.00      0.00      0.00      0.00      0.00      0.00      0.00
08:40:02 AM    enp0s3      0.06      0.06      0.00      0.01      0.00      0.00      0.00
08:40:02 AM    enp0s8      0.00      0.00      0.00      0.00      0.00      0.00      0.00
08:40:02 AM    enp0s9      0.13      0.01      0.01      0.00      0.00      0.00      0.07
08:40:02 AM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
08:40:02 AM virbr0-nic      0.00      0.00      0.00      0.00      0.00      0.00      0.00
08:40:02 AM    virbr0      0.00      0.00      0.00      0.00      0.00      0.00      0.00
08:50:01 AM    enp0s3      0.04      0.04      0.00      0.00      0.00      0.00      0.00
08:50:01 AM    enp0s8      0.00      0.00      0.00      0.00      0.00      0.00      0.00
08:50:01 AM    enp0s9      0.02      0.01      0.00      0.00      0.00      0.00      0.01
08:50:01 AM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
08:50:01 AM virbr0-nic      0.00      0.00      0.00      0.00      0.00      0.00      0.00
08:50:01 AM    virbr0      0.00      0.00      0.00      0.00      0.00      0.00      0.00
09:00:01 AM    enp0s3      0.02      0.03      0.00      0.00      0.00      0.00      0.00
09:00:01 AM    enp0s8      0.00      0.00      0.00      0.00      0.00      0.00      0.00
09:00:01 AM    enp0s9      0.01      0.01      0.00      0.00      0.00      0.00      0.00
09:00:01 AM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
09:00:01 AM virbr0-nic      0.00      0.00      0.00      0.00      0.00      0.00      0.00
09:00:01 AM    virbr0      0.00      0.00      0.00      0.00      0.00      0.00      0.00
09:10:01 AM    enp0s3      0.02      0.02      0.00      0.00      0.00      0.00      0.00
09:10:01 AM    enp0s8      0.00      0.00      0.00      0.00      0.00      0.00      0.00
09:10:01 AM    enp0s9      0.04      0.01      0.01      0.00      0.00      0.00      0.01
09:10:01 AM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
09:10:01 AM virbr0-nic      0.00      0.00      0.00      0.00      0.00      0.00      0.00
09:10:01 AM    virbr0      0.00      0.00      0.00      0.00      0.00      0.00      0.00
09:20:01 AM    enp0s3     10.32      1.67     14.53      0.11      0.00      0.00      0.00
09:20:01 AM    enp0s8      0.00      0.00      0.00      0.00      0.00      0.00      0.00
09:20:01 AM    enp0s9      0.01      0.01      0.00      0.00      0.00      0.00      0.01
.....


05:40:01 PM    virbr0      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:       enp0s3      0.12      0.12      0.03      0.01      0.00      0.00      0.00
Average:       enp0s8      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:       enp0s9    451.53    225.65    658.82    589.50      0.00      0.00      0.01
Average:           lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:    virbr0-nic      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:       virbr0      0.00      0.00      0.00      0.00      0.00      0.00      0.00



[root@feanor ~]# sar -u
Linux 3.10.0-862.el7.x86_64 (feanor)    05/24/2020      _x86_64_        (4 CPU)

08:15:02 AM       LINUX RESTART

08:20:02 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
08:30:01 AM     all      0.29      0.00      0.27      0.05      0.00     99.39
08:40:02 AM     all      0.09      0.00      0.11      0.02      0.00     99.78
08:50:01 AM     all      1.51      0.00      0.51      0.05      0.00     97.93
09:00:01 AM     all      1.67      0.00      1.29      0.75      0.00     96.29
09:10:01 AM     all      1.13      0.00      0.43      0.05      0.00     98.39
09:20:01 AM     all      0.40      0.00      0.60      0.19      0.00     98.81
09:30:01 AM     all      0.21      0.00      0.20      0.05      0.00     99.53
09:40:01 AM     all      0.39      0.00      1.60      0.17      0.00     97.85
Average:        all      0.71      0.00      0.62      0.17      0.00     98.50

09:44:57 AM       LINUX RESTART
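To look for usage patterns in a long report, the text can be parsed and sorted. The sketch below turns sar -u output like the above into a list of records and picks the interval with the least idle CPU. It assumes the 12-hour time format and dot decimal separators shown here (other locales would need adjusting), and the function names are made up for this example.

```python
def parse_sar_cpu(text):
    """Parse `sar -u` text into a list of dicts, one per sampled interval."""
    header = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if "CPU" in fields and "%idle" in fields:
            # Column names follow the CPU column in the header line
            header = fields[fields.index("CPU") + 1:]
            continue
        if header is None or "all" not in fields or fields[0] == "Average:":
            continue  # skip banners, restarts and the Average line
        cpu_at = fields.index("all")
        row = {"time": " ".join(fields[:cpu_at])}
        values = fields[cpu_at + 1:]
        row.update({name: float(v) for name, v in zip(header, values)})
        rows.append(row)
    return rows


def busiest_interval(rows):
    """Return the record with the lowest %idle value."""
    return min(rows, key=lambda row: row["%idle"])
```

Fed with the report above, busiest_interval points at the 09:00:01 AM sample, which is also the one with the highest %iowait.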


For a complete set of data, one could use the sar -A command, which reports a huge amount of information about the server for the current day.






References & good reads:

https://linuxhint.com/disk_activity_web_server/
https://wiki.archlinux.org/index.php/Benchmarking
https://www.cyberciti.biz/faq/howto-linux-unix-test-disk-performance-with-dd-command/
https://www.opsdash.com/blog/disk-monitoring-linux.html
https://haydenjames.io/linux-server-performance-disk-io-slowing-application/
https://www.unixmen.com/how-to-measure-disk-performance-with-fio-and-ioping/
https://fio.readthedocs.io/en/latest/index.html
https://dotlayer.com/how-to-use-fio-to-measure-disk-performance-in-linux/
https://linux-mm.org/Drop_Caches
https://books.google.nl/books?id=1nc5DwAAQBAJ&printsec=frontcover&hl=nl&source=gbs_ge_summary_r&cad=0#v=onepage&q=bonnie&f=false
https://www.cyberciti.biz/faq/ping-test-a-specific-port-of-machine-ip-address-using-linux-unix/
https://linoxide.com/monitoring-2/10-tools-monitor-cpu-performance-usage-linux-command-line/

Saturday, 16 May 2020

How to test your system infrastructure - Part 1

In this post I document various ways to test parts of an application infrastructure: mainly how to check CPU usage, how fast disk IO is, and how fast the network is between 2 nodes. All tests assume a Linux-based infrastructure.


CPU:

On Linux, the easiest way to check how much CPU is being used is the top command.
top is an interactive command; pressing 1 while top is running shows the CPU usage per core.
top can also be run in non-interactive (batch) mode as needed:

sherif@fingolfin:~$ top -b -n 1 |head
top - 14:22:23 up 52 min,  1 user,  load average: 0,29, 0,12, 0,06
Tasks: 175 total,   1 running, 129 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0,5 us,  0,3 sy,  0,0 ni, 98,9 id,  0,2 wa,  0,0 hi,  0,1 si,  0,0 st
KiB Mem :  6072348 total,  4880976 free,   434896 used,   756476 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used.  5398192 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2063 root      20   0  420504  94464  33488 S   2,3  1,6   0:25.45 Xorg
    1 root      20   0  225232   9024   6748 S   0,0  0,1   0:02.70 systemd
    2 root      20   0       0      0      0 S   0,0  0,0   0:00.00 kthreadd
sherif@fingolfin:~$

Another way to report on CPU usage is the iostat command:

[root@feanor ~]# iostat -c
Linux 3.10.0-862.el7.x86_64 (feanor)    05/16/2020      _x86_64_        (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.23    0.00    0.28    0.07    0.00   99.42

[root@feanor ~]#
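Percentages like these are derived from the tick counters on the aggregate cpu line of /proc/stat. The sketch below shows the arithmetic for two snapshots of that line; it is only an approximation of what tools such as iostat or top do, and they may group the counters differently. The names FIELDS, read_cpu_ticks and cpu_percentages are hypothetical.

```python
# First eight counters on the "cpu" line of /proc/stat, in order
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def read_cpu_ticks(stat_text):
    """Extract the aggregate CPU tick counters from /proc/stat content."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return dict(zip(FIELDS, map(int, parts[1:1 + len(FIELDS)])))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_percentages(before, after):
    """Share of each CPU state between two snapshots, in percent."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}  # no ticks elapsed
    return {name: 100.0 * ticks / total for name, ticks in deltas.items()}
```

On Linux, reading /proc/stat twice a few seconds apart and passing both results to cpu_percentages gives the usage for that window, whereas iostat -c without an interval reports averages since boot.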

Another way to benchmark CPU performance on the system is the sysbench package, as below:

sherif@fingolfin:~$ time sysbench --test=cpu --threads=6 run
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
sysbench 1.0.11 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 6
Initializing random number generator from current time
Prime numbers limit: 10000
Initializing worker threads...
Threads started!
CPU speed:
    events per second:  3886.37
General statistics:
    total time:                          10.0011s
    total number of events:              38872
Latency (ms):
         min:                                  0.62
         avg:                                  1.54
         max:                                 29.13
         95th percentile:                      8.74
         sum:                              59820.54
Threads fairness:
    events (avg/stddev):           6478.6667/112.54
    execution time (avg/stddev):   9.9701/0.03
real    0m10,014s
user    0m29,940s
sys    0m0,012s
sherif@fingolfin:~$

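The above test shows how much latency could be expected when running multiple threads on the system.
As the warning in the output says, the --test option is deprecated; newer sysbench versions take the test name directly, as in sysbench cpu --threads=6 run.
More info about the sysbench tool can be found on this page: https://linuxconfig.org/how-to-benchmark-your-linux-system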

Memory:


To measure how fast our system memory works, we can use the small tool mbw from: https://github.com/raas/mbw.
The tool measures memory bandwidth from user space, similar to what standard applications would see.
To compile the code on CentOS we follow the below:

[root@feanor ~]# git clone https://github.com/raas/mbw
Cloning into 'mbw'...
remote: Enumerating objects: 4, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 89 (delta 0), reused 1 (delta 0), pack-reused 85
Unpacking objects: 100% (89/89), done.
[root@feanor ~]# cd mbw
[root@feanor mbw]# ls -ltr
total 28
-rw-r--r--. 1 root root  423 May 16 16:12 README
-rw-r--r--. 1 root root  232 May 16 16:12 Makefile
-rw-r--r--. 1 root root 1255 May 16 16:12 mbw.1
-rw-r--r--. 1 root root 1640 May 16 16:12 mbw.spec
-rw-r--r--. 1 root root 8538 May 16 16:12 mbw.c
[root@feanor mbw]# make
cc     mbw.c   -o mbw
[root@feanor mbw]# ./mbw 512
Long uses 8 bytes. Allocating 2*67108864 elements = 1073741824 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 10 runs per test.
0       Method: MEMCPY  Elapsed: 0.08889        MiB: 512.00000  Copy: 5760.122 MiB/s
1       Method: MEMCPY  Elapsed: 0.09538        MiB: 512.00000  Copy: 5368.283 MiB/s
2       Method: MEMCPY  Elapsed: 0.09289        MiB: 512.00000  Copy: 5512.133 MiB/s
3       Method: MEMCPY  Elapsed: 0.09756        MiB: 512.00000  Copy: 5247.891 MiB/s
4       Method: MEMCPY  Elapsed: 0.09414        MiB: 512.00000  Copy: 5438.593 MiB/s
5       Method: MEMCPY  Elapsed: 0.08911        MiB: 512.00000  Copy: 5745.450 MiB/s
6       Method: MEMCPY  Elapsed: 0.08720        MiB: 512.00000  Copy: 5871.627 MiB/s
7       Method: MEMCPY  Elapsed: 0.09688        MiB: 512.00000  Copy: 5284.616 MiB/s
8       Method: MEMCPY  Elapsed: 0.09409        MiB: 512.00000  Copy: 5441.598 MiB/s
9       Method: MEMCPY  Elapsed: 0.09243        MiB: 512.00000  Copy: 5539.087 MiB/s
AVG     Method: MEMCPY  Elapsed: 0.09286        MiB: 512.00000  Copy: 5513.825 MiB/s
0       Method: DUMB    Elapsed: 0.25512        MiB: 512.00000  Copy: 2006.875 MiB/s
1       Method: DUMB    Elapsed: 0.23047        MiB: 512.00000  Copy: 2221.528 MiB/s
2       Method: DUMB    Elapsed: 0.22259        MiB: 512.00000  Copy: 2300.245 MiB/s
3       Method: DUMB    Elapsed: 0.23621        MiB: 512.00000  Copy: 2167.544 MiB/s
4       Method: DUMB    Elapsed: 0.21707        MiB: 512.00000  Copy: 2358.697 MiB/s
5       Method: DUMB    Elapsed: 0.22799        MiB: 512.00000  Copy: 2245.742 MiB/s
6       Method: DUMB    Elapsed: 0.22476        MiB: 512.00000  Copy: 2277.965 MiB/s
7       Method: DUMB    Elapsed: 0.22205        MiB: 512.00000  Copy: 2305.777 MiB/s
8       Method: DUMB    Elapsed: 0.22730        MiB: 512.00000  Copy: 2252.490 MiB/s
9       Method: DUMB    Elapsed: 0.22879        MiB: 512.00000  Copy: 2237.899 MiB/s
AVG     Method: DUMB    Elapsed: 0.22924        MiB: 512.00000  Copy: 2233.515 MiB/s
0       Method: MCBLOCK Elapsed: 0.09570        MiB: 512.00000  Copy: 5350.052 MiB/s
1       Method: MCBLOCK Elapsed: 0.10106        MiB: 512.00000  Copy: 5066.197 MiB/s
2       Method: MCBLOCK Elapsed: 0.09312        MiB: 512.00000  Copy: 5498.459 MiB/s
3       Method: MCBLOCK Elapsed: 0.09769        MiB: 512.00000  Copy: 5240.961 MiB/s
4       Method: MCBLOCK Elapsed: 0.09894        MiB: 512.00000  Copy: 5174.958 MiB/s
5       Method: MCBLOCK Elapsed: 0.09634        MiB: 512.00000  Copy: 5314.456 MiB/s
6       Method: MCBLOCK Elapsed: 0.09780        MiB: 512.00000  Copy: 5235.388 MiB/s
7       Method: MCBLOCK Elapsed: 0.09487        MiB: 512.00000  Copy: 5397.086 MiB/s
8       Method: MCBLOCK Elapsed: 0.09828        MiB: 512.00000  Copy: 5209.446 MiB/s
9       Method: MCBLOCK Elapsed: 0.09942        MiB: 512.00000  Copy: 5149.973 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.09732        MiB: 512.00000  Copy: 5260.924 MiB/s
[root@feanor mbw]#
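The MEMCPY method boils down to timing a bulk copy between two buffers. A minimal sketch of the same idea in Python follows; it is not a port of mbw, the function name copy_bandwidth is made up, and interpreter overhead means the numbers are not directly comparable to the C tool.

```python
import time


def copy_bandwidth(mib=64, runs=5):
    """Time bulk copies between two buffers; return MiB/s for each run."""
    size = mib * 1024 * 1024
    src = bytearray(size)
    dst = bytearray(size)
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        dst[:] = src  # same-size slice assignment copies the whole buffer
        elapsed = time.perf_counter() - start
        rates.append(mib / elapsed)
    return rates


if __name__ == "__main__":
    results = copy_bandwidth()
    for run, rate in enumerate(results):
        print(f"{run}\tCopy: {rate:.3f} MiB/s")
    print(f"AVG\tCopy: {sum(results) / len(results):.3f} MiB/s")
```

As with mbw, the buffer size should be kept well below the free memory of the machine, otherwise the measurement ends up timing swap instead of RAM.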

mbw is also available as a Debian package.
One cool test is to see what happens when the tool tries to allocate 4GB on the above system. That machine has only 4GB of memory, so allocating that much forces parts of mbw to be swapped out. We can see that in multiple ways. First, the bandwidth is orders of magnitude lower:

[root@feanor mbw]# ./mbw 2048
Long uses 8 bytes. Allocating 2*268435456 elements = 4294967296 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 10 runs per test.
0       Method: MEMCPY  Elapsed: 26.14064       MiB: 2048.00000 Copy: 78.345 MiB/s
1       Method: MEMCPY  Elapsed: 42.49331       MiB: 2048.00000 Copy: 48.196 MiB/s
2       Method: MEMCPY  Elapsed: 18.70199       MiB: 2048.00000 Copy: 109.507 MiB/s
3       Method: MEMCPY  Elapsed: 55.37665       MiB: 2048.00000 Copy: 36.983 MiB/s
4       Method: MEMCPY  Elapsed: 35.01051       MiB: 2048.00000 Copy: 58.497 MiB/s
5       Method: MEMCPY  Elapsed: 20.52362       MiB: 2048.00000 Copy: 99.787 MiB/s
6       Method: MEMCPY  Elapsed: 21.93620       MiB: 2048.00000 Copy: 93.362 MiB/s
7       Method: MEMCPY  Elapsed: 37.51056       MiB: 2048.00000 Copy: 54.598 MiB/s
8       Method: MEMCPY  Elapsed: 28.07473       MiB: 2048.00000 Copy: 72.948 MiB/s
9       Method: MEMCPY  Elapsed: 14.76706       MiB: 2048.00000 Copy: 138.687 MiB/s
AVG     Method: MEMCPY  Elapsed: 30.05353       MiB: 2048.00000 Copy: 68.145 MiB/s
0       Method: DUMB    Elapsed: 11.23370       MiB: 2048.00000 Copy: 182.309 MiB/s
1       Method: DUMB    Elapsed: 10.76112       MiB: 2048.00000 Copy: 190.315 MiB/s
2       Method: DUMB    Elapsed: 15.99955       MiB: 2048.00000 Copy: 128.004 MiB/s
3       Method: DUMB    Elapsed: 23.18597       MiB: 2048.00000 Copy: 88.329 MiB/s
4       Method: DUMB    Elapsed: 28.14035       MiB: 2048.00000 Copy: 72.778 MiB/s
5       Method: DUMB    Elapsed: 31.18035       MiB: 2048.00000 Copy: 65.682 MiB/s
6       Method: DUMB    Elapsed: 31.02135       MiB: 2048.00000 Copy: 66.019 MiB/s
7       Method: DUMB    Elapsed: 36.10925       MiB: 2048.00000 Copy: 56.717 MiB/s
8       Method: DUMB    Elapsed: 51.37134       MiB: 2048.00000 Copy: 39.867 MiB/s
9       Method: DUMB    Elapsed: 60.84004       MiB: 2048.00000 Copy: 33.662 MiB/s
AVG     Method: DUMB    Elapsed: 29.98430       MiB: 2048.00000 Copy: 68.302 MiB/s
0       Method: MCBLOCK Elapsed: 67.50246       MiB: 2048.00000 Copy: 30.340 MiB/s
1       Method: MCBLOCK Elapsed: 74.09162       MiB: 2048.00000 Copy: 27.641 MiB/s
2       Method: MCBLOCK Elapsed: 77.48624       MiB: 2048.00000 Copy: 26.430 MiB/s
3       Method: MCBLOCK Elapsed: 75.32009       MiB: 2048.00000 Copy: 27.191 MiB/s
4       Method: MCBLOCK Elapsed: 94.43207       MiB: 2048.00000 Copy: 21.688 MiB/s
5       Method: MCBLOCK Elapsed: 96.87246       MiB: 2048.00000 Copy: 21.141 MiB/s
6       Method: MCBLOCK Elapsed: 102.09089      MiB: 2048.00000 Copy: 20.061 MiB/s
7       Method: MCBLOCK Elapsed: 95.71384       MiB: 2048.00000 Copy: 21.397 MiB/s
8       Method: MCBLOCK Elapsed: 89.24437       MiB: 2048.00000 Copy: 22.948 MiB/s
9       Method: MCBLOCK Elapsed: 103.73286      MiB: 2048.00000 Copy: 19.743 MiB/s
AVG     Method: MCBLOCK Elapsed: 87.64869       MiB: 2048.00000 Copy: 23.366 MiB/s
[root@feanor mbw]#

Using the iotop tool we can see that the mbw tool is swapped out.
And using the smem tool we can see that the mbw tool is swapping:

[root@feanor mbw]# smem |head -1; smem|grep mbw
  PID User     Command                         Swap      USS      PSS      RSS
 4144 root     grep --color=auto mbw              0      140      326      704
 4005 root     ./mbw 2048                    584004  3610400  3610400  3610408
[root@feanor mbw]#


To collect system-wide memory information, we can use the top command or vmstat:

[root@feanor mbw]# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  2 2046444 112560      0 125508 8596 5262 13055  5449 2268  827  1  9 76 14  0

[root@feanor mbw]# vmstat -s
      4043552 K total memory
      3824396 K used memory
      2936656 K active memory
       782444 K inactive memory
       107972 K free memory
            0 K buffer memory
       111184 K swap cache
      4063228 K total swap
      2026724 K used swap
      2036504 K free swap
        16716 non-nice user cpu ticks
           20 nice user cpu ticks
        89870 system cpu ticks
       926212 idle cpu ticks
       171799 IO-wait cpu ticks
            0 IRQ cpu ticks
        21806 softirq cpu ticks
            0 stolen cpu ticks
    160271685 pages paged in
     66888866 pages paged out
     26373088 pages swapped in
     16147227 pages swapped out
     27846320 interrupts
     10151325 CPU context switches
   1589638134 boot time
         4248 forks
[root@feanor mbw]#

The vmstat tool reads the kernel file /proc/meminfo, which contains more information about system memory usage, as can be seen below:

[root@feanor mbw]# cat /proc/meminfo
MemTotal:        4043552 kB
MemFree:         3622752 kB
MemAvailable:    3570904 kB
Buffers:               0 kB
Cached:           110172 kB
SwapCached:        52480 kB
Active:            66004 kB
Inactive:         146504 kB
Active(anon):      49768 kB
Inactive(anon):    61024 kB
Active(file):      16236 kB
Inactive(file):    85480 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       4063228 kB
SwapFree:        3716148 kB
Dirty:                20 kB
Writeback:             0 kB
AnonPages:         64504 kB
Mapped:            24988 kB
Shmem:              8356 kB
Slab:              79168 kB
SReclaimable:      32396 kB
SUnreclaim:        46772 kB
KernelStack:        6752 kB
PageTables:        32416 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     6085004 kB
Committed_AS:    2550352 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      105404 kB
VmallocChunk:   34359537660 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      131008 kB
DirectMap2M:     4063232 kB
[root@feanor mbw]#
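Because /proc/meminfo is plain "Key: value kB" text, it is easy to consume from a script. The sketch below parses it into a dictionary and computes how much memory is in use, defined here (as an assumption for this example) as MemTotal minus MemAvailable. The function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into {key: integer value}."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        parts = rest.split()
        if parts:
            # Values are in kB, except the HugePages_* counters (plain counts)
            info[key.strip()] = int(parts[0])
    return info


def used_percent(info):
    """Memory in use, as a percentage of MemTotal."""
    return 100.0 * (info["MemTotal"] - info["MemAvailable"]) / info["MemTotal"]
```

On Linux, parse_meminfo(open("/proc/meminfo").read()) returns the live values; for the output above, used_percent gives roughly 11.7%.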

Another nice small tool to report memory usage on a Linux system is free:

[root@feanor mbw]# free -h
              total        used        free      shared  buff/cache   available
Mem:           3.9G        235M        3.4G        8.7M        230M        3.4G
Swap:          3.9G        335M        3.5G
[root@feanor mbw]#


Here the output is similar to what we get from top: we can see how much swap is used, how much memory Linux uses for buffers and disk caching, and how much memory is available to the system.