At work we’ve bought new NetApp filers, and I’ve been tasked with the installation. One of the new things I decided to implement is iSCSI monitoring; anything critical should be monitored anyway. This is where I hit the first hurdle: nobody seems to do iSCSI monitoring with Nagios. Google doesn’t show any check_iscsi examples, so I started investigating the available options. The main contender is iscsiadm from the open-iscsi stack. For monitoring purposes this package has a big problem: it uses daemons to save state, which probably speeds things up a fair bit but totally ruins it for monitoring.
Then I found http://code.google.com/p/freebsd-iscsi/, which after a careful look does exactly what’s needed: iSCSI sendtargets discovery and not much else. When adapting the software for Nagios I quickly noticed some segfaults, which warranted further investigation. It turns out that a single buffer of 1024 bytes is allocated for transmitting and receiving iSCSI messages, but the code that parses messages simply trusts the datasegmentlength field in the header, even when that value exceeds the 1024-byte buffer size. While looking for a simple solution I noticed the MaxRecvDataSegmentLength parameter, which can be used during login to limit the length of messages. This looked like a quick fix, so I decided to try it out. The NetApp filer I was talking to, however, didn’t appreciate the change.
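To illustrate the kind of check that was missing, here is a minimal Python sketch (not the freebsd-iscsi code, which I am not reproducing here). It relies on the header layout from RFC 3720: a 48-byte Basic Header Segment whose bytes 5 to 7 carry the 24-bit DataSegmentLength. The function names and the idea of comparing the padded length against a `capacity` argument are my own assumptions for the example.

```python
BHS_LENGTH = 48      # iSCSI Basic Header Segment size (RFC 3720)
BUFFER_SIZE = 1024   # the fixed buffer size mentioned above


def data_segment_length(bhs: bytes) -> int:
    """Return the 24-bit big-endian DataSegmentLength (BHS bytes 5..7)."""
    if len(bhs) < BHS_LENGTH:
        raise ValueError("header shorter than 48 bytes")
    return int.from_bytes(bhs[5:8], "big")


def padded_length(length: int) -> int:
    """Data segments are padded to a multiple of 4 bytes on the wire."""
    return (length + 3) & ~3


def checked_data_segment_length(bhs: bytes, capacity: int = BUFFER_SIZE) -> int:
    """Return the data length, or raise if it would not fit in `capacity`.

    Assumption: `capacity` is the room available for the data segment;
    the real client may share the buffer with the header.
    """
    length = data_segment_length(bhs)
    if padded_length(length) > capacity:
        raise ValueError(
            "DataSegmentLength %d exceeds buffer of %d bytes" % (length, capacity)
        )
    return length


def make_header(length: int) -> bytes:
    """Hypothetical helper: an otherwise empty BHS with the length set."""
    bhs = bytearray(BHS_LENGTH)
    bhs[5:8] = length.to_bytes(3, "big")
    return bytes(bhs)
```

With a check like this the client can refuse an oversized message, or read it in pieces, instead of writing past the end of its buffer.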
When creating a sendtargets response, a NetApp will include every network interface that has iSCSI enabled. This makes the response quite large if you have 10-15 active interfaces. So when I tried to stuff all this information, about 1517 bytes that normally travel in two 1500-byte Ethernet frames, into 512 bytes, I hit the following panic:
Tue Feb 12 14:19:54 CET [mgr.stack.string:notice]: Panic string: ../driver/scsitarget/iswt/iswti_text.c:624: Assertion failure. in process iswti_iscsip_thread on release NetApp Release 7.2.4L1
The sad thing with NetApp is that most bugs like this will cause a reboot, which interrupts operation quite heavily. The main problem with this one is that if iSCSI is enabled, there probably isn’t a way to stop anyone who can talk to the iSCSI service from rebooting your filers.
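To make the size of a sendtargets response more concrete, here is a rough Python sketch of how such a response is laid out and parsed. Per RFC 3720 the data segment is a list of NUL-terminated key=value pairs, with each TargetName followed by one TargetAddress per portal. The target name, the 192.0.2.x addresses and the portal group tag below are made up for illustration, and the helper names are hypothetical; this is not output from a real filer.

```python
from typing import Dict, List


def parse_sendtargets(data: bytes) -> Dict[str, List[str]]:
    """Map each TargetName to the TargetAddress values that follow it."""
    targets: Dict[str, List[str]] = {}
    current = None
    for item in data.split(b"\x00"):
        if not item:
            continue  # skip padding and the empty piece after the last NUL
        key, sep, value = item.decode("utf-8").partition("=")
        if not sep:
            continue  # not a key=value pair, ignore it
        if key == "TargetName":
            current = targets.setdefault(value, [])
        elif key == "TargetAddress" and current is not None:
            current.append(value)
    return targets


def build_sendtargets(name: str, addresses: List[str]) -> bytes:
    """Build an example response body (illustration only)."""
    pairs = ["TargetName=" + name]
    pairs += ["TargetAddress=" + address for address in addresses]
    return b"".join(pair.encode("utf-8") + b"\x00" for pair in pairs)


if __name__ == "__main__":
    # Hypothetical filer with 15 iSCSI-enabled interfaces.
    name = "iqn.1992-08.com.netapp:sn.example"
    addresses = ["192.0.2.%d:3260,1000" % i for i in range(1, 16)]
    body = build_sendtargets(name, addresses)
    parsed = parse_sendtargets(body)
    print("%d portals, %d bytes" % (len(parsed[name]), len(body)))
```

Even with a single target and short IPv4 addresses, 15 portals already produce more than 512 bytes of text, so a response like this cannot fit into one message of that size.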