getting system health state
Topic author
getting system health state
Is there a way for a program to read the "physical health state" of an Alpha (PWS, DS10) or Itanium (RX2620), such as CPU and memory temperature, fan state, and power state, so that I can monitor it and signal "extreme" situations BEFORE they occur?
The DS10 has RMC in snoop mode, which lets me display a status (when it reacts, which is not always the case), but that can only be done via the console.
Re: getting system health state
For my DS10L, I can get the CPU temperature by examining f$getsyi("TEMPERATURE_VECTOR"). The last 2 digits are a hex encoding of the temperature in Celsius. I haven't found any other CPUs on which this works. Since the 10L is a 1U box, keeping the thing cool is a problem.
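The decoding described above could be sketched in DCL roughly as follows. This is only an illustration: it assumes the lexical returns a non-empty hex string on your system (per the post, that has only been seen on a DS10L), and the symbol names and the 45-degree limit are made up for the example.

```dcl
$! Sketch only: decode the last two hex digits of TEMPERATURE_VECTOR
$ vec = F$GETSYI("TEMPERATURE_VECTOR")          ! hex string, DS10L per the post
$ hex = F$EXTRACT(F$LENGTH(vec) - 2, 2, vec)    ! last two hex digits
$ celsius = %X'hex'                             ! hex to integer, e.g. 1F gives 31
$ limit = 45                                    ! hypothetical threshold, choose your own
$ IF celsius .GE. limit
$ THEN
$   WRITE SYS$OUTPUT "CPU temperature ''celsius' C is at or above ''limit' C"
$ ELSE
$   WRITE SYS$OUTPUT "CPU temperature ''celsius' C"
$ ENDIF
```

Run from a periodic batch job, the WRITE could be swapped for a MAIL or REQUEST command to raise an alert.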
Re: getting system health state
This script used to work on my DS20E Alphas, but it no longer works on my current Integrity servers. I am currently using CockpitMgr, which does all the HW monitoring.
I added an extra check to the "env_check" script:
Code:
---
$ activecpu_cnt = f$getsyi("ACTIVECPU_CNT")
$ availcpu_cnt = f$getsyi("AVAILCPU_CNT")
----
$ gosub cpu_check
----
$cpu_check:
$ if availcpu_cnt .gt. 1
$ then
$   if activecpu_cnt .lt. availcpu_cnt
$   then
$     write sys$output "CPU is BAD : avail ''availcpu_cnt' / active ''activecpu_cnt'"
$   else
$     write sys$output "CPUs are Good"
$   endif
$ else
$   write sys$output "CPU is Good"
$ endif
$ return
----
Topic author
Re: getting system health state
Thanks, great script; it gives the information I needed. It would be nice if Cockpit manager were available to community members in some form.
Last edited by willemgrooters on Wed Jan 05, 2022 11:41 am, edited 1 time in total.
Re: getting system health state
What I'm doing is using DCL and Kermit scripting to connect to the iLO, issue a PS command to get the power supply status and temperature, and then check the output. Works for Alpha and I64.
Ian Miller
[ personal opinion only. usual disclaimers apply. Do not taunt happy fun ball ].
Re: getting system health state
Hello Ian,
Is there an example of how this can be done?
That would be very helpful.
The script from Neil is not working on my Integrity. Some "Thermal_vector" is not loaded.
Best Regards,
Christoph
Re: getting system health state
The DCL is like this:
Code:
$!
$! CHECK_TEMP - record temp and report if too high
$!
$!'F$VERIFY(0)
$START:
$ SET RESTART="START"
$ ON ERROR THEN GOTO END
$!
$ warming_temp = 29 ! it's a bit warm in here
$ warning_temp = 34 ! one less than the warning temp set on the iLo
$ critical_temp = 36 ! two less than the critical temp set on the iLo
$!
$ nodename = F$GETSYI("NODENAME")
$ script = F$SEARCH("CMANAGER:GET-''nodename'-TEMP.KERMIT")
$ IF script .EQS. ""
$ THEN
$ WRITE SYS$OUTPUT "**** UNKNOWN NODE ''nodename'"
$ EXIT 4
$ ENDIF
$ temprecord = "CLOGS:''nodename'_TEMPRATURE.CSV" ! record of temp sensor values - datetime,value
$!
$ SUBMIT/QUEUE='nodename'_SYS_BATCH/USER=SYSTEM/LOG=CLOGS:CHECK_'nodename'_TEMP.LOG/AFTER="+01:00"/RESTART CMANAGER:CHECK_TEMP.COM
$ @UTILS:PRUNE CLOGS:CHECK_'nodename'_TEMP.LOG 10
$ CKERMIT :== $UTILS:CKV300-I64-VMS831H1-UCX56.EXE ! Kermit - a blast from the past :-)
$ DEFINE/USER SYS$OUTPUT T.TMP
$ CKERMIT "-B" -y 'script'
$! pick out the line "Ambient temperature : 25 C"
$ SEARCH/OUTPUT=T2.TMP T.TMP "Ambient"
$ IF $SEVERITY .NE. 1 THEN GOTO TMPERR
$ OPEN/READ tf T2.TMP
$ READ tf line
$ CLOSE tf
$ temp = F$ELEMENT(1,":",F$EDIT(line,"COLLAPSE")) - "C"
$ now = F$ELEMENT(0,".",F$TIME()) ! get current datetime without fractional seconds
$ IF F$SEARCH(temprecord) .EQS. "" THEN COPY NL: 'temprecord'
$ OPEN/APPEND tr 'temprecord'
$ WRITE tr "''now',''temp'"
$ CLOSE tr
$!
$ DEFINE/NOLOG TCPIP$SMTP_FROM "Node-Checks@xxx.COM"
$!
$ IF temp .GE. critical_temp
$ THEN
$ MAIL NL: "@CMANAGER:SYSMAN.DIS" /SUBJECT="''nodename' has reached critical temperature and may shutdown"
$ GOTO END1
$ ENDIF
$ IF temp .GE. warning_temp
$ THEN
$ MAIL NL: "@CMANAGER:SYSMAN.DIS" /SUBJECT="''nodename' has reached warning temperature"
$ GOTO END1
$ ENDIF
$ IF temp .GE. warming_temp
$ THEN
$ MAIL NL: "@CMANAGER:SYSMAN.DIS" /SUBJECT="''nodename' is getting warm"
$ GOTO END1
$ ENDIF
$END1:
$!
$! Check fan status
$! [ search result of PS command for Fan status lines then those lines for status not Normal ]
$ SEARCH T.TMP "Fan Unit"/OUTPUT=T1.TMP
$ OPEN/READ T1 T1.TMP
$L1:
$ READ/END=T1END T1 LINE
$ LINE = F$EDIT(LINE,"COMPRESS")
$ unit = F$ELEMENT(1," ", LINE)
$ IF unit .EQS. "Unit" ! Check line is one we want
$ THEN
$ unitnum = F$ELEMENT(2," ", LINE)
$ fanstatus = F$ELEMENT(3," ", LINE)
$ IF fanstatus .NES. "Normal"
$ THEN
$ MAIL NL: "@CMANAGER:SYSMAN.DIS" /SUBJECT="''nodename' has problem with fan ''unitnum' status is ''fanstatus'"
$ ENDIF
$ ENDIF
$ GOTO L1
$T1END:
$ CLOSE T1
$!
$END:
$ DELETE/NOCONFIRM/NOLOG T.TMP;*
$ DELETE/NOCONFIRM/NOLOG T1.TMP;*
$ DELETE/NOCONFIRM/NOLOG T2.TMP;*
$ EXIT
$TMPERR:
$ MAIL NL: "@CMANAGER:SYSMAN.DIS" /SUBJECT="error getting temperature of ''nodename'"
$ IF F$SEARCH("T.TMP") .NES. "" THEN TYPE T.TMP
$ GOTO END
The Kermit script is like this:
Code:
set host host.company.com
input 3 MP login:
lineout Oper
input 3 MP password:
lineout somepassword
input 3 hpiLO->
lineout CM
set input echo on
input 3 CM:hpiLO->
lineout PS
input 3 CM:hpiLO->
output \x02
input 3 hpiLO->
lineout X
exit 0
Ian Miller
[ personal opinion only. usual disclaimers apply. Do not taunt happy fun ball ].