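CRS fails to start after a network cable swap: clssnmsendingthread: sending status msg to all nodes
Before my colleague swapped the network cable, I shut node 2 down cleanly. Once the swap was done he let me know, and I found that node 2 would not come up no matter what. I went through the log below and a number of forum posts without solving it. I tried rebooting, unplugging and re-plugging the cable, checking that the storage was healthy, and remounting the storage. One post suggested the OCR information might have changed, but I had no earlier backup and did not dig further in that direction.
In the end I still could not fix it, mainly because my skills are limited: without pinpointing the exact problem, I did not dare change things at random.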
20xx-12-16 19:01:05.792: [ cssd][3786819328]clssnmsendingthread: sending join msg to all nodes
20xx-12-16 19:01:05.792: [ cssd][3786819328]clssnmsendingthread: sent 5 join msgs to all nodes
20xx-12-16 19:01:06.295: [gipchalo][3811858176] gipchalowerprocessnode: no valid interfaces found to node for 7286464 ms, node 0x7fecd0028450 { host 'myrac1', haname 'css_myrac-cluster', srcluid fac66ea4-f1a960af, dstluid 00000000-00000000 numinf 0, contigseq 0, lastack 0, lastvalidack 0, sendseq [249 : 249], createtime 7037424, sentregister 1, localmonitor 1, flags 0x4 }
20xx-12-16 19:01:06.303: [ cssd][3789973248]clssgmwaitoneventvalue: after cminfo state val 3, eval 1 waited 0
20xx-12-16 19:01:06.420: [ cssd][3799754496]clssnmvdhbvalidatencopy: node 1, myrac1, has a disk hb, but no network hb, dhb has rcfg 471981092, wrtcnt, 211618800, lats 7286584, lastseqno 211618797, uniqueness 1576485880, timestamp 1576494065/8540734
20xx-12-16 19:01:06.435: [ cssd][3804591872]clssnmvdhbvalidatencopy: node 1, myrac1, has a disk hb, but no network hb, dhb has rcfg 471981092, wrtcnt, 211618802, lats 7286594, lastseqno 211618799, uniqueness 1576485880, timestamp 1576494066/8541524
20xx-12-16 19:01:07.304: [ cssd][3789973248]clssgmwaitoneventvalue: after cminfo state val 3, eval 1 waited 0
20xx-12-16 19:01:07.421: [ cssd][3799754496]clssnmvdhbvalidatencopy: node 1, myrac1, has a disk hb, but no network hb, dhb has rcfg 471981092, wrtcnt, 211618803, lats 7287584, lastseqno 211618800, uniqueness 1576485880, timestamp 1576494066/8541734
20xx-12-16 19:01:07.435: [ cssd][3804591872]clssnmvdhbvalidatencopy: node 1, myrac1, has a disk hb, but no network hb, dhb has rcfg 471981092, wrtcnt, 211618805, lats 7287604, lastseqno 211618802, uniqueness 1576485880, timestamp 1576494067/8542524
20xx-12-16 19:01:08.304: [ cssd][3789973248]clssgmwaitoneventvalue: after cminfo state val 3, eval 1 waited 0
20xx-12-16 19:01:08.422: [ cssd][3799754496]clssnmvdhbvalidatencopy: node 1, myrac1, has a disk hb, but no network hb, dhb has rcfg 471981092, wrtcnt, 211618806, lats 7288584, lastseqno 211618803, uniqueness 1576485880, timestamp 1576494067/8542734
20xx-12-16 19:01:08.436: [ cssd][3804591872]clssnmvdhbvalidatencopy: node 1, myrac1, has a disk hb, but no network hb, dhb has rcfg 471981092, wrtcnt, 211618808, lats 7288604, lastseqno 211618805, uniqueness 1576485880, timestamp 1576494068/8543524
20xx-12-16 19:01:09.304: [ cssd][3789973248]clssgmwaitoneventvalue: after cminfo state val 3, eval 1 waited 0
20xx-12-16 19:01:09.422: [ cssd][3799754496]clssnmvdhbvalidatencopy: node 1, myrac1, has a disk hb, but no network hb, dhb has rcfg 471981092, wrtcnt, 211618809, lats 7289584, lastseqno 211618806, uniqueness 1576485880, timestamp 1576494068/8543744
20xx-12-16 19:01:09.437: [ cssd][3804591872]clssnmvdhbvalidatencopy: node 1, myrac1, has a disk hb, but no network hb, dhb has rcfg 471981092, wrtcnt, 211618811, lats 7289604, lastseqno 211618808, uniqueness 1576485880, timestamp 1576494069/8544524
20xx-12-16 19:01:09.803: [ cssd][3785242368]clssnmrcfgmgrthread: local join
20xx-12-16 19:01:09.803: [ cssd][3785242368]clssnmlocaljoinevent: begin on node(2), waittime 193000
20xx-12-16 19:01:09.803: [ cssd][3785242368]clssnmlocaljoinevent: set curtime (7289964) for my node
20xx-12-16 19:01:09.803: [ cssd][3785242368]clssnmlocaljoinevent: scanning 32 nodes
20xx-12-16 19:01:09.803: [ cssd][3785242368]clssnmlocaljoinevent: node myrac1, number 1, is in an existing cluster with disk state 3
20xx-12-16 19:01:09.803: [ cssd][3785242368]clssnmlocaljoinevent: takeover aborted due to cluster member node found on disk
20xx-12-16 19:01:10.305: [ cssd][3789973248]clssgmwaitoneventvalue: after cminfo state val 3, eval 1 waited 0
20xx-12-16 19:01:10.423: [ cssd][3799754496]clssnmvdhbvalidatencopy: node 1, myrac1, has a disk hb, but no network hb, dhb has rcfg 471981092, wrtcnt, 211618812, lats 7290584, lastseqno 211618809, uniqueness 1576485880, timestamp 1576494069/8544744
20xx-12-16 19:01:10.437: [ cssd][3804591872]clssnmvdhbvalidatencopy: node 1, myrac1, has a disk hb, but no network hb, dhb has rcfg 471981092, wrtcnt, 211618814, lats 7290604, lastseqno 211618811, uniqueness 1576485880, timestamp 1576494070/8545524
20xx-12-16 19:01:10.794: [ cssd][3786819328]clssnmsendingthread: sending join msg to all nodes
20xx-12-16 19:01:10.794: [ cssd][3786819328]clssnmsendingthread: sent 5 join msgs to all nodes
20xx-12-16 20:36:02.919: [ cssd][2756265728]clssgmupdategrpdata: grock(clsn.onsnetproc.master), commissioner(-1/0)
20xx-12-16 20:36:02.919: [ cssd][2756265728]clssgmhandlegrockrcfgupdate: grock(clsn.onsnetproc.master), updateseq(118), status(0), sendresp(1)
20xx-12-16 20:36:02.920: [ cssd][2756265728]clssgmtestsetlastgrockupdate: grock(clsn.onsnetproc.master), updateseq(118) msgseq(119), lastupdt<0x7fbb58031e10>, ignoreseq(0)
20xx-12-16 20:36:02.920: [ cssd][2756265728]clssgmgrockoptagprocess: request to commission member(1) using key(1) for grock(clsn.onsnetproc.master)
20xx-12-16 20:36:02.920: [ cssd][2756265728]clssgmupdategrpdata: grock(clsn.onsnetproc.master), commissioner(1/1)
20xx-12-16 20:36:02.920: [ cssd][2756265728]clssgmhandlegrockrcfgupdate: grock(clsn.onsnetproc.master), updateseq(119), status(0), sendresp(1)
20xx-12-16 20:36:02.921: [ cssd][2756265728]clssgmtestsetlastgrockupdate: grock(clsn.onsnetproc.master), updateseq(119) msgseq(120), lastupdt<0x7fbb5804d490>, ignoreseq(0)
20xx-12-16 20:36:02.921: [ cssd][2756265728]clssgmupdategrpdata: grock(clsn.onsnetproc.master), private data(2052), incarn(40)
20xx-12-16 20:36:02.921: [ cssd][2756265728]clssgmhandlegrockrcfgupdate: grock(clsn.onsnetproc.master), updateseq(120), status(0), sendresp(1)
20xx-12-16 20:36:02.922: [ cssd][2756265728]clssgmtestsetlastgrockupdate: grock(clsn.onsnetproc.master), updateseq(120) msgseq(121), lastupdt<0x7fbb5803dee0>, ignoreseq(0)
20xx-12-16 20:36:02.922: [ cssd][2756265728]clssgmgrockoptagprocess: request to commission member(-1) using key(1) for grock(clsn.onsnetproc.master)
20xx-12-16 20:36:02.922: [ cssd][2756265728]clssgmupdategrpdata: grock(clsn.onsnetproc.master), commissioner(-1/0)
20xx-12-16 20:36:02.922: [ cssd][2756265728]clssgmhandlegrockrcfgupdate: grock(clsn.onsnetproc.master), updateseq(121), status(0), sendresp(1)
20xx-12-16 20:36:05.064: [ cssd][2753111808]clssnmsendingthread: sending status msg to all nodes
20xx-12-16 20:36:05.064: [ cssd][2753111808]clssnmsendingthread: sent 5 status msgs to all nodes
20xx-12-16 20:36:09.065: [ cssd][2753111808]clssnmsendingthread: sending status msg to all nodes
20xx-12-16 20:36:09.065: [ cssd][2753111808]clssnmsendingthread: sent 4 status msgs to all nodes
20xx-12-16 20:36:14.066: [ cssd][2753111808]clssnmsendingthread: sending status msg to all nodes
…
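Can you tell from the log that the bond configuration had changed? I did not notice it or work it out at the time. In the end my colleague admitted he had changed the bond! Wasn't the plan just to swap one cable and tidy up the wiring? I told him to change it back and try, and sure enough, after a restart everything was back to normal.
Before that, I had also restarted things more or less at random, and node 2 still would not come up: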
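In hindsight, the log does hint at the private network rather than the storage: node 2 keeps reporting that myrac1 "has a disk hb, but no network hb", and gipchalo reports "no valid interfaces found to node". A minimal Python sketch that counts such lines in a CSSD log is shown below. The function name `summarize_cssd_log` and the short explanations attached to each message are my own (they are an interpretation, not Oracle's wording); only the matched substrings come from the log excerpt above.

```python
from collections import Counter

# Substrings copied from the log excerpt in this post.
# The explanations are my own reading of them, not official Oracle text.
SYMPTOMS = {
    "has a disk hb, but no network hb":
        "peer is visible through the voting disk but not over the interconnect",
    "no valid interfaces found to node":
        "no usable private interface towards the peer node",
    "takeover aborted due to cluster member node found on disk":
        "this node cannot form its own cluster because the peer is still alive",
}


def summarize_cssd_log(lines):
    """Count how often each known symptom appears in the given log lines."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()  # real logs use mixed case, the excerpt is lowercase
        for needle in SYMPTOMS:
            if needle in lowered:
                counts[needle] += 1
    return counts


def print_summary(lines):
    """Print each symptom that occurred, with its count and a short hint."""
    for needle, count in summarize_cssd_log(lines).items():
        print(f"{count:5d} x {needle}  ->  {SYMPTOMS[needle]}")
```

To use it, pass the lines of the CSSD log (for example `print_summary(open(path))`, where `path` is wherever ocssd.log lives on your system). Many "disk hb, but no network hb" lines alongside a healthy storage check suggest looking at the interconnect interfaces, which in this case means the bond, before touching the OCR.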
[root@myrac2 bin]# ./crsctl query crs activeversion
oracle cluster registry initialization failed accessing oracle cluster registry device: proc-26: error while accessing the physical storage
ora-15077: could not locate asm instance serving a required diskgroup
[root@myrac2 bin]# ./ocrcheck
prot-602: failed to retrieve data from the cluster registry
proc-26: error while accessing the physical storage
ora-15077: could not locate asm instance serving a required diskgroup
[grid@myrac2 ~]$ cd /u01/app/11.2.0/grid/bin/
[grid@myrac2 bin]$ srvctl start nodeapps -n myrac2
prcr-1070 : failed to check if resource ora.gsd is registered
cannot communicate with crsd
prcr-1070 : failed to check if resource ora.net1.network is registered
cannot communicate with crsd
prcr-1035 : failed to look up crs resource myrac2 for ora.cluster_vip.type
prcr-1068 : failed to query resources
cannot communicate with crsd
prcr-1070 : failed to check if resource ora.ons is registered
cannot communicate with crsd
[grid@myrac2 bin]$ srvctl start asm -n myrac2
prcr-1070 : failed to check if resource ora.asm is registered
cannot communicate with crsd
[grid@myrac2 bin]$ srvctl start database -d testdb2
prcd-1027 : failed to retrieve database testdb2
prcr-1115 : failed to find entities of type resource that match filters ((name == ora.testdb2.db) && (type == ora.database.type)) and contain attributes version,oracle_home,database_type
cannot communicate with crsd
[grid@myrac2 bin]$
The bond on node 2 had been modified, and it is clearly different from node 1:
[root@myrac2 11.2.0]# service network status
configured devices:
lo bond0 bond1 em1 em2 em3 em4
currently active devices:
lo em1 em2 em3 em4 bond0 bond1
[root@myrac2 11.2.0]#
Node 1:
[root@myrac1 ~]# service network status
configured devices:
lo bond0 em1 em2 em3 em4 idrac
currently active devices:
lo em1 em2 em3 bond0
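
The difference between the two outputs is easy to miss by eye. Below is a small Python sketch that parses two `service network status` outputs and lists the devices that appear on only one node. The function names `parse_network_status` and `diff_nodes` are hypothetical helpers made up for this illustration, and the parser assumes the output has exactly the layout shown above.

```python
def parse_network_status(text):
    """Parse `service network status` output into configured/active device sets."""
    result = {"configured": set(), "active": set()}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        lowered = line.lower()
        if lowered.startswith("configured devices"):
            current = "configured"
        elif lowered.startswith("currently active devices"):
            current = "active"
        elif line and current and not line.startswith("["):
            # A device list line; lines starting with "[" are shell prompts.
            result[current].update(line.split())
    return result


def diff_nodes(a, b):
    """Return, per section, the devices present on only one of the two nodes."""
    return {
        key: {
            "only_a": sorted(a[key] - b[key]),
            "only_b": sorted(b[key] - a[key]),
        }
        for key in ("configured", "active")
    }


node2 = parse_network_status("""configured devices:
lo bond0 bond1 em1 em2 em3 em4
currently active devices:
lo em1 em2 em3 em4 bond0 bond1
""")
node1 = parse_network_status("""configured devices:
lo bond0 em1 em2 em3 em4 idrac
currently active devices:
lo em1 em2 em3 bond0
""")
print(diff_nodes(node2, node1))
```

With the two outputs from this post, node 2 turns out to have an extra `bond1` (configured and active) and an active `em4`, while `idrac` is configured only on node 1. The extra bond on node 2 is the change my colleague had made.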
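Leaving technical skill aside, what this incident shows is that cooperation between colleagues is sometimes more important. If you are not careful, you dig a pit for someone else, or fall into one that someone else dug for you.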