Discussion:
Need ideas regarding identifying what could cause Error: 12546 [ASE 12.5.4 ESD#8, Sol x64]
(too old to reply)
mh
2009-01-26 10:24:39 UTC
Permalink
Raw Message
Hi guys!



(Environment info + error log is found at the end of this message)



Last week the HW of one of our ASE:s had a powerfailure. When the box
came back online and during recovery of one of the databases, we got
error 12546.



Ref:

http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.dc00729_15
00/html/errMessageAdvRes/CHDFDEDD.htm



Disks (NFS) are on a NetApp filer (the filer did not loose power during
the incident). Data and log devices are read/written in the same NetApp
volume (NFS). The dataserver was fairly loaded when the incident
occurred. There was also a fulldump running for the database in question
when the incident occurred.



We contacted Sybase Technical Support and we were advised to restore the
database from backup. Luckily we have a warm-standby setup so we had
already earlier switched the active side and reconfigured our
applications.



Now I need ideas regarding identifying what could cause an error like
this. Either I'm doing something wrong (like some hidden cache somewhere
that I am not aware of) or there is a new issue related to CR320745.
Sybase Technical Support already verified I'm not hitting this specific
CR (corrected long time ago) but I guess there could be a new unknown
problem.



Ref:

http://search.sybase.com/kbx/changerequests?bug_id=320745



Thanks

/mh







We run:

Adaptive Server Enterprise/12.5.4/EBF 15446 ESD#8/P/Solaris AMD64/OS
5.10/ase1254/2105/64-bit/OPT/Sat Mar 22 04:05:46 2008



Active traceflags:

833, 7408, 7409



NFS mount options (as recommended by NetApp):

vers=3,proto=tcp,hard,intr



The related error log entry:

00:00000:00018:2009/01/20 08:28:20.68 server Recovering database
'<MY_DATABASE>'.

00:00000:00018:2009/01/20 08:28:20.68 server Started estimating
recovery log boundaries for database '<MY_DATABASE>'.

00:00000:00018:2009/01/20 08:28:27.09 server Recovering last page of
log (was 12178002, now 12178001)

00:00000:00018:2009/01/20 08:28:27.09 server Database '<MY_DATABASE>',
checkpoint=(12177652, 3), first=(12177652, 3), last=(12178001, 21).

00:00000:00018:2009/01/20 08:28:27.09 server Completed estimating
recovery log boundaries for database '<MY_DATABASE>'.

00:00000:00018:2009/01/20 08:28:27.09 server Started ANALYSIS pass for
database '<MY_DATABASE>'.

00:00000:00018:2009/01/20 08:28:27.09 server Completed ANALYSIS pass
for database '<MY_DATABASE>'.

00:00000:00018:2009/01/20 08:28:27.09 server Started REDO pass for
database '<MY_DATABASE>'. The total number of log records to process is
5744.

00:00000:00018:2009/01/20 08:29:03.81 server Redo pass of recovery has
processed 557 committed and 0 aborted transactions.

00:00000:00018:2009/01/20 08:29:03.81 server Completed REDO pass for
database '<MY_DATABASE>'.

00:00000:00018:2009/01/20 08:29:03.81 server Recovery of database
'<MY_DATABASE>' will undo incomplete nested top actions.

00:00000:00018:2009/01/20 08:29:03.81 server Started UNDO pass for
database '<MY_DATABASE>'. The total number of log records to process is
70.

00:00000:00018:2009/01/20 08:29:03.83 server Error: 12546, Severity:
18, State: 1

00:00000:00018:2009/01/20 08:29:03.83 server During undo the timestamp
to be assigned to a page is less than the current page timestamp. Page =
3096978, object id = 473049690, page timestamp = (0000 7cb948ae), new
page timestamp = (0000 7cb948a9).

00:00000:00018:2009/01/20 08:29:03.83 server Error: 12546, Severity:
18, State: 1

00:00000:00018:2009/01/20 08:29:03.83 server During undo the timestamp
to be assigned to a page is less than the current page timestamp. Page =
3096978, object id = 473049690, page timestamp = (0000 7cb948ae), new
page timestamp = (0000 7cb948a9).

00:00000:00018:2009/01/20 08:29:03.87 server Error: 12546, Severity:
18, State: 1

00:00000:00018:2009/01/20 08:29:03.87 server During undo the timestamp
to be assigned to a page is less than the current page timestamp. Page =
2102951, object id = 473049690, page timestamp = (0000 7cb948ad), new
page timestamp = (0000 7cb948aa).

00:00000:00018:2009/01/20 08:29:03.87 server Error: 12546, Severity:
18, State: 1

00:00000:00018:2009/01/20 08:29:03.87 server During undo the timestamp
to be assigned to a page is less than the current page timestamp. Page =
2102951, object id = 473049690, page timestamp = (0000 7cb948ad), new
page timestamp = (0000 7cb948aa).

00:00000:00018:2009/01/20 08:29:06.01 server Undo pass of recovery has
processed 1 incomplete transactions.

00:00000:00018:2009/01/20 08:29:06.01 server Completed UNDO pass for
database '<MY_DATABASE>'.

00:00000:00018:2009/01/20 08:29:06.01 server Started recovery
checkpoint for database '<MY_DATABASE>'.

00:00000:00018:2009/01/20 08:29:10.19 server Completed recovery
checkpoint for database '<MY_DATABASE>'.

00:00000:00018:2009/01/20 08:29:11.88 server Started filling free space
info for database '<MY_DATABASE>'.

01:00000:00018:2009/01/20 08:33:46.42 server Completed filling free
space info for database '<MY_DATABASE>'.

01:00000:00018:2009/01/20 08:33:46.43 server Started cleaning up the
default data cache for database '<MY_DATABASE>'.

01:00000:00018:2009/01/20 08:33:48.26 server Completed cleaning up the
default data cache for database '<MY_DATABASE>'.

01:00000:00018:2009/01/20 08:33:48.50 server Checking external objects.

01:00000:00018:2009/01/20 08:33:48.53 server The transaction log in the
database '<MY_DATABASE>' will use I/O size of 4 Kb.

01:00000:00018:2009/01/20 08:33:50.45 server The transaction log in the
database '<MY_DATABASE>' will use I/O size of 4 Kb.

01:00000:00018:2009/01/20 08:33:52.38 server Database '<MY_DATABASE>'
is now online.
Bret Halford [Sybase]
2009-01-26 16:22:35 UTC
Permalink
Raw Message
Is the corrupt database still around? The first thing
I'd want to look at are the timestamps on the last log records
that made it to disk.

The two most likely possibilities in my mind are

some form of i/o caching (ASE wrote the log record, received
confirmation it had been written, so went ahead and wrote the
corresponding data page - which somehow made it do disk, but the log
page didn't)

or


some as yet unknown bug in how recovery determines the timestamp
value to start using to generate new records (in the case of
an abrupt shutdown, this is determined by a scan of the log
to find the highest value used so far).

-bret
mh
2009-01-27 09:25:37 UTC
Permalink
Raw Message
Bret Halford [Sybase] wrote:
Is the corrupt database still around? The first thing I'd want to look
at are the timestamps on the last log records that made it to disk.


Hi Bret,

Thanks for helping me out. Yes, the database is still there. However
looking at the chain of events during the failure it seems that the
transaction log was dumped just after the database was (semi) recovered:

08:13 power failure
08:27 ASE restart
08:28 recovery of <MY_DATABASE> starts
08:29 Error 12546 printed in error log
08:33 recovery of <MY_DATABASE> is complete
08:37 transaction dump <MY_DATABASE>
08:45 full dump <MY_DATABASE> (manually aborted)

I take it you're suggesting the "dbcc log" command to examine the
transaction log. I my case I would have to restore the database from
backup with "until_time" about "08:14" and then online to recover the
database. Do you agree? Or could I use "until_rid" based on some info
given in the error log?

Thanks
/Matt
Bret Halford [Sybase]
2009-01-27 16:22:21 UTC
Permalink
Raw Message
Post by mh
Is the corrupt database still around? The first thing I'd want to look
at are the timestamps on the last log records that made it to disk.
Hi Bret,
Thanks for helping me out. Yes, the database is still there. However
looking at the chain of events during the failure it seems that the
08:13 power failure
08:27 ASE restart
08:28 recovery of <MY_DATABASE> starts
08:29 Error 12546 printed in error log
08:33 recovery of <MY_DATABASE> is complete
08:37 transaction dump <MY_DATABASE>
08:45 full dump <MY_DATABASE> (manually aborted)
I take it you're suggesting the "dbcc log" command to examine the
transaction log. I my case I would have to restore the database from
backup with "until_time" about "08:14" and then online to recover the
database. Do you agree? Or could I use "until_rid" based on some info
given in the error log?
Thanks
/Matt
Dump tran typically can't truncate the last 20 or so records, so the
beginning of a truncated log would contain the tail end of the previous.
However, I must say I'm surprised that the db wasn't marked suspect
by recovery, which would have prevented truncating the log.

-bret
mh
2009-01-28 09:43:35 UTC
Permalink
Raw Message
From: Bret Halford [Sybase] wrote:
Dump tran typically can't truncate the last 20 or so records, so the
beginning of a truncated log would contain the tail end of the previous.

Hi Bret,

It seems the current log doesn't contain any older records. I will try
to load the old backups with until_time 08:14 and examine the log using
"dbcc log" just for the fun of it.

/Matt


$ head dbcc_log.log
LOG SCAN DEFINITION:
Database id : 9
Forward scan: starting at begining of log

$ egrep '(BEGINXACT|ENDXACT|time=)' dbcc_log.log | head -5
ENDXACT (12178218,0) sessionid=12178217,16
endstat=COMMIT time=Jan 20 2009 8:33:31:593AM
BEGINXACT (12178218,1) sessionid=12178218,1
xstat=XBEG_ENDXACT,XBEG_USERINFO_EXT,
name=$user_transaction time=Jan 20 2009 8:33:31:593AM

$ grep " ts=" dbcc_log.log | head -5
old ts=0x0000 0x79235a2a new ts=0x0000 0x7cb94c0a
old ts=0x0000 0x6c121457 new ts=0x0000 0x7cb94c0c
old ts=0x0000 0x7b32d015 new ts=0x0000 0x7cb94c0e
old ts=0x0000 0x73b39864 new ts=0x0000 0x7cb94c10
old ts=0x0000 0x79fd3c00 new ts=0x0000 0x7cb94c12
mh
2009-01-28 11:01:57 UTC
Permalink
Raw Message
I just wrote:
I will try to load the old backups [...]


I decided I will try to reproduce the problem instead by simulating what
happened on the day we had the power failure.

/Matt
mh
2009-02-09 21:42:42 UTC
Permalink
Raw Message
FYI: We did not manage to reproduce the problem.

/Matt

Derek Asirvadem
2009-01-27 11:44:04 UTC
Permalink
Raw Message
My two cents. We have two corrupt pages, so we are looking for the
cause, not verifying a corrupt page report.
Post by Bret Halford [Sybase]
some form of i/o caching (ASE wrote the log record, received
confirmation it had been written, so went ahead and wrote the
corresponding data page - which somehow made it do disk, but the log
page didn't)
Quite reasonably likely in the case of NFS. If the /nfs filesystem was
not hosted directly from the machine ASE resides on, then definitely
likely.
Post by Bret Halford [Sybase]
some as yet unknown bug in how recovery determines the timestamp
value to start using to generate new records (in the case of
an abrupt shutdown, this is determined by a scan of the log
to find the highest value used so far).
Highly unlikely in the case of 12.5.4

----------

Re "cause" and how to prevent it. I would not deploy ASE on a [not
directly hosted] NFS fs. If I wanted a fast channel through the I/O
subsystem on Solaris, I would use raw devices and fibre (to direct disk
but most people these days go for SAN). There is no comparison between
the recoverability of raw devices vs /fs, let alone /nfs. The thread
is of great interest, but the exercise would merely prove that the most
probable reason, was in fact the reason, that caused the corruption in
two pages of the database, in spite of a dataserver that is obsessive
about recoverabilty. Personally [once you have experienced the latency
of each via Solaris utilities] I wouldn't spend the time identifying
the size of the nail and the moment is time that it punctured my cheap
tyre; I would just buy 5 serious tyres.

Out of interest, do a dbcc page (dbid, 3096978) & (2102951) and tell us
the object_name(). One hundred Euros says they are data not log pages.
--
Regards
Derek Asirvadem
Director / Senior Sybase DBA / Information Architect
Sybase BSA [OEM/VAR] Partner
Copyright © 2008 Software Gems Pty Ltd
mh
2009-01-27 13:48:45 UTC
Permalink
Raw Message
Derek Asirvadem wrote:
Out of interest, do a dbcc page (dbid, 3096978) & (2102951) and tell us
the object_name().

Thanks Derek,
I have attached the output of the dbcc page. The object_name() is
"link_history".

/Matt
Derek Asirvadem
2009-01-28 08:25:10 UTC
Permalink
Raw Message
Post by mh
I have attached the output of the dbcc page. The object_name() is
"link_history".
That was just to prove it was Bret's first scenario.
--
Cheers
Derek
Senior Sybase DBA / Information Architect
Copyright © 2008 Software Gems Pty Ltd
--
With the financial meltdown, consolidating many databases into one
server and managing mixed load is a demand. Ask people who have been
doing it for years.
Loading...