Using monitoring tables & trace/audit for system analysis



Pavel Císař, IBPhoenix

Firebird Conference 2014, Prague

_images/license.png

Thanks for sponsoring

_images/conf2014.png

Problem domain

Whenever you need Firebird under microscope

Difference between mon. tables and trace/audit

Monitoring tables

Monitoring tables

For what are they good for?

Limitations and Known Issues

Working with monitoring tables

The information available

MON$DATABASE

MON$ATTACHMENTS

MON$TRANSACTIONS

MON$STATEMENTS (prepared statements)

Columns TRANSACTION_ID and TIMESTAMP contain valid values for active statements only.

Colum SQL_TEXT is empty for BLR requests (for example some from ISQL).

MON$CALL_STACK

Not very interesting/useful.

Useful in rare cases when long runing SQL statement is PSQL and you want to know where exactly is stuck in call chain.

MON$IO_STATS

MON$RECORD_STATS (record-level statistics)

MON$MEMORY_USAGE

MON$MEMORY_USAGE (cont.)

Not all records have non-zero values. On the whole, only MON$DATABASE and memory-bound objects point to non-zero “allocated” values. Small allocations are not allocated at this level, being redirected to the database memory pool instead.

MON$CONTEXT_VARIABLES

Useful operations (categories)

Finding out what's going on right now

Finding out - Using FDB

Finding out - PowerConsole

Finding out - PowerConsole examples

1

_images/pwc1.png

2

_images/pwc2.png

Other useful operations

Monitoring database parameters

Monitoring attachment parameters

Monitoring memory consumption

Detecting long running queries and transactions

Finding the GC blocking transaction (and application)

Page cache utilisation

Data access/manipulation patterns

Cancel a running query or terminate a client

It makes sense to wrap them into an application

Introducing FBMon

Flexible printout of data

Output configuration - example

[transactions]
; 'collect' : default columns
collect   = row_id as #, id, attachment.id, timestamp, top,
            isolation_mode,  lock_timeout, isactive(),
            isreadonly()
; 'collect_x' : numbered column groups
collect_1 = snapshot.id, snapshot.timestamp
collect_2 = id, attachment.id
collect_3 = timestamp
collect_4 = snapshot.timestamp - this.timestamp as duration
collect_5 = top, isolation_mode, lock_timeout, isactive(), isreadonly()
collect_6 = count(statements), count(variables)
collect_7 = iostats.backouts, iostats.purges, iostats.expunges

1

_images/fbmon1.png

2

_images/fbmon2.png

Logging data

Log configuration - example

[log_trans_count]
dataset  = db
collect  = snapshot.id, snapshot.timestamp, next_transaction,
           diff(next_transaction) as trans_count
filename = %(tr_logs)s/trans_count.log
format   = list
mode     = append_rotate

[log_statement_count]
dataset  = monitor
collect  = snapshot.id, snapshot.timestamp,
           count(this.statements) as st_count,
           count([x for x in this.statements
                  if not x.transaction]) as pr_count
filename = %(here)s/st_count.log
format   = csv_hdr
mode     = append_rotate

Log visualisation - example

_images/ipqt1.png

User-defined checks

Check configuration - examples

[check_long_running_transaction]
dataset = transactions
; Report any trans. longer than 60 minutes but do not report OAT
check   = (vars.oat.id != this.id) and
          (snapshot.timestamp > this.timestamp +
           datetime.timedelta(minutes=60))
vars    = mon.get_transaction(mon.db.oat) as oat
actions = report
message = "Long running transaction %%d (%%s) from %%s:%%s" %%
          (this.id, snapshot.timestamp - this.timestamp,
           this.attachment.remote_address,
           this.attachment.remote_process)

[check_sweep]
dataset = db
check   = (this.sweep_interval == 0) and
          (this.ost - this.oit >= 100000)
actions = sweep, report
message = "Manual sweep needed (gap %%d)" %% (this.ost - this.oit)

And action!

Action scheduling

Action lifetime

Action configuration - examples

[action_sweep]
class    = program
run_mode = async
command  = /opt/firebird/bin/gfix -sweep -user %(user)s
           -password %(password)s %(database)s
calm     = 10

[action_alert]
class    = report
template = ALERT: %%(message)s
remind   = 5

[action_terminate_client]
class   = terminate
mode    = attachment
delay   = 10
timeout = 1

What could be done with checks & actions

That's all for monitoring tables...

Thanks for sponsoring

_images/conf2014.png

Trace & audit services

Trace & audit

For what it is good for?

User trace vs. System Audit

Limitations and Known Issues

Trace output example

2014-05-28T11:21:04.2360 (5776:0000ED5470) EXECUTE_PROCEDURE_FINISH
      TEST_DB (ATT_10, SERVER:NONE, ISO88591, TCPv4:192.168.1.5)
      D:\TEST_DB\App\app.exe:2644
      (TRA_1596, READ_COMMITTED | REC_VERSION | WAIT | READ_WRITE)

Procedure SOMEPROC:
param0 = varchar(10), "XXX"
param1 = double precision, "313210"
param2 = double precision, "24"
param3 = integer, "3"

    1 ms, 15 fetch(es)

Table      Natural Index Update Insert Delete Backout Purge Expunge
*******************************************************************
TABLE_001              1
TABLE_002              1
TABLE_003              1

Working with user trace sessions (and Audit)

The information available

Information about Trace sessions

2014-05-28T11:21:03.4560 (5776:0000000000ED67A0) TRACE_INIT
      SESSION_1 Firebird Audit

--- Session 1 is suspended as its log is full ---
2014-05-23T11:01:24.8080 (3720:0000000000EFD9E8) TRACE_INIT
      SESSION_1

2014-05-23T11:01:24.8080 (3720:0000000000EFD9E8) TRACE_FINI
      SESSION_1

Connections

2014-05-28T11:21:03.4870 (5776:0000000000ED5470) ATTACH_DATABASE
      TEST_DB (ATT_10, SERVER:NONE, ISO88591, TCPv4:192.168.1.5)
      D:\TEST\App\MyApp.exe:2644

2014-09-24T14:46:15.0350 (2453:2a04910) UNAUTHORIZED ATTACH_DATABASE
      /home/db/employee25.fdb (ATT_0, sysdba, NONE, TCPv4:127.0.0.1)
      /opt/firebird/bin/isql:8723

2014-05-23T11:01:24.8080 (3720:0000000000EFD9E8) DETACH_DATABASE
      TEST_DB (ATT_8, SERVER:NONE, ISO88591, TCPv4:192.168.1.5)
      D:\TEST\App\MyApp.exe:6072

Transactions

2014-05-23T11:00:28.6160 (3720:0000000000EFD9E8) START_TRANSACTION
      TEST_DB (ATT_8, SERVER:NONE, ISO88591, TCPv4:192.168.1.5)
      D:\TEST\App\MyApp.exe:6072
        (TRA_1568, READ_COMMITTED | REC_VERSION | WAIT | READ_WRITE)

2014-05-23T11:00:29.9570 (3720:0000000000EFD9E8) COMMIT_TRANSACTION
      TEST_DB (ATT_8, SERVER:NONE, ISO88591, TCPv4:192.168.1.5)
      D:\TEST\App\MyApp.exe:6072
        (TRA_1568, READ_COMMITTED | REC_VERSION | WAIT | READ_WRITE)
    0 ms, 1 read(s), 1 write(s), 1 fetch(es), 1 mark(s)

Statements

2014-05-28T11:21:03.7680 (5776:0000000000ED5470) PREPARE_STATEMENT
      TEST_DB (ATT_10, SERVER:NONE, ISO88591, TCPv4:192.168.1.5)
      D:\TEST\App\MyApp.exe:2644
        (TRA_1596, READ_COMMITTED | REC_VERSION | WAIT | READ_WRITE)

Statement 110:
-------------------------------------------------------------------------------
EXECUTE PROCEDURE PROC_001('37720', '3', '1', 'LIL')
  269 ms

Triggers

2014-05-28T11:21:03.7680 (5776:0000ED5470) EXECUTE_TRIGGER_START
      TEST_DB (ATT_10, SERVER:NONE, ISO88591, TCPv4:192.168.1.5)
      D:\TEST\App\MyApp.exe:2644
        (TRA_1596, READ_COMMITTED | REC_VERSION | WAIT | READ_WRITE)
      BI_TRIGGER_001 FOR TABLE_001 (BEFORE INSERT)

2014-05-28T11:21:03.7680 (5776:0000ED5470) EXECUTE_TRIGGER_FINISH
      TEST_DB (ATT_10, SERVER:NONE, ISO88591, TCPv4:192.168.1.5)
      D:\TEST\App\MyApp.exe:2644
        (TRA_1596, READ_COMMITTED | REC_VERSION | WAIT | READ_WRITE)
      BI_TRIGGER_001 FOR TABLE_001 (BEFORE INSERT)
    0 ms, 1 fetch(es), 1 mark(s)

Stored procedures

2014-05-28T11:21:03.7680 (5776:00000ED5470) EXECUTE_PROCEDURE_FINISH
      TEST_DB (ATT_10, SERVER:NONE, ISO88591, TCPv4:192.168.1.5)
      D:\TEST\App\MyApp.exe:2644
        (TRA_1596, READ_COMMITTED | REC_VERSION | WAIT | READ_WRITE)

Procedure PROC_001:
param0 = varchar(50), "758755"
param1 = varchar(10), "LIL"

    0 ms

Errors

Basically the same thing that's written to firebird.log with some additinal information.

2014-09-24T13:58:48.7230 (2453:04910) ERROR AT jrd8_attach_database
      /home/db/BADDB.FDB (ATT_0, sysdba, NONE, TCPv4:127.0.0.1)
      /opt/firebird/bin/isql:8174
335544335 : database file appears corrupt (/home/db/BADDB.FDB)
335544650 : wrong page type
335544403 : page 0 is of wrong type (expected 3, found 1)

Sweep

2014-09-24T13:49:17.1970 (2453:0x7fed02a04910) SWEEP_START
      /home/db.fdb (ATT_127, SYSDBA:NONE, NONE, TCPv4:127.0.0.1)
      /opt/firebird/bin/gfix:7892

Transaction counters:
      Oldest interesting       1063
      Oldest active            1064
      Oldest snapshot          1064
      Next transaction         1065

2014-09-24T13:49:17.2240 (2453:0x7fed02a04910) SWEEP_PROGRESS
      /home/db.fdb (ATT_127, SYSDBA:NONE, NONE, TCPv4:127.0.0.1)
      /opt/firebird/bin/gfix:7892
    0 ms, 5 fetch(es)

BLR & DYN

You will never need those unless you're Firebird core developer or GPRE user.

Note: BLR & DYN records could be quite long, so set max_blr_length & max_dyn_length trace options accordingly.

Practical usage of Trace & Audit

Operational modes

General advices

Useful operations (categories)

Trace specific activities

System profiling

System audit

How I use Trace & Audit

Single purpose custom scripts

In Python it's piece of cake:

  • FDB provides convenient access to Firebird Services, including work with user trace sessions
  • Python is great for work with any kind of data (scientists love it for good reasons)

Example:

Configurable CLI application that collects execution PLANs over time and creates report of usage frequency and list of used / unused indices has only 115 lines of Python code.

strace.py

strace is universal tool for User Trace logging to screen or text file for later processing.

usage: strace.py [-h] [-o HOST] [-u USER] [-p PASSWORD] [-i INCLUDE]
  [-e EXCLUDE] [--connections] [--transactions]
  [--statements {prepare,plan,start,finish,free} [{prepare,plan,start,finish,free} ...]]
  [--procedures {start,finish} [{start,finish} ...]]
  [--triggers {start,finish} [{start,finish} ...]] [--perf]
  [--context] [--errors] [--sweep] [--blr] [--blr-requests]
  [--dyn] [--dyn-requests] [--time-threshold TIME_THRESHOLD]
  [--max-sql-length MAX_SQL_LENGTH]
  [--max-blr-length MAX_BLR_LENGTH]
  [--max-dyn-length MAX_DYN_LENGTH]
  [--max-arg-length MAX_ARG_LENGTH]
  [--max-arg-count MAX_ARG_COUNT]
  database [FILENAME]

mtrace.py

You can't easily stop code that runs user trace session. Best way is to use other process to stop it.

mtrace can list, suspend, resume or stop user trace sessions.

usage: mtrace.py [-h] [-o HOST] [-u USER] [-p PASSWORD] [-l]
                 [-s SESSION_ID] [-r SESSION_ID] [-c SESSION_ID]

trace Python library

trace Python library (cont.)

Example Audit log:

Interactive work with parsed trace

_images/ipqt2.png

Attachments

_images/ipqt3.png

Transactions

_images/ipqt4.png

Statements

_images/ipqt5.png

Table access details

_images/ipqt6.png

TraceEventSequence(s)

TraceEventGroups

TraceEventGroup

_images/ipqt7.png

Tree of events

1

_images/ipqt8.png

2

_images/ipqt9.png

Here begins the real fun but I'm pretty sure that we have run out of time

Thanks for sponsoring

_images/conf2014.png