
[BUG] docker container crash caused by "SIGSEGV: segmentation violation" or "corrupted size vs. prev_size"

Open MansuyDavid opened this issue 2 months ago • 13 comments

Description

One of our Docker containers regularly crashes with messages such as "SIGSEGV: segmentation violation", and then dumps the whole memory stack.

Steps to Reproduce

I'm not sure this is easily reproducible, but this container monitors 14 databases, and 6 of them are unreachable (see the last messages before the crash), so the two may be linked. The version in use is 2.0.4 for now.

time=2025-10-03T09:26:45.206Z level=ERROR source=collector.go:241 msg="Error pinging database" error="user=\"sys\" standalone params={authMode:2 connectionClass:<nil> connectionClassLength:0 purity:0 newPassword:<nil> newPasswordLength:0 appContext:<nil> numAppContext:0 externalAuth:0 externalHandle:<nil> pool:<nil> tag:<nil> tagLength:0 matchAnyTag:0 outTag:<nil> outTagLength:0 outTagFound:0 shardingKeyColumns:<nil> numShardingKeyColumns:0 superShardingKeyColumns:<nil> numSuperShardingKeyColumns:0 outNewSession:0}: ORA-12541: Cannot connect. No listener at host 10.1.1.1 port 1521.\nHelp: https://docs.oracle.com/error-help/db/ora-12541/" database=oracle_DBVISIT_DMDO

SIGSEGV: segmentation violation
PC=0x7f41da5f43e1 m=11 sigcode=1 addr=0x4e574f4e4b4e6d
signal arrived during cgo execution
...
[memory dump logs]
...
time=2025-10-03T09:27:45.631Z level=INFO source=main.go:75 msg="FREE_INTERVAL end var is not present, will not periodically attempt to release memory"

Sometimes I get this error instead:

time=2025-10-03T09:32:12.332Z level=ERROR source=collector.go:241 msg="Error pinging database" error="user=\"sys\" standalone params={authMode:2 connectionClass:<nil> connectionClassLength:0 purity:0 newPassword:<nil> newPasswordLength:0 appContext:<nil> numAppContext:0 externalAuth:0 externalHandle:<nil> pool:<nil> tag:<nil> tagLength:0 matchAnyTag:0 outTag:<nil> outTagLength:0 outTagFound:0 shardingKeyColumns:<nil> numShardingKeyColumns:0 superShardingKeyColumns:<nil> numSuperShardingKeyColumns:0 outNewSession:0}: ORA-12541: Cannot connect. No listener at host 10.1.1.1 port 1521.\nHelp: https://docs.oracle.com/error-help/db/ora-12541/" database=oracle_DBVISIT_DMDO

corrupted size vs. prev_size
SIGABRT: abort
PC=0x7fdcd3a555ef m=53 sigcode=18446744073709551610
signal arrived during cgo execution
...
[memory dump log]
...
time=2025-10-03T09:33:05.819Z level=INFO source=main.go:75 msg="FREE_INTERVAL end var is not present, will not periodically attempt to release memory"

Environment

  • OS: docker container amd64
  • Oracle Database version: 12 and 19
  • Exporter version: 2.0.4 but I will upgrade to 2.1.0 asap to see if this changes.

to discuss

I don't know if you need the memory dump to help with this. I'm not even sure it needs to be investigated for now, because the issue happens on only 1 of my 13 containers (the only one that monitors more than 5 databases), and having 6 databases unavailable may be related.

I'm open to suggestions.

MansuyDavid avatar Oct 03 '25 10:10 MansuyDavid

Can you share the memory dump? I'd like to see where in cgo execution this failed. There are some known issues with godror/odpi-c that cause problems when managing larger numbers of connections.

anders-swanson avatar Oct 03 '25 14:10 anders-swanson

Hello,

This is the latest dump, from this morning: oracle_exporter_dump.zip

As we resolve the connection issues one by one, the exporter seems to crash less, so the crashes are probably linked to connection issues.

MansuyDavid avatar Oct 07 '25 07:10 MansuyDavid

As shown below, reducing the connection issues seems to stabilize the exporter:

Image

MansuyDavid avatar Oct 07 '25 07:10 MansuyDavid

I have noticed the godror driver crashes frequently when acquiring connections in a multithreaded environment. We have coded some workarounds but it is not quite stable it seems... I am thinking of implementing an alternate build that uses the go-ora driver, as it's a pure go driver without the need for CGO.

Once that is implemented, would you like to test the exporter variant with go-ora?

anders-swanson avatar Oct 07 '25 13:10 anders-swanson

Yes, I'll be happy to install it in my environments.

But if stability keeps improving as the connection setups are fixed (as shown in the graph), I may not be able to detect a clear improvement with the new driver 🤷

MansuyDavid avatar Oct 07 '25 16:10 MansuyDavid

With the v2.2.0 release, the exporter can now be built with go-ora: https://oracle.github.io/oracle-db-appdev-monitoring/docs/advanced/go-ora

If needed, I can provide a build. Since go-ora support is experimental in this release, a go-ora build is not present on the releases page.

anders-swanson avatar Oct 29 '25 14:10 anders-swanson

Thank you for that. I am currently building your 2.2.0 with go-ora and will install it asap.

About the build: I think the Makefile's docker step is missing "--build-arg" for TAGS and CGO_ENABLED. (I'm not familiar with Makefiles and I'm on Windows, so I just run a "docker build".)

I will let you know what I notice with this driver.

MansuyDavid avatar Oct 30 '25 08:10 MansuyDavid

I am getting some errors that I am not able to fix.

time=2025-10-30T08:21:14.188Z level=ERROR source=collector.go:246 msg="Error pinging database" error="parse \"oracle://U_PROMETHEUS:***@(DESCRIPTION_LIST=(FAILOVER=YES)(LOAD_BALANCE= NO)(DESCRIPTION=(CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(CONNECT_DATA=(SERVICE_NAME=PROMETHEUS)(SID=PROD))(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.1.1.2)(PORT=1521))))(DESCRIPTION=(CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(CONNECT_DATA=(SERVICE_NAME=PROMETHEUS)(SID=PROD))(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.1.1.3)(PORT=1521)))))\": invalid character \" \" in host name" database=oracle_PROD_DMLB
time=2025-10-30T08:21:14.188Z level=ERROR source=collector.go:246 msg="Error pinging database" error="parse \"oracle://sys:***@(DESCRIPTION_LIST=(FAILOVER=YES)(LOAD_BALANCE= NO)(DESCRIPTION=(CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(CONNECT_DATA=(SERVICE_NAME=SECOURS)(SID=PROD))(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.1.1.7)(PORT=1521))))(DESCRIPTION=(CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(CONNECT_DATA=(SERVICE_NAME=SECOURS)(SID=PROD))(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.1.1.6)(PORT=1521)))))\": invalid character \" \" in host name" database=oracle_DBVISIT_DMDO

My config file hasn't changed and contains:

  oracle_PROD_DMLB:
    ## Database username
    username: U_PROMETHEUS
    ## Database password
    password: ***
    ## Database connection url
    url: (DESCRIPTION_LIST=(FAILOVER=YES)(LOAD_BALANCE= NO)(DESCRIPTION=(CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(CONNECT_DATA=(SERVICE_NAME=PROMETHEUS)(SID=PROD))(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.1.1.2)(PORT=1521))))(DESCRIPTION=(CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(CONNECT_DATA=(SERVICE_NAME=PROMETHEUS)(SID=PROD))(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.1.1.3)(PORT=1521)))))
  oracle_DBVISIT_DMDO:
    ## Database username
    username: sys
    ## Database password
    password: ***
    ## Database connection url
    url: (DESCRIPTION_LIST=(FAILOVER=YES)(LOAD_BALANCE= NO)(DESCRIPTION=(CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(CONNECT_DATA=(SERVICE_NAME=SECOURS)(SID=PROD))(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.1.1.7)(PORT=1521))))(DESCRIPTION=(CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(CONNECT_DATA=(SERVICE_NAME=SECOURS)(SID=PROD))(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.1.1.6)(PORT=1521)))))
    role: "SYSDBA"

I found that the URL constructed in connec-goora is not the same as the one described in the go-ora documentation (https://github.com/sijms/go-ora/tree/v2.9.0), so I modified it to use go_ora.BuildJDBC:

// BuildJDBC create url from user, password and JDBC description string
func BuildJDBC(user, password, connStr string, options map[string]string) string {
	if options == nil {
		options = make(map[string]string)
	}
	options["connStr"] = connStr
	return BuildUrl("", 0, "", user, password, options)
}

which creates a DSN formatted like this:

dsn="oracle://U_PROMETHEUS:****@:0/?connStr=%28DESCRIPTION_LIST%3D%28FAILOVER%3DYES%29%28LOAD_BALANCE%3D+NO%29%28DESCRIPTION%3D%28CONNECT_TIMEOUT%3D3%29%28RETRY_COUNT%3D3%29%28CONNECT_DATA%3D%28SERVICE_NAME%3DPROMETHEUS%29%28SID%3DPROD%29%29%28ADDRESS_LIST%3D%28ADDRESS%3D%28PROTOCOL%3DTCP%29%28HOST%3D10.1.1.1%29%28PORT%3D1521%29%29%29%29%28DESCRIPTION%3D%28CONNECT_TIMEOUT%3D3%29%28RETRY_COUNT%3D3%29%28CONNECT_DATA%3D%28SERVICE_NAME%3DPROMETHEUS%29%28SID%3DPROD%29%29%28ADDRESS_LIST%3D%28ADDRESS%3D%28PROTOCOL%3DTCP%29%28HOST%3D10.1.1.2%29%28PORT%3D1521%29%29%29%29%29"
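Both the earlier parse error and the escaped DSN shape can be reproduced with Go's standard net/url package alone. The following is only a minimal sketch, not go-ora's or the exporter's actual code: the descriptor is a shortened illustrative one, and `rawDSN`, `escapedDSN`, `parses`, and `roundTrip` are hypothetical helper names. It assumes the DSN ends up being parsed by `url.Parse`, which the `parse "oracle://...": invalid character " " in host name` message suggests.

```go
package main

import (
	"fmt"
	"net/url"
)

// Shortened, illustrative descriptor. Note the space after "LOAD_BALANCE=",
// as in the config file shown earlier.
const descriptor = "(DESCRIPTION_LIST=(FAILOVER=YES)(LOAD_BALANCE= NO)" +
	"(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=10.1.1.2)(PORT=1521))" +
	"(CONNECT_DATA=(SERVICE_NAME=PROMETHEUS))))"

// rawDSN (hypothetical) puts the descriptor unescaped in the host position.
func rawDSN(user, password, connStr string) string {
	return "oracle://" + user + ":" + password + "@" + connStr
}

// escapedDSN (hypothetical) carries the descriptor as a query parameter,
// following the shape of the DSN shown above.
func escapedDSN(user, password, connStr string) string {
	q := url.Values{}
	q.Set("connStr", connStr)
	u := url.URL{
		Scheme:   "oracle",
		User:     url.UserPassword(user, password),
		Host:     ":0",
		Path:     "/",
		RawQuery: q.Encode(), // query-escapes "(", ")", "=" and the space
	}
	return u.String()
}

// parses reports whether net/url accepts the DSN.
func parses(dsn string) bool {
	_, err := url.Parse(dsn)
	return err == nil
}

// roundTrip parses the DSN and returns the decoded connStr parameter,
// or an empty string if the DSN does not parse.
func roundTrip(dsn string) string {
	u, err := url.Parse(dsn)
	if err != nil {
		return ""
	}
	return u.Query().Get("connStr")
}

func main() {
	raw := rawDSN("user", "secret", descriptor)
	_, err := url.Parse(raw)
	fmt.Println("raw DSN error:", err)

	esc := escapedDSN("user", "secret", descriptor)
	fmt.Println("escaped DSN:", esc)
	fmt.Println("descriptor survives round trip:", roundTrip(esc) == descriptor)
}
```

The space in "LOAD_BALANCE= NO" is what net/url rejects in the host position; once the descriptor is query-escaped, the same string parses and decodes back unchanged.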

With that, some databases can be connected and scraped. 🎉 But only those with no "failover" between 2 instances in the JDBC URL (so not the examples given previously). For the ones that do have failover in the JDBC URL, I get 2 kinds of errors. The order of the 2 servers in the JDBC URL seems to affect which error I receive, but not for all connections: for some, swapping the two servers doesn't change the error.

time=2025-10-30T10:54:52.489Z level=ERROR source=collector.go:246 msg="dmy: Error pinging database" error="ORA-12564: TNS connection refused" database=oracle_PROD_DMLB
time=2025-10-30T10:54:52.529Z level=ERROR source=collector.go:246 msg="dmy: Error pinging database" error="dial tcp 10.x.x.x:1521: connect: connection refused" database=oracle_PROD_DMDO

For the ORA-12564 error, I found these entries in the TNS logs:

TNS-01153: Failed to process string: 
(DESCRIPTION_LIST=(FAILOVER=YES)(LOAD_BALANCE= NO)(DESCRIPTION=(CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(CONNECT_DATA=(SERVICE_NAME=SECOURS)(SID=PROD))(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.x.x.x)(PORT=1521))))(DESCRIPTION=(CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(CONNECT_DATA=(SERVICE_NAME=SECOURS)(SID=PROD))(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.x.x.x)(PORT=1521)))))
NL-00305: the specified path name does not exist

I'm pretty sure the TNS should not receive the whole JDBC URL; it's the driver's job to unwrap this URL and use only the relevant part of it for each server. So I'm not sure that this driver can handle this kind of JDBC URL.

When I looked at how go-ora parses the JDBC URL, I found that the driver can extract multiple "servers", but then applies a regex to the whole JDBC URL to find a single "service_name", a single "SID", and a single "instance_name". (That is fine for me, as the same service_name is used on both servers, but what about people who have a different service_name on each server?):

func (op *ConnectionOption) UpdateDatabaseInfo(connStr string) error {
	op.connStr = connStr
	var err error
	op.Servers, err = extractServers(connStr)
	if err != nil {
		return err
	}
	if len(op.Servers) == 0 {
		return errors.New("no address passed in connection string")
	}
	r, err := regexp.Compile(`(?i)\(\s*SERVICE_NAME\s*=\s*([\w,\.,\-]+)\s*\)`)
	if err != nil {
		return err
	}
	match := r.FindStringSubmatch(connStr)
	if len(match) > 1 {
		op.DatabaseInfo.ServiceName = match[1]
	}
	r, err = regexp.Compile(`(?i)\(\s*SID\s*=\s*([\w,\.,\-]+)\s*\)`)
	if err != nil {
		return err
	}
	match = r.FindStringSubmatch(connStr)
	if len(match) > 1 {
		op.DatabaseInfo.SID = match[1]
	}
	r, err = regexp.Compile(`(?i)\(\s*INSTANCE_NAME\s*=\s*([\w,\.,\-]+)\s*\)`)
	if err != nil {
		return err
	}
	match = r.FindStringSubmatch(connStr)
	if len(match) > 1 {
		op.DatabaseInfo.InstanceName = match[1]
	}
	return nil
}
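To illustrate the concern about different service names per server, here is a small sketch using only Go's standard regexp package. It reuses the SERVICE_NAME pattern quoted above; the descriptor, the service names PRIMARY_SVC and STANDBY_SVC, and the helper functions are made up for illustration and are not part of go-ora. It shows that FindStringSubmatch keeps only the first match, whereas FindAllStringSubmatch would see both.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same SERVICE_NAME pattern as in the function quoted above.
var serviceNameRe = regexp.MustCompile(`(?i)\(\s*SERVICE_NAME\s*=\s*([\w,\.,\-]+)\s*\)`)

// Made-up descriptor in which each DESCRIPTION has its own SERVICE_NAME.
const twoServices = "(DESCRIPTION_LIST=(FAILOVER=YES)" +
	"(DESCRIPTION=(CONNECT_DATA=(SERVICE_NAME=PRIMARY_SVC))" +
	"(ADDRESS=(PROTOCOL=TCP)(HOST=10.1.1.2)(PORT=1521)))" +
	"(DESCRIPTION=(CONNECT_DATA=(SERVICE_NAME=STANDBY_SVC))" +
	"(ADDRESS=(PROTOCOL=TCP)(HOST=10.1.1.3)(PORT=1521))))"

// firstServiceName returns only the first SERVICE_NAME in the string,
// which is what a single FindStringSubmatch call yields.
func firstServiceName(connStr string) string {
	m := serviceNameRe.FindStringSubmatch(connStr)
	if len(m) > 1 {
		return m[1]
	}
	return ""
}

// allServiceNames returns every SERVICE_NAME in order of appearance.
func allServiceNames(connStr string) []string {
	var names []string
	for _, m := range serviceNameRe.FindAllStringSubmatch(connStr, -1) {
		names = append(names, m[1])
	}
	return names
}

func main() {
	fmt.Println("first match only:", firstServiceName(twoServices))
	fmt.Println("all matches:", allServiceNames(twoServices))
}
```

With a single global match, the second DESCRIPTION's service name is never recorded, so a setup with a different service_name per server would presumably connect to both servers with the first one.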

I'm not sure I can dig any deeper to find what is wrong in all of this.

Thank you in advance for your help.

MansuyDavid avatar Oct 30 '25 12:10 MansuyDavid

Yes, the go-ora connection string is handled differently than godror. I didn't see a way to make them the same.

If you are using username/password authentication, you should be able to authenticate from go-ora like this:

databases:
  mydb:
    username: myuser
    password: ****
    url: <hostname>:<port>?<connection string>

anders-swanson avatar Oct 30 '25 16:10 anders-swanson

As go-ora support is currently experimental, we are looking for ways to improve it! And as you mention, there may be some limitations on the driver, which we'll need to dig deeper on. Thanks for providing this information.

anders-swanson avatar Oct 30 '25 16:10 anders-swanson

As I said, the databases I cannot connect to are the ones that have 2 servers, so I cannot use host:port?connection string. (I had to adapt your code to call the BuildJDBC function, but I didn't change anything in config.yml, so all databases are treated the same way.)

So I understand that the go-ora driver cannot handle complex JDBC URLs with failover for now. If that is the case, I cannot test it in my production environment, as 80% of the databases I need to scrape are configured like that.

MansuyDavid avatar Oct 30 '25 18:10 MansuyDavid

It seems there are two options for this kind of use case:

  • if using godror, we recommend running multiple exporters when monitoring a large number of databases, to reduce CGO crashes. More database connection pools are associated with a higher crash rate in the driver's CGO layer. We have tried several mitigations at the exporter layer, but there is only so much we can do when the crash originates from CGO.
  • contribute this functionality to go-ora. It seems this driver lacks support for some more complex connection strings, such as the failover options you are using.

@MansuyDavid would you mind creating a related issue in go-ora with your findings?

anders-swanson avatar Nov 03 '25 17:11 anders-swanson

I also recognize that neither of these is an immediate fix - one is a workaround, and the other may take some time.

anders-swanson avatar Nov 03 '25 17:11 anders-swanson