collector v.3 - log4j2 - log-file per crawler
Is it possible to configure a log file per crawler, as it worked in v.2? I tried the following config, but sd:type does not get resolved. Thanks
<Configuration status="INFO" name="Norconex HTTP Collector">
<Appenders>
<Console name="Console" target="SYSTEM_OUT">
<PatternLayout>
<pattern>%d{HH:mm:ss.SSS} [%t] %-5level %c{1} - %msg%n</pattern>
</PatternLayout>
</Console>
<RollingFile name="RollingFile" fileName="${env:NC_LOGDIR}/latest/logs/${sd:type}.log"
filePattern="${env:NC_LOGDIR}/backup/logs/$${date:yyyy}/$${date:MM}/$${date:dd}/${sd:type}.log.%i.gz">
<PatternLayout>
<Pattern>%d %p %c{1.} [%t] %m%n</Pattern>
</PatternLayout>
<Policies>
<OnStartupTriggeringPolicy />
</Policies>
</RollingFile>
</Appenders>
</Configuration>
As you found out, in v3 the code base no longer controls log writing, so people can implement logging however they want with their favourite logger implementation. For Log4j2, your approach looks conceptually good, but I believe variables are replaced before the RollingFile is created. To get around that, I think you have to use routing, likely combined with filters. By default, the crawler id is printed with each log line (except for logging that is not specific to a crawler), so you can use a mix of filters and regular expressions to work around this.
Still, you may have a hard time getting it to work: while the crawler id is available during Log4j2 pattern layout resolution, I am not so sure about routing variable substitutions. For the latter, you may need to rely on the Log4j2 MDC (Mapped Diagnostic Context). Unfortunately, the crawler ids are currently not set in the logger thread context; this has to be done explicitly in code.
SLF4J (the logging abstraction framework used) supports MDC and passes it to supporting logging implementations. For that reason, I am marking this as a feature request: I'd like to make use of MDC in the code base to simplify routing to different files without the need for regular expressions or filters.
Once implemented, I'll share a Log4j2 configuration sample here.
Just to add, since the crawler name appears in the thread name, you can use the following variable in your routing:
${event:ThreadName}
See https://logging.apache.org/log4j/log4j-2.15.1/manual/lookups.html#EventLookup
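For illustration, a minimal (untested) routing sketch based on that lookup could look like the following; the appender name and log path are placeholders, and since a crawler can run several worker threads, this would route per thread name rather than strictly per crawler:
<Routing name="Routing">
  <!-- "$$" defers the event lookup to log-event time. -->
  <Routes pattern="$${event:ThreadName}">
    <Route>
      <File name="File-${event:ThreadName}"
            fileName="${env:NC_LOGDIR}/latest/logs/${event:ThreadName}.log">
        <PatternLayout>
          <pattern>%d %p %c{1.} [%t] %m%n</pattern>
        </PatternLayout>
      </File>
    </Route>
  </Routes>
</Routing>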
Thanks Pascal! I had no luck with Routing and ended up with a workaround using an environment variable, which is set to the site name in collector-http.sh. I'll try event:ThreadName and let you know if that works.
Wouldn't that give you only one log file per collector, as opposed to one per crawler? I thought you were trying to get one log per crawler in cases where you have multiple crawlers defined in a single collector config. If you just want one log per collector, it is already like that: you should just need to change the Console appender to a file-based one (which you can parameterize as you did), as shown in the sketch below. Alternatively, you can keep the default logging (to STDOUT) and redirect the command-line output to a file when you launch the script.
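For example, here is a minimal (untested) sketch swapping the Console appender for a file-based one, reusing the NC_LOGDIR environment variable from your earlier config (the collector.log file name is a placeholder):
<Appenders>
  <File name="File" fileName="${env:NC_LOGDIR}/latest/logs/collector.log">
    <PatternLayout>
      <pattern>%d{HH:mm:ss.SSS} [%t] %-5level %c{1} - %msg%n</pattern>
    </PatternLayout>
  </File>
</Appenders>
You would then point the AppenderRef of your loggers to File instead of Console.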
I just made a snapshot release that adds a few attributes to the logging context. They are:
- crawler.id → the crawler id, as configured.
- crawler.id.safe → the crawler id encoded to be safe to use as a file name on any file system.
- collector.id → the collector id, as configured.
- collector.id.safe → the collector id encoded to be safe to use as a file name on any file system.
Using Log4j2, the following produces one file per configured crawler, and any non-crawler-specific log entries go to a collector log.
<Configuration status="INFO" name="my-collector-logs">
<Properties>
<Property name="pattern">%d{HH:mm:ss.SSS} [%t] %-5level %c{1} - %msg%n</Property>
</Properties>
<Appenders>
<Routing name="Routing">
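<!-- "$$" defers the ctx lookup to log-event time, so each event is routed by its crawler.id.safe context value. -->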
<Routes pattern="$${ctx:crawler.id.safe}">
<Route>
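<!-- Events without a crawler id (collector-level logging) fall back to collector.id.safe via the ":-" default syntax. -->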
<RollingFile
name="Cralwer-${ctx:crawler.id.safe:-${ctx:collector.id.safe}}"
fileName="/path/to/my/logs/${ctx:crawler.id.safe:-${ctx:collector.id.safe}}.log"
filePattern="/path/to/my/logs/${ctx:crawler.id.safe:-${ctx:collector.id.safe}}-%d{yyyyMMdd-HHmm}.log.gz">
<PatternLayout>
<pattern>${pattern}</pattern>
</PatternLayout>
<SizeBasedTriggeringPolicy size="10 MB" />
</RollingFile>
</Route>
</Routes>
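<!-- Stops and removes route appenders that have been idle for 15 minutes. -->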
<IdlePurgePolicy timeToLive="15" timeUnit="minutes"/>
</Routing>
</Appenders>
<Loggers>
<Logger name="com.norconex.collector.http" level="INFO" additivity="false">
<AppenderRef ref="Routing"/>
</Logger>
<Logger name="com.norconex.collector.core" level="INFO" additivity="false">
<AppenderRef ref="Routing"/>
</Logger>
<Logger name="com.norconex.importer" level="INFO" additivity="false">
<AppenderRef ref="Routing"/>
</Logger>
<Logger name="com.norconex.committer" level="INFO" additivity="false">
<AppenderRef ref="Routing"/>
</Logger>
<Logger name="com.norconex.commons.lang" level="INFO" additivity="false">
<AppenderRef ref="Routing"/>
</Logger>
<!-- ... -->
<Root level="INFO">
<AppenderRef ref="Routing"/>
</Root>
</Loggers>
</Configuration>
If we pretend to have a collector with id my-collector that defines 3 crawlers with ids my-crawler-A, my-crawler-B, and my-crawler-C, then you would get the following in the /path/to/my/logs/ folder:
- my-collector.log
- my-crawler-A.log
- my-crawler-B.log
- my-crawler-C.log