apache-log-parser

Unable to pickle parsed output

Open evan-burke opened this issue 5 years ago • 0 comments

I'm trying to do some multiprocess/distributed processing of apache logs, which uses serialization/deserialization via pickle for moving data between scheduler/worker processes.

However, deserialization fails on the parsed outputs, in my case specifically time_received_tz_datetimeobj and time_received_utc_datetimeobj, for input strings like:

import apache_log_parser
import pickle 

mylist = ['157.55.39.31 - - [21/Mar/2019:07:56:41 +0000] "GET / HTTP/1.1" 200 6878 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
          '40.77.167.37 - - [21/Mar/2019:07:59:11 +0000] "GET / HTTP/1.1" 301 469 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"'
         ]

logparser = apache_log_parser.make_parser('%h %l %u %t "%r" %>s %O "%{Referer}i" "%{User-Agent}i"')

parsed_logline = logparser(mylist[0])
pickled = pickle.dumps(parsed_logline)
# This raises an error:
pickle.loads(pickled)

(This is on Python 3.6.6 and apache-log-parser 1.7.0, by the way.)

I can fix this in my implementation by converting the '0000' timezone to UTC:

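import datetime

def to_utc(datetimeobj):
    # The tzinfo the parser attaches for a +0000 offset stringifies as "'0000'" (quotes included).
    if str(datetimeobj.tzinfo) == "'0000'":
        return datetimeobj.astimezone(datetime.timezone.utc)
    return datetimeobj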

parsed_logline['time_received_tz_datetimeobj'] = to_utc(parsed_logline['time_received_tz_datetimeobj'])
parsed_logline['time_received_utc_datetimeobj'] = to_utc(parsed_logline['time_received_utc_datetimeobj'])

But this seems like something more appropriate to do in the parser. That said, I'm not sure if this would break backwards compatibility with other Python versions.
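
A more general form of this workaround, as a minimal sketch only: instead of matching the '0000' string, swap whatever tzinfo is attached for a standard-library `datetime.timezone` with the same UTC offset, which pickles cleanly. The names `StringOffset`, `with_stdlib_tz`, and `make_picklable` are hypothetical and are not part of apache-log-parser. `StringOffset` is only a stand-in for a tzinfo subclass whose constructor requires an argument, which I assume is the kind of object causing the failure here.

```python
import datetime
import pickle


class StringOffset(datetime.tzinfo):
    """Hypothetical stand-in: a tzinfo subclass whose constructor needs an argument."""

    def __init__(self, offset_string):
        sign = -1 if offset_string.startswith("-") else 1
        digits = offset_string.lstrip("+-")
        self._offset = sign * datetime.timedelta(
            hours=int(digits[:2]), minutes=int(digits[2:4])
        )
        self._name = offset_string

    def utcoffset(self, dt):
        return self._offset

    def tzname(self, dt):
        return self._name

    def dst(self, dt):
        return datetime.timedelta(0)


def with_stdlib_tz(value):
    """Return value with its tzinfo replaced by an equivalent datetime.timezone."""
    if not isinstance(value, datetime.datetime) or value.tzinfo is None:
        return value  # not a datetime, or naive: nothing to normalize
    offset = value.utcoffset()
    if offset is None:
        return value
    return value.replace(tzinfo=datetime.timezone(offset))


def make_picklable(parsed):
    """Copy a parsed-line mapping, normalizing every datetime value."""
    return {key: with_stdlib_tz(val) for key, val in parsed.items()}


original = datetime.datetime(2019, 3, 21, 7, 56, 41, tzinfo=StringOffset("+0000"))
record = {"status": "200", "time_received_tz_datetimeobj": original}

# The un-normalized record is expected to fail on load, since unpickling
# re-creates the tzinfo class without its constructor argument.
try:
    pickle.loads(pickle.dumps(record))
    roundtrip_failed = False
except Exception:
    roundtrip_failed = True

# After normalization the round trip works and the instant is unchanged.
restored = pickle.loads(pickle.dumps(make_picklable(record)))
```

With the real parser this would presumably be called as `make_picklable(parsed_logline)` before handing the result to the scheduler, assuming the parsed output behaves like a plain dict. Unlike `to_utc`, it keeps non-zero offsets such as +0100 rather than only handling '0000'.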

evan-burke avatar Mar 22 '19 00:03 evan-burke