apache-log-parser
Unable to pickle parsed output
I'm trying to do some multiprocess/distributed processing of Apache logs, which relies on pickle to serialize and deserialize data moving between scheduler and worker processes.
However, deserialization fails on the parsed output, in my case specifically on time_received_tz_datetimeobj and time_received_utc_datetimeobj, for input strings like:
import apache_log_parser
import pickle
mylist = ['157.55.39.31 - - [21/Mar/2019:07:56:41 +0000] "GET / HTTP/1.1" 200 6878 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
'40.77.167.37 - - [21/Mar/2019:07:59:11 +0000] "GET / HTTP/1.1" 301 469 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"'
]
logparser = apache_log_parser.make_parser('%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"')
parsed_logline = logparser(mylist[0])
_ = pickle.dumps(parsed_logline)
# this causes error:
pickle.loads(_)
(This is with Python 3.6.6 and apache-log-parser 1.7.0, by the way.)
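A possible explanation, which is an assumption and not something confirmed against the parser's source: Python's tzinfo pickling support re-creates the object by calling its class with no arguments unless the class defines `__getinitargs__`. A tzinfo subclass whose constructor requires an argument would therefore pickle fine and then fail on load. The sketch below reproduces that behaviour with stand-in classes; `NeedsArgOffset`, `PicklableOffset` and `round_trips` are hypothetical names, not part of apache-log-parser.

```python
import datetime
import pickle


class NeedsArgOffset(datetime.tzinfo):
    """Hypothetical tzinfo whose constructor requires an offset string."""

    def __init__(self, offset_string):
        self._name = offset_string
        sign = -1 if offset_string.startswith("-") else 1
        digits = offset_string.lstrip("+-")
        minutes = int(digits[:2]) * 60 + int(digits[2:4])
        self._offset = datetime.timedelta(minutes=sign * minutes)

    def utcoffset(self, dt):
        return self._offset

    def dst(self, dt):
        return datetime.timedelta(0)

    def tzname(self, dt):
        return self._name


class PicklableOffset(NeedsArgOffset):
    """Same behaviour, but tells pickle which constructor arguments to use."""

    def __getinitargs__(self):
        return (self._name,)


def round_trips(tz):
    """Return True if a datetime carrying `tz` survives a pickle round trip."""
    stamp = datetime.datetime(2019, 3, 21, 7, 56, 41, tzinfo=tz)
    try:
        return pickle.loads(pickle.dumps(stamp)) == stamp
    except TypeError:
        # Raised on load when the tzinfo class is called without arguments.
        return False
```

If this is indeed the cause, adding `__getinitargs__` to the parser's timezone class would be one way to fix it without changing the values users see.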
I can fix this in my implementation by converting the '0000' timezone to UTC:
import datetime

def to_utc(datetimeobj):
    # str() of the parser's tzinfo for a zero offset is "'0000'" (quotes included).
    if str(datetimeobj.tzinfo) == "'0000'":
        return datetimeobj.astimezone(datetime.timezone.utc)
    else:
        return datetimeobj

parsed_logline['time_received_tz_datetimeobj'] = to_utc(parsed_logline['time_received_tz_datetimeobj'])
parsed_logline['time_received_utc_datetimeobj'] = to_utc(parsed_logline['time_received_utc_datetimeobj'])
But this seems like something more appropriate to do in the parser. That said, I'm not sure if this would break backwards compatibility with other Python versions.
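The workaround above only handles a zero offset. A more general sketch, using only the standard library, swaps any non-standard tzinfo for a `datetime.timezone` with the same offset, so the instant and the displayed offset are preserved. `to_stdlib_tz`, `make_picklable` and `LegacyOffset` are hypothetical names introduced for illustration; `LegacyOffset` merely stands in for a parser-specific timezone class.

```python
import datetime
import pickle


class LegacyOffset(datetime.tzinfo):
    """Hypothetical stand-in for a parser-specific tzinfo (needs an argument)."""

    def __init__(self, minutes):
        self._offset = datetime.timedelta(minutes=minutes)

    def utcoffset(self, dt):
        return self._offset

    def dst(self, dt):
        return datetime.timedelta(0)

    def tzname(self, dt):
        return "legacy"


def to_stdlib_tz(value):
    """Swap a custom tzinfo for datetime.timezone, keeping the same offset."""
    if not isinstance(value, datetime.datetime) or value.tzinfo is None:
        return value  # not a datetime, or a naive one: leave untouched
    if isinstance(value.tzinfo, datetime.timezone):
        return value  # already a standard-library timezone
    offset = value.utcoffset()
    if offset is None:
        return value
    return value.replace(tzinfo=datetime.timezone(offset))


def make_picklable(parsed):
    """Return a copy of a parsed-line dict with every datetime normalized."""
    return {key: to_stdlib_tz(val) for key, val in parsed.items()}


# Usage: normalize a record, then round-trip it through pickle.
record = {
    "status": "200",
    "time_received_tz_datetimeobj": datetime.datetime(
        2019, 3, 21, 7, 56, 41, tzinfo=LegacyOffset(0)
    ),
}
clean = make_picklable(record)
restored = pickle.loads(pickle.dumps(clean))
```

Because `datetime.timezone` has been in the standard library since Python 3.2, this approach should not depend on the parser version, though it does replace the tzinfo type that downstream code sees.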