udacity_hadoop_intro
regex doubt?
I am studying your Hadoop code, but I found something I don't understand in https://github.com/serebrov/udacity_hadoop_intro/blob/master/code_access_log_file_hits/mapper.py:
regex = '([(\d.)]+) ([^\s]+) ([^\s]+) \[(.*?)\] "(\w+) ([^\s]+) ([^\s]+)" (\d+) ([^\s]+)'
Why did you write ([(\d.)]+) rather than ([\d.]+)?
@shamomanba I think you are right: the additional parentheses are not needed. It was probably a typo.
The error is not critical, though, since the pattern still works. Here is an example with ([(\d.)]+):
$ python
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> regex = '([(\d\.)]+) ([^\s]+) ([^\s]+) \[(.*?)\] "(\w+) ([^\s]+) ([^\s]+)" (\d+) ([^\s]+)'
>>> r = re.compile(regex)
>>> line='10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/js/lowpro.js HTTP/1.1" 200 10469'
>>> matches = r.match(line)
>>> matches.groups()
('10.223.157.186', '-', '-', '15/Jul/2009:15:50:35 -0700', 'GET', '/assets/js/lowpro.js', 'HTTP/1.1', '200', '10469')
And with ([\d.]+):
$ python
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> regex = '([\d\.]+) ([^\s]+) ([^\s]+) \[(.*?)\] "(\w+) ([^\s]+) ([^\s]+)" (\d+) ([^\s]+)'
>>> r = re.compile(regex)
>>> line='10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/js/lowpro.js HTTP/1.1" 200 10469'
>>> matches = r.match(line)
>>> matches.groups()
('10.223.157.186', '-', '-', '15/Jul/2009:15:50:35 -0700', 'GET', '/assets/js/lowpro.js', 'HTTP/1.1', '200', '10469')
So the result is the same in both cases.
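The two patterns are not strictly equivalent, though. Inside a character class, "(" and ")" are literal characters, so [(\d.)] also accepts parentheses in the first field. Here is a minimal sketch of that difference; the `odd` line is a made-up input for illustration, not something taken from the course's access log, and the variable names are my own:

```python
import re

# Everything after the first group is identical in both patterns.
TAIL = r' ([^\s]+) ([^\s]+) \[(.*?)\] "(\w+) ([^\s]+) ([^\s]+)" (\d+) ([^\s]+)'
with_parens = re.compile(r'([(\d.)]+)' + TAIL)
without_parens = re.compile(r'([\d.]+)' + TAIL)

# The sample line from the sessions above.
normal = ('10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] '
          '"GET /assets/js/lowpro.js HTTP/1.1" 200 10469')
# Hypothetical malformed line: the address is wrapped in parentheses.
odd = ('(10.223.157.186) - - [15/Jul/2009:15:50:35 -0700] '
       '"GET /assets/js/lowpro.js HTTP/1.1" 200 10469')

# Both patterns extract the same groups from a normal line.
same = with_parens.match(normal).groups() == without_parens.match(normal).groups()

# Only the pattern with the extra parentheses accepts the odd line.
odd_with = with_parens.match(odd)
odd_without = without_parens.match(odd)

print(same)                # True
print(odd_with.group(1))   # (10.223.157.186)
print(odd_without)         # None
```

For well-formed log lines this makes no difference, which is why the mapper still works, but ([\d.]+) is the stricter of the two.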
Thanks for your answer. I understand it now.