udacity_hadoop_intro

regex doubt?

Open huangmiumang opened this issue 6 years ago • 2 comments

I am studying your Hadoop code, and I have a question about https://github.com/serebrov/udacity_hadoop_intro/blob/master/code_access_log_file_hits/mapper.py:

regex = '([(\d\.)]+) ([^\s]+) ([^\s]+) \[(.*?)\] "(\w+) ([^\s]+) ([^\s]+)" (\d+) ([^\s]+)'

Why did you write ([(\d.)]+) rather than ([\d.]+)?

huangmiumang, Mar 28 '18 15:03

@shamomanba I think you are right: the additional parentheses are not needed, and they are probably a typo.

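The error is not critical, though, since the pattern still works. Inside a character class, ( and ) are literal characters, so [(\d.)] matches digits, dots, and parentheses, and an ordinary IP address contains no parentheses. Here is an example with ([(\d.)]+):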

$ python
Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> regex = '([(\d\.)]+) ([^\s]+) ([^\s]+) \[(.*?)\] "(\w+) ([^\s]+) ([^\s]+)" (\d+) ([^\s]+)'
>>> r = re.compile(regex)
>>> line='10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/js/lowpro.js HTTP/1.1" 200 10469'
>>> matches = r.match(line)
>>> matches.groups()
('10.223.157.186', '-', '-', '15/Jul/2009:15:50:35 -0700', 'GET', '/assets/js/lowpro.js', 'HTTP/1.1', '200', '10469')

And with ([\d.]+):

$ python
Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> regex = '([\d\.]+) ([^\s]+) ([^\s]+) \[(.*?)\] "(\w+) ([^\s]+) ([^\s]+)" (\d+) ([^\s]+)'
>>> r = re.compile(regex)
>>> line='10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/js/lowpro.js HTTP/1.1" 200 10469'
>>> matches = r.match(line)
>>> matches.groups()
('10.223.157.186', '-', '-', '15/Jul/2009:15:50:35 -0700', 'GET', '/assets/js/lowpro.js', 'HTTP/1.1', '200', '10469')

So the result is the same in both cases.
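The two patterns would only differ on input where parentheses appear in the first field. The following is a minimal sketch of that case, assuming Python 3; the `odd` log line is made up for illustration, and the names `TAIL`, `loose`, and `strict` are not from mapper.py.

```python
import re

# Shared tail of the access-log pattern from this thread.
# A raw string keeps the backslashes intact and avoids escape warnings in Python 3.
TAIL = r' ([^\s]+) ([^\s]+) \[(.*?)\] "(\w+) ([^\s]+) ([^\s]+)" (\d+) ([^\s]+)'

loose = re.compile(r'([(\d.)]+)' + TAIL)   # original: the class also admits ( and )
strict = re.compile(r'([\d.]+)' + TAIL)    # suggested: digits and dots only

normal = '10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/js/lowpro.js HTTP/1.1" 200 10469'
# Hypothetical malformed line (assumption, not taken from the real log).
odd = '(10.0.0.1) - - [15/Jul/2009:15:50:35 -0700] "GET /index.html HTTP/1.1" 200 512'

# On a well-formed line both patterns capture identical groups.
same_on_normal = loose.match(normal).groups() == strict.match(normal).groups()

# On the malformed line only the loose pattern matches.
loose_odd = loose.match(odd)
strict_odd = strict.match(odd)

print(same_on_normal)      # True
print(loose_odd.group(1))  # (10.0.0.1)
print(strict_odd)          # None
```

So ([\d.]+) is slightly stricter: it would reject a line whose first field contains parentheses, while ([(\d.)]+) would accept it.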

serebrov, Mar 28 '18 19:03

Thank you for your answer. I understand it now.

huangmiumang, Mar 30 '18 10:03