stdlib Function to pad string with zeros

Function to pad string with zeros

Open ivan-pi opened this issue 3 years ago • 15 comments

Description

A function to pad a string with zeros or another symbol to a given width could be helpful in many cases, especially together with the to_string() function for integer to string conversion.

Currently available methods to achieve this are using concatenation

width = 32
str = "hello"
padded_str = repeat('0',width-len(str)')//str

and perhaps also internal file I/O

write(buffer,'(A32)', blank='zero') "hello"   ! doesn't work for some reason

I propose to add the following functions

! left - padding
function padl(str,width,symbol) result(padded_str)
  character(len=*), intent(in) :: str
  integer, intent(in) :: width
  character(len=1), intent(in), optional :: symbol
  character(len=max(len(str),width) :: padded_str
end function
! right - padding
function padr(str,width,symbol) result(padded_str)
  character(len=*), intent(in) :: str
  integer, intent(in) :: width
  character(len=1), intent(in), optional :: symbol
  character(len=max(len(str),width) :: padded_str
end function

The original string is returned if width < len(str). The default padding symbol could be either the blank symbol ' ' or '0'. Special treatment of a leading sign symbol '+'/'-' might be also desirable.

Prior art

Python: zfill (Another method to achieve padding in Python is using the string class method .format())
Julia: lpad, rpad
MATLAB: pad

Apr 23 '21 13:04 ivan-pi

I use this often to left-pad integers with zeros when enumerating filenames, but using the edit descriptors in format.

Do you think a variant of padl and padr that accepts integers would be useful in addition to the ones you proposed?

res = padl(23, 4) ! returns "0023"

Compare it to:

res = padl(to_string(23), 4) ! returns "0023"

(Edit: fixed the snippet above)

I like the proposed names. Rarely there's an opportunity to introduce such short but clear names.

Apr 23 '21 16:04 milancurcic

I use this often to left-pad integers with zeros when enumerating filenames, but using the edit descriptors in format.

That's my use case too. Currently I resort to the pattern:

integer :: istep,unit
character(len=32) :: step
write(step,'(I0.8)') istep
open(newunit=unit,file="output"//trim(step)//".txt")

Note your second call should be res = padl(to_string(23),4).

Concerning the name, I thought that padl/padr is more in line with the adjustl/adjustr Fortran intrinsic functions than the prefixed versions like Julia has.

I'm not yet sure if making zero the default padding symbol is the right thing to do. But I would certainly support introducing an integer to zero-padded string function. I use this in practically every code. Maybe borrowing the Python zfill name specifically for this purpose?

function zfill(x,w) result(res)
  integer, intent(in) :: x
  integer, intent(in) :: width
  character(len=:), allocatable :: res
end function

integer :: unit
open(newunit=unit,"output"//zfill(23,4)//".txt)    ! opens "output0023.txt"

Apr 23 '21 16:04 ivan-pi

Indeed, with the second snippet corrected, the difference is less drastic.

I usually like a separate function with a specific purpose (e.g. zfill), although the same thing can be accomplished with padl with not that much more verbosity. So now I'm split between the two approaches.

Apr 23 '21 16:04 milancurcic

Maybe instead of leaving the concatenation to the consumer, we can make an even more specific function like

function filename(name,num,extension) result(res)
character(len=*), intent(in) :: name
integer, intent(in) :: num, width
character(len=*), intent(in) :: extension
character(len=:), allocatable :: res
end function

where we would allow some primitive type of integer formatting (e.g. using something like {:08d} in Python). But at this point, I wonder why not just go back to the string buffer

character(len=14) :: filename
write(filename,'(A,I0.8,A)') "myfile", 23, ".txt"

So I think a general zero-padding function will give greater flexibility.

Irrespective of the default padding symbol, if we overload pad to work for integers, you could use padl(23,4,'0') which is pretty short.

My preferences:

If the default padding character is '0', we also overload it to work for integers, meaning padl(23,4) is equivalent to padl(23,4,'0').
If the default padding character is ' ', we introduce a separate function (called zfill,zpad,padz, enumz,zstr or something similar) specific to the file enumeration scenario.

So now we need some arguments to support/reject the blank symbol.

cc @LKedward @awvwgk

Apr 23 '21 18:04 ivan-pi

Maybe instead of leaving the concatenation to the consumer, we can make an even more specific function like

I'm using something similar for unit testing round-tripping in an IO library of mine where I need some temporary file names rather than scratch units in this particular context: https://github.com/grimme-lab/mctc-lib/blob/5603d588ac12fd20b6643527d7cf68fe8c05e070/test/test_read.f90#L420-L429 https://github.com/grimme-lab/mctc-lib/blob/5603d588ac12fd20b6643527d7cf68fe8c05e070/test/test_write.f90#L218-L227

Writing a small helper is especially easy in this scenario, because you are padding the integer printout to a fixed length and therefore know exactly the required size of the buffer.

As a first start, we could wrap internal file IO as function similar to to_string but allow the user to specify the format, which gives at least access to all intrinsic formatters:

character(len=:), allocatable :: str
str = to_string(42)  ! str == "42"
str = to_string(42, '(i0.4)')  ! str == "0042"
str = to_string(42, '(z0.4)')  ! str == "002A"

Using to_string(42, '(i0)') will be slower than to_string(42) because of the internal file IO in such an approach. We might have to parse the formatter as well to avoid overflows for to_string(42, '(i0.1000)') or similar.

Apr 23 '21 19:04 awvwgk

lenset, stretch, and atleast in M_strings, which duplicate some functionality for historical purposes are used for slightly different use cases, as shown in the examples in the man-pages in the link above.

There are a couple of other routines like uniq(3f) in M_io at the same github site that are related to adding a numeric suffix to a name, although I usually just use an internal WRITE() to add a numeric suffix in a loop when I want to overwrite like

 
CHARACTER(LEN=256) :: FILENAME
INTEGER :: I
I=12
WRITE(FILENAME,'("PREFIX.",i0.3)')i
WRITE(*,*)'OPEN FILE ',trim(FILENAME)
I=1000
WRITE(FILENAME,'("PREFIX.",i0.3)')i
WRITE(*,*)'OPEN FILE ',trim(FILENAME)
end

shows, with newer features like "i0" lets you get leading zeros but still get a good name if you go over the expected range of numbers.

Just some Fortran-based prior art with some variants from what was discussed that were/are needed at various times (put them all together with the intent of doing something like stdlib, but hoping efforts like this will do the work for me!). where the internal write with a format gives nice leading zeros

Apr 24 '21 01:04 urbanjost

I made a small demonstration of padding without the optional padding symbol.

module pad_mod
  implicit none
contains
  function padl(str,width) result(res)
    character(len=*), intent(in) :: str
    integer, intent(in) :: width
    character(len=max(len_trim(str),width)) :: res
    res = str
    res = adjustr(res)
  end function
  function padr(str,width) result(res)
    character(len=*), intent(in) :: str
    integer, intent(in) :: width
    character(len=max(len_trim(str),width)) :: res
    res = str
  end function
end module

program test_pad
  use pad_mod
  implicit none
  print *, "Hello"
  print *, padl("Hello",10)
  print *, padr("Hello",12), len(padr("Hello",12))
end program

Output of the program:

$ gfortran pad_mod.f90
$ ./a.out
 Hello
      Hello
 Hello                 12

I'm not really sure what is the purpose of padr with blanks. It is effectively the opposite of trim.

Apr 26 '21 15:04 ivan-pi

Depending on what our goals are, we could simply introduce pad (left-pad) as the opposite of trim. Just throwing it out on the table.

Right padding would be achieved with adjustl(pad(str,width)) (symmetric with respect to how left-trimming can be currently performed using trim(adjustl(str))).

Apr 26 '21 16:04 ivan-pi

I also found one previous Fortran implementation of zfill here (GPL Licensed).

Apr 26 '21 16:04 ivan-pi

This is what was proposed in the proposal for GSoC. This was designed with an aim to provide as much functionalities as possible to a user. It might vary from current state of agreement but let me still put it here.

Signature: pad(string, width, pad_with(Optional) = “ ”) Output: a new string_type object

Function has 2 versions: padl and padr (like adjustl and adjustr) If the width required after the execution of the function is greater than len(string) then a new string_type object with one side (either left or right) padded with pad_with character sequence is returned, otherwise a deep copy of input string is returned. Note that the length of pad_with must be greater than 0 and can be greater than 1. If pad_with has length greater than 1 then there are 2 possibilities of padding: padding from left to right or padding from right to left. Take a look at the example to understand this better:

padl(“pad this”, width = 15, pad_with = “!@#”):

1). “!@#!@#!pad this” (padding from left to right) 2). “#!@#!@#pad this” (padding from right to left) Function will provide both the options to users. Note that pad is different from adjustr and adjustl also because adjust* provide an inherent trimming in them whereas pad function doesn’t.

I think it is better to keep size of pad_with to 1 at this moment. If we agree to truncate the output (if needed) so that it is strictly of the asked length, then IMO it shouldn't be named as pad.

Can we also divide trim into 2: triml and trimr?: but then it would become inconsistent with trim intrinsic ultimately we will have to rename the function. I think adjustl and adjustr make use of something like padr(triml( string ), len(string)) and padl(trimr( string ), len(string)) respectively where optional argument pad_with is ' ' (1 space).

Jun 19 '21 19:06 aman-godara

I think it is better to keep size of pad_with to 1 at this moment.

I agree. If we can reach agreement on the behavior of padding strings where len(pad_with) > 1 is true, we can always add them later in a backward compatible way.

If we agree to truncate the output (if needed) so that it is strictly of the asked length, then IMO it shouldn't be named as pad.

For fixed length strings truncation will happen upon assignment, while allocatable strings will be re-allocated to the result length. We cannot change these rules. But I don't see any issue with the case when len(pad_with) == 1 is true.

I think adjustl and adjustr make use of something like padr(triml( string ), len(string)) and padl(trimr( string ), len(string)) respectively where optional argument pad_with is ' ' (1 space).

I guess these are equal in practice, but at the compiler level the implementation for the intrinsic functions might work slightly differently. The big difference is of course that adjustl and adjustr are functions used when working with fixed-length (or already allocated) strings, and the result length is determined implicitly from the actual string size. They don't offer the possibility to modify the length of the resulting string (used as an expression or before assignment occurs).

Can we also divide trim into 2: triml and trimr?: but then it would become inconsistent with trim intrinsic ultimately we will have to rename the function.

Regarding trim, you can already achieve trimming on both sides:

trim(string): right trim
trim(adjustl(string)): left trim

One way to condense the interfaces would be to extend trim with an optional argument:

function trim(string,left) result(res)
  character(len=*), intent(in) :: string
  logical, intent(in), optional :: left ! .false. by default
  integer, parameter :: reslen = len_trim(adjustl(string))
  character(len=reslen) :: res

Then we would would have:

trim(string): right trim
trim(string,left=.true.): left trim

which actually takes three more characters than the currently available method and also does not read sequentially compared to what really happens. If you leave out the keyword like trim(string,.true.) to make it shorter it is just confusing for the person reading the code. Separate functions (i.e. triml/trimr) are a better solution then. For the time being I would just avoid trim until someone opens a dedicated issue about it.

Jun 21 '21 11:06 ivan-pi

I wasn't able to explain myself in the best possible manner, excuse me for that.

By truncate I was referring to the case when a user gives something like this:

padl("    This string is of length 34   ", width = 20)

Then the output " This string is of length 34 " (length 34) is better than the output " This string is o" (truncated output of asked length 20) for pad function. If one decides to return latter then it shouldn't be named as pad IMO.

If we run the padl function as implemented in this comment on this same input, output will be " This string is of length 34" (length 30) which is also performing a truncation but without any loss of information.

Jun 21 '21 13:06 aman-godara

Yes, I believe we should not do a "hard truncation", but rather follow the behavior in other languages:

Julia:

Stringify s and pad the resulting string on the left with p to make it n characters (code points) long. If s is already n characters long, an equal string is returned. Pad with spaces by default."
MATLAB

pad(str,numberOfCharacters) adds space characters so the strings in newStr have the length specified by numberOfCharacters. If any strings in str have more characters than numberOfCharacters, then pad does not modify them.

Jun 21 '21 13:06 ivan-pi

@milancurcic, now that #441 added padl should we still pursue a specific function for zero-filling integers or should we settle on a general format function like #444 that requires parsing a format string?

Using the currently available tools a string like "000123" could be created using:

character(len=6) :: zstr
zstr = padl(to_string(123),6,'0')

compared with the intrinsic internal I/O

write(zstr,'(I0.6)') 123

Of course the major difference is the stdlib functions can be inlined, e.g. unit = open("my_file"//padl(...)), freeing the programmer from the need to declare a character buffer.

Personally, I'd still be interested in having a zfill function like Python does.

Jul 23 '21 10:07 ivan-pi

_****_

[ ] #

Jul 05 '22 22:07 thaila027

stdlib stdlib copied to clipboard

Function to pad string with zeros

stdlib
stdlib copied to clipboard