stdlib Unique values in a 1-D array and their locations

Unique values in a 1-D array and their locations

Open Beliavsky opened this issue 1 year ago • 2 comments

Given a 1-D array of character variables, integers, or a user-defined type such as dates, if there are repeated values one may want to create factors (the R term) that are integer variables corresponding to the unique values. Below is a small code that does this inefficiently for an array of character variables. A library subroutine that works for any data type for which an equality operator is defined would be useful.

module factors_mod
implicit none
private
public :: compress
contains
pure subroutine compress(words,factors,values)
character (len=*)         , intent(in)               :: words(:)   ! (n) 
integer                   , intent(out)              :: factors(:) ! (n) integer values corresponding to words(:)
character (len=len(words)), allocatable, intent(out) :: values(:)  ! unique values of words(:)
integer :: i,n,nfac,imatch
n = size(words)
if (size(factors) /= n) error stop "in compress, need size(words) == size(factors)"
if (n < 1) then
   allocate (values(0))
   return
end if
allocate (values(n))
values(1) = words(1)
factors(1) = 1
nfac = 1
do i=2,n
   imatch = findloc(values(:nfac),words(i),dim=1)
   if (imatch == 0) then
      nfac = nfac + 1
      factors(i) = nfac
      values(nfac) = words(i)
   else
      factors(i) = imatch
   end if
end do
values = values(:nfac)
end subroutine compress
end module factors_mod

program xfactors
use factors_mod, only: compress
implicit none
integer, parameter :: n = 5, nlen = 1
character (len=nlen) :: words(n)
character (len=nlen), allocatable :: values(:)
integer :: factors(n)
words = ["a","c","a","b","c"]
call compress(words,factors,values)
print "(a,*(1x,a))", "data:",words
print "(a,*(1x,i0))", "factors:",factors
print "(a,*(1x,a))", "values:",values
end program xfactors
! output:
! data: a c a b c
! factors: 1 2 1 3 2
! values: a c b

Jul 30 '22 14:07 Beliavsky

I wonder why you say that this code is inefficient. I tried to think of different implementations, but the code as written seems to be O(N**2), with sorted data you might get down to O(N*ln(N)), but that would presume the data can be sorted (in that case, you would need more functionality than just an equality operation). Just wondering if you have a particular algorithm in mind.

Op za 30 jul. 2022 om 16:57 schreef Beliavsky @.***>:

Given a 1-D array of character variables, integers, or a user-defined type such as dates, if there are repeated values one may want to create factors https://www.stat.berkeley.edu/~spector/s133/factors.html (the R term) that are integer variables corresponding to the unique values. Below is a small code that does this inefficiently for an array of characters. A library subroutine that works for any data type for which an equality operator is defined would be useful.

module factors_mod implicit none private public :: compress contains pure subroutine compress(words,factors,values) character (len=*) , intent(in) :: words(:) ! (n) integer , intent(out) :: factors(:) ! (n) integer values corresponding to words(:) character (len=len(words)), allocatable, intent(out) :: values(:) ! unique values of words(:) integer :: i,n,nfac,imatch n = size(words) if (size(factors) /= n) error stop "in compress, need size(words) == size(factors)" if (n < 1) then allocate (values(0)) return end if allocate (values(n)) values(1) = words(1) factors(1) = 1 nfac = 1 do i=2,n imatch = findloc(values(:nfac),words(i),dim=1) if (imatch == 0) then nfac = nfac + 1 factors(i) = nfac values(nfac) = words(i) else factors(i) = imatch end if end do values = values(:nfac) end subroutine compress end module factors_mod

program xfactors use factors_mod, only: compress implicit none integer, parameter :: n = 5, nlen = 1 character (len=nlen) :: words(n) character (len=nlen), allocatable :: values(:) integer :: factors(n) words = ["a","c","a","b","c"] call compress(words,factors,values) print "(a,(1x,a))", "data:",words print "(a,(1x,i0))", "factors:",factors print "(a,*(1x,a))", "values:",values end program xfactors ! output: ! data: a c a b c ! factors: 1 2 1 3 2 ! values: a c b

— Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/670, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YRY7BRKCUS5X7S4HYZLVWU7GTANCNFSM55DLAWRQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

Aug 04 '22 12:08 arjenmarkus

Thank you @Beliavsky for this proposition.

A library subroutine that works for any data type for which an equality operator is defined would be useful.

This can be easiliy done with fypp (see for example stdlib_sorting_sort.fypp. If you submit a PR with a first draft of this procedure, we can help you with fypp if needed.

My current main questions concern the module (name) in which such a procedure should go, and the API of this procedure

Aug 04 '22 20:08 jvdp1

stdlib stdlib copied to clipboard

Unique values in a 1-D array and their locations

stdlib
stdlib copied to clipboard