quantmod OHLC returns wrong/duplicate columns

Description

The issue occurs when multiple columns in the data set match the OHLC-names. The OHLC function returns each matching column. In the test case the data set contains a column from Stochastic oscillator (TTR function) with the name "slowD". As a result OHLC returns; "AAPL.Open", "AAPL.High", "AAPL.Low", "slowD", "AAPL.Close" (from a data set with 20 columns).

Expected behavior

Expect the OHLC function to return the Open, High, Low and Close data, nothing else.

Minimal, reproducible example

aapl <- quantmod::getSymbols("AAPL", from = "2019-01-01", auto.assign = FALSE)
stoch <- TTR::stoch(aapl[, c("AAPL.Close", "AAPL.Low", "AAPL.Close")])
x <- merge(aapl, stoch)
colnames(quantmod::OHLC(x))

May 18 '20 12:05 nocodehummel

In the source the "Low" column is found by grep('Low',colnames(x),ignore.case=TRUE).

Possible solution(s) can be to first attempt more strict matches, alternative to always return the first result grep('Low',colnames(x),ignore.case=TRUE)[1].

May 18 '20 12:05 nocodehummel

This is similar to #24. Have you thought through the discussion there?

May 18 '20 15:05 joshuaulrich

Hi, sorry I only checked against the open issues. Your suggestion to implement with test coverage makes sense. Would it be okay to use testthat package (current test are not using it)? If you are open to implement the improvement, I can prepare some tests.

May 18 '20 18:05 nocodehummel

No worries! I only wanted you to be aware of the history, so you could take it into account while thinking about the issue.

It would be great if you could write some tests, especially if you can think of any other edge cases that aren't mentioned in the other issue.

You can write them in the testthat style and/or use testthat locally. But I would use tinytest instead, so don't spend time getting the package to build/check with testthat.

May 18 '20 18:05 joshuaulrich

In preparing test cases I do not get a positive result from has.Op using an object returned by getSymbols with column name changed to "Op". Given first three lines I would expect a TRUE (see below). In general attr(x, "some_attribute") returns NULL while attributes show the specific attribute. Btw is has.Op inconsistent with Op in this perspective? What is the expected behaviour?

`has.Op` <-
function(x,which=FALSE)
{
  colAttr <- attr(x, "Op")
  if(!is.null(colAttr))
    return(if(which) colAttr else TRUE)

  loc <- grep('Open',colnames(x),ignore.case=TRUE)
  if(!identical(loc,integer(0))) {
    return(if(which) loc else TRUE)
  } else FALSE
}

May 19 '20 09:05 nocodehummel

Based upon the test cases analysed from #24 and #305, the following solution should work.

`has.Op` <-
function(x,which=FALSE)
{
  cols <- colnames(x)
  n <- which(cols %in% c("Op", "Open"))
  
  if(identical(n,integer(0))) {
    n <- grep('Open$',cols, ignore.case = TRUE)
  }
  
  return(if(which) n else !identical(n,integer(0)))
}

And below solution for the related function (avoiding code duplication).

`Op` <-
function(x)
{
  col <- has.Op(x, which = TRUE)
  if(!identical(col, integer(0))) {
    return(x[, col])
  } else {
    stop('subscript out of bounds: open column not found')
  }
}

If you think this makes sense I can prepare a pull request tomorrow with the changes for OHLC functions and including tinytest cases.

May 19 '20 18:05 nocodehummel

Created pull request #306 to improve the column matching (with test cases).

May 25 '20 10:05 nocodehummel

Created pull request #306 to improve the column matching (with test cases).

I'm sorry I didn't respond earlier. I started working on the comments below on Saturday morning (which was still later than I should have). I'm going to leave these comments here, and I'll add the relevant comments to the PR.

Please keep in mind that I had most of the points below in mind before you created your PR. They're based on the code in your comments in this issue, so some points might not apply to your PR.

If you think this makes sense I can prepare a pull request tomorrow with the changes for OHLC functions and including tinytest cases.

I would appreciate a PR! Please take a look at the contributing guide before you get started.

Some notes to keep in mind:

Please do not remove the column attribute handling in the has.* functions. That functionality has been there since 2011. Removing it could break existing code.
n <- which(cols %in% c("Op", "Open")) will usually be integer(0). I don't feel good about that code being on the happy path when it's a special case. Not for performance reasons, but because it makes the function intent less clear.
It's extremely helpful to have tests before starting to refactor code. Hopefully, the code already has tests, but that wasn't true in this case. So I would have written tests on the existing code first, to make sure I didn't break things when I was refactoring.

May 26 '20 12:05 joshuaulrich

Thanks for your feedback. I will add the response to the PR #306.

May 27 '20 07:05 nocodehummel

I have run into this bug too....

Have you considered using a REGEX for your pattern matching in grep? You could return "T" for any instance of "low" UNLESS it is followed by a period. For instance:

grepl(
  "low($|[^.]).*", 
  c("LOW.High", "LOW.Low", "Low", "low", "MSFT.Low", "slow", "Lower."),
  ignore.case = T
)

[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

In the pattern low($|[^.]).*, low means match the letters l, o, w, in order. ($|[^.]) means then match the end of the string (the '$") or any character except a period (^ within [ ] means match anything other the characters following ^ within the [ ]). The . outside of [ ] means match any character at all and the * means zero of more times. So basically this tells the parser to "match 'low' followed by either the end of the string or as many other characters as you like, but just not a period.

I can think of one instance when this solution would fail: a ticker containing "LOW" followed by another letter. For instance, if there were a ticker "FLOWR", the bug would re-emerge.

The only other way I can see to solve this would be to match only columns that end in "low," which seems like a reasonable permanent solution. To do that, use "low$" as your grep pattern:

grepl(
  "low($|[^.]).*", 
  c("LOW.High", "LOW.Low", "Low", "low", "MSFT.Low", "slow", "Lower."),
  ignore.case = T
)

[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

Perhaps implement the first solution now and warn users that in a future version, the column names will need to end with the key search words.

Jun 09 '20 08:06 gcpoole

Thanks for your feedback. The proposed solution in the pull request is to catch exact matches first and secondly perform a regex with 'grep('Low$', colnames(x), ignore.case = TRUE)'. Similar to your second option. The first option fails for example on the symbol of this ETF.

The proposed solution works with how the columns are added in other quantmod functions, and by being more strict reduces the risk of interference with other packages/code. The downside is a risk to break code that people may have implemented based on the current matching.

Jun 10 '20 10:06 nocodehummel