
csv input data - sorting impossible with not used columns in mapping

Open peterborkuti opened this issue 5 years ago • 2 comments

Dear Matt,

When I use a CSV file as input for a unit test and the file contains two columns (for example "id" and "a"), but I use only one of them in the mapping (for example "a") and choose the other ("id") for sorting, an exception occurs:

2019/02/28 15:07:40 - Spoon - Caused by: org.pentaho.di.core.exception.KettleException: 
2019/02/28 15:07:40 - Spoon - Unable to get all rows for database data set 'addnumbers as text'
2019/02/28 15:07:40 - Spoon - -1
2019/02/28 15:07:40 - Spoon - 
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.dataset.DataSetCsvGroup.getAllRows(DataSetCsvGroup.java:226)
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.dataset.DataSetGroup.getAllRows(DataSetGroup.java:133)
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.dataset.DataSet.getAllRows(DataSet.java:140)
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.dataset.spoon.xtpoint.InjectDataSetIntoTransExtensionPoint.injectDataSetIntoStep(InjectDataSetIntoTransExtensionPoint.java:198)
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.dataset.spoon.xtpoint.InjectDataSetIntoTransExtensionPoint.callExtensionPoint(InjectDataSetIntoTransExtensionPoint.java:126)
2019/02/28 15:07:40 - Spoon - 	... 8 more
2019/02/28 15:07:40 - Spoon - Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.core.row.RowMeta.compare(RowMeta.java:915)
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.dataset.DataSetCsvGroup$1.compare(DataSetCsvGroup.java:214)
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.dataset.DataSetCsvGroup$1.compare(DataSetCsvGroup.java:211)
2019/02/28 15:07:40 - Spoon - 	at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
2019/02/28 15:07:40 - Spoon - 	at java.util.TimSort.sort(TimSort.java:220)
2019/02/28 15:07:40 - Spoon - 	at java.util.Arrays.sort(Arrays.java:1512)
2019/02/28 15:07:40 - Spoon - 	at java.util.ArrayList.sort(ArrayList.java:1462)
2019/02/28 15:07:40 - Spoon - 	at java.util.Collections.sort(Collections.java:175)
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.dataset.DataSetCsvGroup.getAllRows(DataSetCsvGroup.java:211)
2019/02/28 15:07:40 - Spoon - 	... 12 more

I debugged it, and I think this is the spot in the code (DataSetCsvGroup.java, from line 200):

      // Which fields are we sorting on (if any)
      //
      int[] sortIndexes = new int[ sortFields.size() ];
      for ( int i = 0; i < sortIndexes.length; i++ ) {
        sortIndexes[ i ] = outputRowMeta.indexOfValue( sortFields.get( i ) );
      }

      if ( !sortFields.isEmpty() ) {

        // Sort the rows...
        //
        Collections.sort( rows, new Comparator<Object[]>() {
          @Override public int compare( Object[] o1, Object[] o2 ) {
            try {
              return outputRowMeta.compare( o1, o2, sortIndexes );
            } catch ( KettleValueException e ) {
              throw new RuntimeException( "Unable to compare 2 rows", e );
            }
          }
        } );
      }

sortIndexes will not be empty, but sortIndexes[0] will be -1 because the sort field "id" is not part of outputRowMeta, and this causes an ArrayIndexOutOfBoundsException in outputRowMeta.compare.
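To illustrate the failure mode without the Kettle classes, here is a minimal sketch of a guard that could run before sorting. It assumes plain `List<String>` field names as a stand-in for `outputRowMeta`; the class `SortIndexCheck` and the method `resolveSortIndexes` are hypothetical names, not part of the plugin. It only turns the -1 into a readable error; it does not make sorting on unmapped columns work.

```java
import java.util.ArrayList;
import java.util.List;

class SortIndexCheck {

  // Hypothetical helper: resolves sort field names to column positions.
  // List.indexOf returns -1 for a missing name, which mirrors the -1
  // observed in sortIndexes[0] above.
  static int[] resolveSortIndexes(List<String> outputFields, List<String> sortFields) {
    int[] sortIndexes = new int[sortFields.size()];
    List<String> missing = new ArrayList<>();
    for (int i = 0; i < sortIndexes.length; i++) {
      sortIndexes[i] = outputFields.indexOf(sortFields.get(i));
      if (sortIndexes[i] < 0) {
        missing.add(sortFields.get(i));
      }
    }
    // Fail early with a clear message instead of letting the comparator
    // index an array with -1.
    if (!missing.isEmpty()) {
      throw new IllegalArgumentException(
          "Sort fields not present in the mapping: " + missing);
    }
    return sortIndexes;
  }
}
```

With output fields ["a"] and sort fields ["id"], this sketch throws an IllegalArgumentException that names "id" rather than failing later inside the comparator.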

You may ask why I want to sort the CSV file on a field that is not in the mapping, but it seemed like a normal use case to me. For example, I wanted to test a transformation that adds two numbers together:

id a b c
1 0 0 0
2 1 0 1

The input mapping would be the columns "a" and "b", sorted by "id". The golden mapping would be the columns "a", "b" and "c", sorted by "id".

I put all the files to reproduce this here: https://github.com/peterborkuti/pentaho-pdi-dataset-bug-01

Thank you for your wonderful plugin,
Péter

peterborkuti avatar Feb 28 '19 15:02 peterborkuti

Hi Péter,

Thank you very much for the use case. It's true that I hadn't considered it yet. I think we'll need to do something novel here, like adding the sort columns temporarily, sorting, and then removing them again, just to make sure the columns don't end up in the test transformation.

Cheers,
Matt

mattcasters avatar Jun 20 '19 10:06 mattcasters
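The idea of keeping the sort columns only until sorting is done could be sketched roughly as follows. This is not the plugin's implementation: it assumes rows are plain `Object[]` arrays holding every CSV column, that values are non-null and mutually comparable, and the names `SortThenProject`, `sortThenProject` and `indexes` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SortThenProject {

  // Resolves field names to positions in the full CSV row layout.
  static int[] indexes(List<String> allFields, List<String> names) {
    int[] result = new int[names.size()];
    for (int i = 0; i < result.length; i++) {
      result[i] = allFields.indexOf(names.get(i));
      if (result[i] < 0) {
        throw new IllegalArgumentException("Unknown field: " + names.get(i));
      }
    }
    return result;
  }

  // Sorts on the full rows (so unmapped sort columns are still available),
  // then keeps only the mapped columns in the returned rows.
  static List<Object[]> sortThenProject(List<String> allFields, List<Object[]> rows,
      List<String> sortFields, List<String> mappedFields) {
    int[] sortIdx = indexes(allFields, sortFields);
    int[] keepIdx = indexes(allFields, mappedFields);

    List<Object[]> sorted = new ArrayList<>(rows);
    sorted.sort((r1, r2) -> {
      for (int idx : sortIdx) {
        // Assumption: values are non-null and implement Comparable.
        @SuppressWarnings("unchecked")
        int c = ((Comparable<Object>) r1[idx]).compareTo(r2[idx]);
        if (c != 0) {
          return c;
        }
      }
      return 0;
    });

    // Drop the temporary sort columns by projecting onto the mapping.
    List<Object[]> projected = new ArrayList<>();
    for (Object[] row : sorted) {
      Object[] out = new Object[keepIdx.length];
      for (int i = 0; i < keepIdx.length; i++) {
        out[i] = row[keepIdx[i]];
      }
      projected.add(out);
    }
    return projected;
  }
}
```

For the "id a b c" example from the report, sorting on "id" and mapping "a" and "b" would return rows that contain only the "a" and "b" values, ordered by the "id" column that is no longer present.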

I noticed that there is a similar problem at https://github.com/mattcasters/pentaho-pdi-dataset. Perhaps we can refer to this issue to find more context about the bug.

JenniferJohnson89 avatar Aug 26 '20 11:08 JenniferJohnson89