
Identical records in a CSV file are not reported as a match

Open rahafshareef opened this issue 8 years ago • 2 comments

Dear all, please help me with the case below: I can't get Duke to match two identical records in a CSV file.

The CSV file contains two records:

    id,country,capital,area
    4202,"Malta","Valletta","320"
    4202,"Malta","Valletta","320"

Note: I have configured an XML file named "countries.xml":

<threshold>0.7</threshold>
<property type="id">
  <name>ID</name>
</property>

<property lookup="true">
  <name>NAME</name> 
  <comparator>no.priv.garshol.duke.comparators.QGramComparator</comparator>
  <low>0.09</low>
  <high>0.93</high>
</property>    
<property lookup="true">
  <name>AREA</name> 
  <comparator>AreaComparator</comparator>
  <low>0.04</low>
  <high>0.73</high>
</property>
<property lookup="true">
  <name>CAPITAL</name> 
  <comparator>no.priv.garshol.duke.comparators.QGramComparator</comparator>
  <low>0.12</low>
  <high>0.61</high>
</property>    
<csv>
  <param name="input-file" value="countries.csv"/>
  
  <column name="id" property="ID"/>
  <column name="country"
          property="NAME"
          cleaner="no.priv.garshol.duke.examples.CountryNameCleaner"/>
  <column name="capital"
          property="CAPITAL"
          cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
  <column name="area"
          property="AREA"/>
</csv>

When I call it from Java code:

public static void main(String[] args) throws Exception {
    // TODO code application logic here
    Configuration config = ConfigLoader.load("countries.xml");

    Processor proc = new Processor(config);
    proc.addMatchListener(new PrintMatchListener(true, true, true, false,
            config.getProperties(),
            true));

    proc.deduplicate();
    proc.close();
}

The result is:

    Total records: 2
    Total matches: 0
    Total non-matches: 2

rahafshareef, Dec 16 '16 22:12

The problem is that the two IDs are the same, so when Duke compares the two records against one another, it thinks it's comparing a record with itself and suppresses the match. If it didn't do this, Duke would report every record as a duplicate of itself.
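To make that behaviour concrete, here is a minimal sketch in plain Java. It does not use Duke and is not Duke's actual implementation: exact equality of the non-ID columns stands in for Duke's comparators and threshold, and the class name, method names, and the ID 4203 are made up for illustration. It only shows why two rows that share an ID never get counted as a match, while the same rows with distinct IDs do.

```java
import java.util.Arrays;

public class SelfMatchSketch {
    // Column 0 is the ID; the remaining columns are the values being compared.
    // Exact equality is a stand-in for Duke's probabilistic comparison.
    static boolean sameValues(String[] a, String[] b) {
        return Arrays.equals(
                Arrays.copyOfRange(a, 1, a.length),
                Arrays.copyOfRange(b, 1, b.length));
    }

    // Counts matching pairs, skipping any pair whose IDs are equal,
    // because such a pair is assumed to be one record compared with itself.
    static int countMatches(String[][] rows) {
        int matches = 0;
        for (int i = 0; i < rows.length; i++) {
            for (int j = i + 1; j < rows.length; j++) {
                if (rows[i][0].equals(rows[j][0])) {
                    continue; // same ID: treated as the same record
                }
                if (sameValues(rows[i], rows[j])) {
                    matches++;
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // The two rows from the question: identical IDs.
        String[][] sameIds = {
            {"4202", "Malta", "Valletta", "320"},
            {"4202", "Malta", "Valletta", "320"}
        };
        // Same data, but the second row has a hypothetical distinct ID.
        String[][] distinctIds = {
            {"4202", "Malta", "Valletta", "320"},
            {"4203", "Malta", "Valletta", "320"}
        };
        System.out.println(countMatches(sameIds));     // prints 0
        System.out.println(countMatches(distinctIds)); // prints 1
    }
}
```

In other words, if the two rows really are separate records, giving each one its own value in the id column should let them be compared with each other.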

larsga, Dec 16 '16 22:12

Thank you so much for your kind support.

I would also like to ask which languages Duke supports. In other words, can Duke process Arabic text?

rahafshareef, Dec 18 '16 19:12