java-LSH

Examples show boolean vectors, what about string vectors?

Open stenpiren opened this issue 6 years ago • 4 comments

Hi, I was wondering how to use this library to compare two different Strings that are each tokenized into a string vector. The examples only show boolean vectors, which are already "post-transformation". As a newbie who wants to make good use of the library, I would find it very helpful to have the transformation step covered in the examples.

stenpiren avatar Oct 24 '19 11:10 stenpiren

Excellent point. It would help to have examples converting strings and documents to vectors using something like word2vec but more lightweight.

Something like this but with a fully coded example:

https://link.medium.com/xF9m59LMr1

linusjf avatar Nov 10 '19 14:11 linusjf
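
One possible way to do the transformation asked about above is a plain vocabulary lookup: give every distinct token a fixed index, then set that index to true in each document's vector. The sketch below is an assumption about what such an example could look like, not code from the library; the class `TokenVectors` and all of its methods are hypothetical names, and the whitespace tokenizer is only a placeholder.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of java-LSH.
public class TokenVectors {

    /** Placeholder tokenizer: lower-cases the text and splits on whitespace. */
    public static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().split("\\s+"));
    }

    /** Assigns each distinct token a fixed index, in first-seen order. */
    public static Map<String, Integer> buildVocabulary(List<List<String>> documents) {
        Map<String, Integer> vocabulary = new LinkedHashMap<>();
        for (List<String> tokens : documents) {
            for (String token : tokens) {
                vocabulary.putIfAbsent(token, vocabulary.size());
            }
        }
        return vocabulary;
    }

    /** Sets vector[i] to true when the token with index i occurs in the document. */
    public static boolean[] toBooleanVector(List<String> tokens, Map<String, Integer> vocabulary) {
        boolean[] vector = new boolean[vocabulary.size()];
        for (String token : tokens) {
            Integer index = vocabulary.get(token);
            if (index != null) { // tokens outside the vocabulary are ignored
                vector[index] = true;
            }
        }
        return vector;
    }

    /** Exact Jaccard similarity of two equal-length boolean vectors. */
    public static double jaccard(boolean[] a, boolean[] b) {
        int intersection = 0;
        int union = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] && b[i]) intersection++;
            if (a[i] || b[i]) union++;
        }
        return union == 0 ? 0.0 : (double) intersection / union;
    }

    public static void main(String[] args) {
        List<String> doc1 = tokenize("the quick brown fox");
        List<String> doc2 = tokenize("the quick red fox");
        Map<String, Integer> vocabulary = buildVocabulary(List.of(doc1, doc2));

        boolean[] v1 = toBooleanVector(doc1, vocabulary);
        boolean[] v2 = toBooleanVector(doc2, vocabulary);

        System.out.println(vocabulary.size()); // 5
        System.out.println(jaccard(v1, v2));   // 0.6
    }
}
```

Vectors built this way should be usable as input to `MinHash.signature(boolean[])`, provided the dictionary size passed to the `MinHash` constructor equals `vocabulary.size()`.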

I use this approach:


import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.stream.IntStream;

import info.debatty.java.lsh.MinHash;

/**
 * Compares short strings with MinHash after encoding their bytes
 * as fixed-size boolean vectors.
 */
public class App {

    //https://github.com/tdebatty/java-LSH
    public static void main(String[] args) {
        int signature_size = 128;
        int dictionary_size = 256;
        int initial_seed = 1234567890;

        MinHash mh = new MinHash(signature_size, dictionary_size, initial_seed);
        String pass1 = "Nikita";
        String pass2 = "Nikika";
        String pass3 = "Nikika1995strong";
        String pass4 = "qweertyuiop";

        var sig1 = mh.signature(fromString(pass1));
        var sig2 = mh.signature(fromString(pass2));
        var sig3 = mh.signature(fromString(pass3));
        var sig4 = mh.signature(fromString(pass4));

        System.out.println("sig1-sig2 : " + mh.similarity(sig1, sig2));
        System.out.println("sig1-sig3 : " + mh.similarity(sig1, sig3));
        System.out.println("sig1-sig4 : " + mh.similarity(sig1, sig4));

        println(sig1);
        println(sig2);
        println(sig3);
        println(sig4);
    }

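    /**
     * Encodes the UTF-8 bytes of a string as a boolean vector, 8 bits per
     * byte. The vector has 256 elements, so at most 32 bytes fit.
     */
    public static boolean[] fromString(String str) {
        var bytes = str.getBytes(java.nio.charset.StandardCharsets.UTF_8);
        var result = new boolean[256];
        if (bytes.length * 8 > result.length) {
            throw new IllegalArgumentException(
                    "Input needs " + (bytes.length * 8) + " bits, but the vector holds " + result.length);
        }
        int index = 0;
        // Iterate over bytes, not characters: one character may occupy several bytes.
        for (int i = 0; i < bytes.length; i++, index += 8) {
            var bits = byteToBoolArr(bytes[i]);
            System.arraycopy(bits, 0, result, index, bits.length);
        }
        return result;
    }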

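    /** Returns the 8 bits of b, most significant bit first. */
    public static boolean[] byteToBoolArr(byte b) {
        var boolArr = new boolean[8];
        for (int i = 0; i < 8; i++) {
            // Test bit (7 - i) with an integer shift instead of floating-point Math.pow.
            boolArr[i] = ((b >> (7 - i)) & 1) != 0;
        }
        return boolArr;
    }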

    static void println(final int[] array) {
        System.out.println(convertToBase64(array));
    }

    /** Base64-encodes the lowest byte of each signature value (a compact but lossy display form). */
    public static String convertToBase64(int[] ints) {
        ByteBuffer buf = ByteBuffer.allocate(ints.length);
        IntStream.of(ints).forEach(i -> buf.put((byte)i));
        return Base64.getEncoder().encodeToString(buf.array());
    }
}

nanshakov avatar Mar 26 '20 19:03 nanshakov
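
In the code above, `128 / Math.pow(2, i)` produces the masks 128, 64, ..., 1, so each iteration tests one bit of the byte, starting with the most significant. The 128 is therefore a property of a byte and is separate from `signature_size`, which merely has the same value here. A 256-element vector holds 256 / 8 = 32 bytes. A minimal sketch of the same decomposition with integer shifts follows; the class `ByteBits` and its methods are hypothetical names used only for illustration.

```java
// Hypothetical helper class, not part of java-LSH.
public class ByteBits {

    /** Returns the 8 bits of b, most significant bit first. */
    public static boolean[] byteToBits(byte b) {
        boolean[] bits = new boolean[Byte.SIZE];
        for (int i = 0; i < Byte.SIZE; i++) {
            bits[i] = ((b >> (Byte.SIZE - 1 - i)) & 1) != 0;
        }
        return bits;
    }

    /** Renders the bits of b as a string of '0' and '1' characters. */
    public static String toBitString(byte b) {
        StringBuilder sb = new StringBuilder();
        for (boolean bit : byteToBits(b)) {
            sb.append(bit ? '1' : '0');
        }
        return sb.toString();
    }

    /** Largest number of bytes that fits into a vector with the given number of elements. */
    public static int maxBytes(int dimensions) {
        return dimensions / Byte.SIZE;
    }

    public static void main(String[] args) {
        System.out.println(toBitString((byte) 'N')); // 01001110 ('N' is 78)
        System.out.println(maxBytes(256));           // 32
    }
}
```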

This is a very interesting method and I would like to use it in my project. At the moment I am using n-grams to create boolean vectors; perhaps this approach works better and faster.

I do have a few questions/observations:

  1. You copy 8 bits into the result array for every character in the string. Does this mean the strings have a maximum length?
  2. I assume the signature size and the number 128 in byteToBoolArr() are the same variable?
  3. When does it make sense to change the signature size?

Cheers

PS. I slightly changed your code to make more use of (Java) constants:

protected static final int DIMENSIONS = 512;
protected static final int SIGNATURE_SIZE = 128;

private boolean[] byteToBoolArr(byte b) {
    boolean[] boolArr = new boolean[Byte.SIZE];
    for (int i = 0; i < Byte.SIZE; i++) {
        // The mask selects one bit of the byte, so it must not depend on SIGNATURE_SIZE.
        boolArr[i] = ((b >> (Byte.SIZE - 1 - i)) & 1) != 0;
    }
    return boolArr;
}

private boolean[] fromString(String token) {
    byte[] bytes = token.getBytes();
    // Each byte occupies Byte.SIZE slots, so compare the required bit count with DIMENSIONS.
    if (bytes.length * Byte.SIZE > DIMENSIONS) {
        throw new IllegalArgumentException(
                "Input string too large to be encoded. " +
                "Would need at least " + (bytes.length * Byte.SIZE) + " dimensions.");
    }
    boolean[] result = new boolean[DIMENSIONS];
    for (int c_i = 0, b_i = 0; c_i < bytes.length; c_i++, b_i += Byte.SIZE) {
        System.arraycopy(byteToBoolArr(bytes[c_i]), 0, result, b_i, Byte.SIZE);
    }
    return result;
}

Jurian avatar Jul 08 '22 12:07 Jurian
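
For comparison, the n-gram approach mentioned above can be sketched as follows. This is only one way to do it and is not taken from any code in this thread: each distinct character n-gram is hashed into a slot of a fixed-size vector, so the input has no length limit and a shared n-gram sets the same slot wherever it occurs in the string. The class `NgramVectors`, the bigram size and the 512 dimensions are assumptions chosen for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper class, not part of java-LSH.
public class NgramVectors {

    /** Returns the distinct character n-grams of text, in order of first occurrence. */
    public static Set<String> ngrams(String text, int n) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    /** Marks one slot per n-gram; distinct n-grams may collide in the same slot. */
    public static boolean[] toBooleanVector(String text, int n, int dimensions) {
        boolean[] vector = new boolean[dimensions];
        for (String gram : ngrams(text, n)) {
            // floorMod keeps the index non-negative even for negative hash codes.
            vector[Math.floorMod(gram.hashCode(), dimensions)] = true;
        }
        return vector;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("Nikita", 2)); // [Ni, ik, ki, it, ta]
        boolean[] vector = toBooleanVector("Nikita", 2, 512);
        System.out.println(vector.length);       // 512
    }
}
```

Hash collisions can merge two different n-grams into one slot, which slightly inflates similarity; a larger vector makes this less likely.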