Examples show boolean vectors, what about string vectors?
Hi, I was wondering how to use this library for comparing two different Strings that are tokenized into a string vector each. The examples only show boolean vectors which are just "post-transformation". As a newbie and to make great use of the library, it would be great to have the transformation part covered in the examples.
Excellent point. It would help to have examples converting strings and documents to vectors using something like word2vec but more lightweight.
Something like this but with a fully coded example:
https://link.medium.com/xF9m59LMr1
I use this approach:
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.stream.IntStream;
import info.debatty.java.lsh.MinHash;
/**
* Hello world!
*/
public class App {
//https://github.com/tdebatty/java-LSH
public static void main(String[] args) {
int signature_size = 128;
int dictionary_size = 256;
int initial_seed = 1234567890;
MinHash mh = new MinHash(signature_size, dictionary_size, initial_seed);
String pass1 = "Nikita";
String pass2 = "Nikika";
String pass3 = "Nikika1995strong";
String pass4 = "qweertyuiop";
var sig1 = mh.signature(fromString(pass1));
var sig2 = mh.signature(fromString(pass2));
var sig3 = mh.signature(fromString(pass3));
var sig4 = mh.signature(fromString(pass4));
System.out.println("sig1-sig2 : " + mh.similarity(sig1, sig2));
System.out.println("sig1-sig3 : " + mh.similarity(sig1, sig3));
System.out.println("sig1-sig4 : " + mh.similarity(sig1, sig4));
println(sig1);
println(sig2);
println(sig3);
println(sig4);
}
public static boolean[] fromString(String str) {
var bytes = str.getBytes();
var result = new boolean[256];
int index = 0;
for (int i = 0; i < str.length(); i++, index += 8) {
var bits = byteToBoolArr(bytes[i]);
System.arraycopy(bits, 0, result, index, bits.length);
}
return result;
}
public static boolean[] byteToBoolArr(byte b) {
var boolArr = new boolean[8];
for (int i = 0; i < 8; i++) {
boolArr[i] = (b & (byte) (128 / Math.pow(2, i))) != 0;
}
return boolArr;
}
static void println(final int[] array) {
System.out.println(convertToBase64(array));
}
public static String convertToBase64(int[] ints) {
ByteBuffer buf = ByteBuffer.allocate(ints.length);
IntStream.of(ints).forEach(i -> buf.put((byte)i));
return Base64.getEncoder().encodeToString(buf.array());
}
}
This is a very interesting method and I would like to use it in my project. At the moment I am using ngrams to create boolean vectors, perhaps this works better and faster.
I do have a few questions/observations:
- You copy 8 bits into the result array for every character in the string, this means the strings have a maximum length?
- I assume the signature size and the number 128 in byteToBoolArr() are the same variable?
- When does it make sense to change the signature size?
Cheers
PS. I slightly changed your code to make more use of (Java) constants:
protected static final int DIMENSIONS = 512;
protected static final int SIGNATURE_SIZE = 128;
private boolean[] byteToBoolArr(byte b) {
boolean[] boolArr = new boolean[Byte.SIZE];
for (int i = 0; i < Byte.SIZE; i++) {
boolArr[i] = (b & (byte) (SIGNATURE_SIZE / Math.pow(2, i))) != 0;
}
return boolArr;
}
private boolean[] fromString(String token) {
if(token.length() / Byte.SIZE > DIMENSIONS) {
throw new IllegalArgumentException(
"Input string too large to be encoded. " +
"Would need at least " + (token.length()*Byte.SIZE) + " dimensions.");
}
byte[] bytes = token.getBytes();
boolean[] result = new boolean[DIMENSIONS];
for (int c_i = 0, b_i = 0; c_i < bytes.length; c_i++, b_i += Byte.SIZE) {
System.arraycopy(byteToBoolArr(bytes[c_i]), 0, result, b_i, Byte.SIZE);
}
return result;
}