bioformats icon indicating copy to clipboard operation
bioformats copied to clipboard

Metamorph reader reads .nd files using wrong encoding

Open anntzer opened this issue 2 years ago • 3 comments

Currently, the metamorph nd reader reads nd files as utf-8 (essentially, in MetamorphReader.java, initFile() calls DataTools.readFile(ndFileName); readFile is implemented in ome-common-java/.../DataTools.java and uses Constants.ENCODING = "UTF-8" from ome-common-java/.../Constants.java.

But generating MDA acquisitions on Metamorph (v7.10.5.476 on Windows) with non-ascii stage position names shows that that encoding is incorrect: I believe that the correct encoding is actually Windows-1252, though I cannot find documentation that explicitly states that. (In particular, € is encoded at 0x80, which distinguishes the encoding from latin-1, and trying to input some non-Windows-1252 characters in position names results in these being replaced by question marks.) In practice, such characters currently end up being replaced in bioformats by a Unicode replacement character (U+FFFD �).

I would therefore suggest fixing the encoding, e.g. by providing a DataTools.readFile(String id, Charset charset) overload and using readFile(ndFilename, "windows-1252") in MetamorphReader.java.

anntzer avatar Jun 15 '23 15:06 anntzer

Hi @anntzer, thank you for raising the issue and investigating the problem. Making DataTools.readFile more flexible to also allow for the charset to be input seems like a good solution for this, it would allow for the MetaMorph and in future other readers to have more control of the reading. Do you think you would be able to open a Pull Request with that change? If you need any help or pointers I am happy to assist.

Also do you have a sample file you could share with the Windows-1252 encoding that we could use for testing? If you need a suitable upload location for a sample file, we recommend using https://zenodo.org/

dgault avatar Jun 16 '23 15:06 dgault

Here is a test nd file (zipped because github allows zip uploads but not nd uploads...) test.nd.zip The Stage1 position name should read "µ€œ?" (actually "µ€œẞ" but Metamorph immediately replaces the non-Windows1252 "ẞ" by at "?" at the GUI-level.)

I am posting a tentative patch below, but having no experience with java or in particular with java build tools I don't think it would be particularly efficient for me to actually post a PR (especially given that this looks like it'll need two PRs, one to ome-common-java and one to bioformats).

to ome-common-java

diff --git i/src/main/java/loci/common/DataTools.java w/src/main/java/loci/common/DataTools.java
index a755d37..e8fae24 100644
--- i/src/main/java/loci/common/DataTools.java
+++ w/src/main/java/loci/common/DataTools.java
@@ -74,11 +74,12 @@ public final class DataTools {
    * @param id name of the file to read
    *           this can be any name supported by Location,
    *           not necessarily a file on disk
+   * @param charsetName the file encoding
    * @return the complete contents of the specified file
    * @throws IOException if the file cannot be read or is larger than 2GB
    * @see Location#getMappedId(String)
    */
-  public static String readFile(String id) throws IOException {
+  public static String readFile(String id, String charsetName) throws IOException {
     RandomAccessInputStream in = new RandomAccessInputStream(id);
     long idLen = in.length();
     if (idLen > Integer.MAX_VALUE) {
@@ -88,11 +89,25 @@ public final class DataTools {
     byte[] b = new byte[len];
     in.readFully(b);
     in.close();
-    String data = new String(b, Constants.ENCODING);
+    String data = new String(b, charsetName);
     b = null;
     return data;
   }

+  /**
+   * Reads the contents of the given file into a string.
+   *
+   * @param id name of the file to read
+   *           this can be any name supported by Location,
+   *           not necessarily a file on disk
+   * @return the complete contents of the specified file
+   * @throws IOException if the file cannot be read or is larger than 2GB
+   * @see Location#getMappedId(String)
+   */
+  public static String readFile(String id) throws IOException {
+    return readFile(id, Constants.ENCODING);
+  }
+
   // -- Word decoding - bytes to primitive types --

   /**

to bioformats

diff --git i/components/formats-gpl/src/loci/formats/in/MetamorphReader.java w/components/formats-gpl/src/loci/formats/in/MetamorphReader.java
index 82b27902d0..1a2630fe31 100644
--- i/components/formats-gpl/src/loci/formats/in/MetamorphReader.java
+++ w/components/formats-gpl/src/loci/formats/in/MetamorphReader.java
@@ -484,7 +484,7 @@ public class MetamorphReader extends BaseTiffReader {
       boolean useWaveNames = true;

       ndFilename = ndfile.getAbsolutePath();
-      String[] lines = DataTools.readFile(ndFilename).split("\n");
+      String[] lines = DataTools.readFile(ndFilename, "windows-1252").split("\n");

       boolean globalDoZ = true;
       boolean doTimelapse = false;
@@ -2265,7 +2265,7 @@ public class MetamorphReader extends BaseTiffReader {
    */
   private String getNDVersionSuffix(Location ndfile) throws IOException {
     ndFilename = ndfile.getAbsolutePath();
-    String[] lines = DataTools.readFile(ndFilename).split("\n");
+    String[] lines = DataTools.readFile(ndFilename, "windows-1252").split("\n");
     boolean globalDoZ = true;
     String version = NDINFOFILE_VER1;
     StringBuilder currentValue = new StringBuilder();

anntzer avatar Jun 16 '23 16:06 anntzer

Thanks @anntzer, I will try to get those opened as PR's in the next few days

dgault avatar Jun 19 '23 10:06 dgault