mammoth.js

Bug : uncompressed data size mismatch

Open Safee-ullah1 opened this issue 1 year ago • 10 comments

I keep getting this error no matter which docx file I use.

It started with langchain; I then tried using mammoth directly, but hit the same issue:

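import { extractRawText } from "mammoth";

// Extract the plain text from a DOCX file held in memory.
async function extractRawTextFromDocx(arrayBuffer: ArrayBuffer): Promise<string> {
  try {
    const result = await extractRawText({ arrayBuffer });
    return result.value; // Extracted text
  } catch (error) {
    console.error("Error extracting text from DOCX:", error);
    // Rethrow in every case so a non-Error throw is not silently swallowed
    const message = error instanceof Error ? error.message : String(error);
    throw new Error(`Failed to extract text from DOCX: ${message}`);
  }
}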

Safee-ullah1, Sep 07 '24 07:09

I'm afraid I can't help without some additional information:

  • Could you provide a minimal example document?
  • Could you provide a minimal, complete program that reproduces the problem?
  • What environment are you calling from?
  • If you use the CLI of mammoth.js on node.js, do you get the same problem?
  • What's the full error you get?

mwilliamson, Sep 07 '24 09:09

I'm experiencing the same error. Here are the details you requested:

1. Minimal example document

Here's a simple DOCX example: TestDoc.docx

2. Minimal program

Here's a Next.js API route I created for testing:

import { NextResponse } from 'next/server';
import { createClient } from '@supabase/supabase-js';
import mammoth from 'mammoth';

const supabaseUrl = process.env.NEXT_PUBLIC_SUPABASE_URL!;
const supabaseServiceRoleKey = process.env.SUPABASE_SERVICE_ROLE_KEY!;

const supabase = createClient(supabaseUrl, supabaseServiceRoleKey);

// Get mammoth version
const mammothVersion = require('mammoth/package.json').version;

export async function GET() {
  // File details
  const bucketName = 'documents';
  const filePath = '5deaa894-2094-4da3-b4fd-1fada0809d1c/1725842453438_eca8ffb8/TestDoc.docx'; 

  try {
    console.log('Supabase URL:', supabaseUrl);
    console.log('Attempting to download file:', filePath);
    console.log('Mammoth version:', mammothVersion);

    // Download the file from Supabase storage
    const { data, error } = await supabase.storage
      .from(bucketName)
      .download(filePath);

    if (error) {
      throw new Error(`Failed to download file: ${error.message}`);
    }

    console.log('File downloaded successfully. Size:', data.size);
    console.log('File type:', data.type);

    // Convert the file to an ArrayBuffer
    const arrayBuffer = await data.arrayBuffer();
    console.log('ArrayBuffer length:', arrayBuffer.byteLength);
    console.log('First 20 bytes:', new Uint8Array(arrayBuffer.slice(0, 20)));

    console.log('Extracting raw text...');
    // Convert ArrayBuffer to Buffer
    const buffer = Buffer.from(arrayBuffer);

    // Use mammoth to extract the raw text
    const result = await mammoth.extractRawText({ buffer: buffer });

    console.log('Extraction successful.');
    
    return NextResponse.json({
      success: true,
      mammothVersion: mammothVersion,
      rawText: result.value.substring(0, 500), // First 500 characters
      warnings: result.messages,
      fileInfo: {
        size: data.size,
        type: data.type,
        arrayBufferLength: arrayBuffer.byteLength,
        firstBytes: Array.from(new Uint8Array(arrayBuffer.slice(0, 20))),
      },
    });
  } catch (error) {
    console.error('Error:', error);
    return NextResponse.json({ 
      success: false, 
      mammothVersion: mammothVersion,
      error: error instanceof Error ? error.message : String(error),
      supabaseUrl: supabaseUrl,
      filePath: filePath,
    }, { status: 500 });
  }
}

Output from this program in browser:

{
  "success": false,
  "mammothVersion": "1.8.0",
  "error": "Bug : uncompressed data size mismatch",
  "supabaseUrl": "http://127.0.0.1:54321",
  "filePath": "5deaa894-2094-4da3-b4fd-1fada0809d1c/1725842453438_eca8ffb8/TestDoc.docx"
}

3. Environment

Next.js app running on Node.js v16.14.2 (server-side)

4. CLI test

I haven't tested with the CLI as I'm running this in a Next.js environment.

5. Console logs when trying above test

web:dev: Supabase URL: http://127.0.0.1:54321
web:dev: Attempting to download file: 5deaa894-2094-4da3-b4fd-1fada0809d1c/1725842453438_eca8ffb8/TestDoc.docx
web:dev: Mammoth version: 1.8.0
web:dev: File downloaded successfully. Size: 13133
web:dev: File type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
web:dev: ArrayBuffer length: 13133
web:dev: First 20 bytes: Uint8Array(20) [
web:dev:    80,  75,   3,   4, 20,  0, 6,
web:dev:     0,   8,   0,   0,  0, 33, 0,
web:dev:   223, 164, 210, 108, 90,  1
web:dev: ]
web:dev: Extracting raw text...
web:dev: Error: Error: Bug : uncompressed data size mismatch
web:dev:     at DataLengthProbe.<anonymous> (/Users/johnhughes/Projects (Git)/codex/apps/web/.next/server/chunks/f2b73_jszip_lib_d6231f._.js:1631:23)
web:dev:     at DataLengthProbe.emit (/Users/johnhughes/Projects (Git)/codex/apps/web/.next/server/chunks/f2b73_jszip_lib_d6231f._.js:751:42)
web:dev:     at DataLengthProbe.end (/Users/johnhughes/Projects (Git)/codex/apps/web/.next/server/chunks/f2b73_jszip_lib_d6231f._.js:698:18)
web:dev:     at FlateWorker.<anonymous> (/Users/johnhughes/Projects (Git)/codex/apps/web/.next/server/chunks/f2b73_jszip_lib_d6231f._.js:783:18)
web:dev:     at FlateWorker.emit (/Users/johnhughes/Projects (Git)/codex/apps/web/.next/server/chunks/f2b73_jszip_lib_d6231f._.js:751:42)
web:dev:     at FlateWorker.end (/Users/johnhughes/Projects (Git)/codex/apps/web/.next/server/chunks/f2b73_jszip_lib_d6231f._.js:698:18)
web:dev:     at DataWorker.<anonymous> (/Users/johnhughes/Projects (Git)/codex/apps/web/.next/server/chunks/f2b73_jszip_lib_d6231f._.js:783:18)
web:dev:     at DataWorker.emit (/Users/johnhughes/Projects (Git)/codex/apps/web/.next/server/chunks/f2b73_jszip_lib_d6231f._.js:751:42)
web:dev:     at DataWorker.end (/Users/johnhughes/Projects (Git)/codex/apps/web/.next/server/chunks/f2b73_jszip_lib_d6231f._.js:698:18)
web:dev:     at module.exports.[project]/node_modules/.pnpm/[email protected]/node_modules/jszip/lib/stream/DataWorker.js [app-route] (ecmascript).DataWorker._tick (/Users/johnhughes/Projects (Git)/codex/apps/web/.next/server/chunks/f2b73_jszip_lib_d6231f._.js:1464:21)
web:dev:     at module.exports.[project]/node_modules/.pnpm/[email protected]/node_modules/jszip/lib/stream/DataWorker.js [app-route] (ecmascript).DataWorker._tickAndRepeat (/Users/johnhughes/Projects (Git)/codex/apps/web/.next/server/chunks/f2b73_jszip_lib_d6231f._.js:1448:10)
web:dev:     at Immediate.<anonymous> (/Users/johnhughes/Projects (Git)/codex/apps/web/.next/server/chunks/f2b73_jszip_lib_d6231f._.js:558:18)
web:dev: From previous event:
web:dev:     at Promise.longStackTracesCaptureStackTrace [as _captureStackTrace] (/Users/johnhughes/Projects (Git)/codex/apps/web/.next/server/chunks/b3bc1_bluebird_js_release_c4a240._.js:1447:23)
web:dev:     at Promise._then (/Users/johnhughes/Projects (Git)/codex/apps/web/.next/server/chunks/b3bc1_bluebird_js_release_c4a240._.js:4352:21)
web:dev:     at Promise.then (/Users/johnhughes/Projects (Git)/codex/apps/web/.next/server/chunks/b3bc1_bluebird_js_release_c4a240._.js:4267:21)
web:dev:     at Object.extractRawText (/Users/johnhughes/Projects (Git)/codex/apps/web/.next/server/chunks/f85dd_mammoth_8f32fe._.js:3628:33)
web:dev:     at GET (/Users/johnhughes/Projects (Git)/codex/apps/web/.next/server/chunks/app_api_test_route_ts_5314dd._.js:44:231)
web:dev:     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
web:dev:     at async /Users/johnhughes/Projects (Git)/codex/node_modules/.pnpm/[email protected]_@[email protected]_@[email protected][email protected][email protected][email protected]/node_modules/next/dist/compiled/next-server/app-route.runtime.dev.js:6:55038
web:dev:     at async ek.execute (/Users/johnhughes/Projects (Git)/codex/node_modules/.pnpm/[email protected]_@[email protected]_@[email protected][email protected][email protected][email protected]/node_modules/next/dist/compiled/next-server/app-route.runtime.dev.js:6:45808)
web:dev:     at async ek.handle (/Users/johnhughes/Projects (Git)/codex/node_modules/.pnpm/[email protected]_@[email protected]_@[email protected][email protected][email protected][email protected]/node_modules/next/dist/compiled/next-server/app-route.runtime.dev.js:6:56292)
web:dev:     at async doRender (/Users/johnhughes/Projects (Git)/codex/node_modules/.pnpm/[email protected]_@[email protected]_@[email protected][email protected][email protected][email protected]/node_modules/next/dist/server/base-server.js:1357:42)
web:dev:     at async cacheEntry.responseCache.get.routeKind (/Users/johnhughes/Projects (Git)/codex/node_modules/.pnpm/[email protected]_@[email protected]_@[email protected][email protected][email protected][email protected]/node_modules/next/dist/server/base-server.js:1579:28)
web:dev:     at async DevServer.renderToResponseWithComponentsImpl (/Users/johnhughes/Projects (Git)/codex/node_modules/.pnpm/[email protected]_@[email protected]_@[email protected][email protected][email protected][email protected]/node_modules/next/dist/server/base-server.js:1487:28)
web:dev:     at async DevServer.renderPageComponent (/Users/johnhughes/Projects (Git)/codex/node_modules/.pnpm/[email protected]_@[email protected]_@[email protected][email protected][email protected][email protected]/node_modules/next/dist/server/base-server.js:1911:24)
web:dev:     at async DevServer.renderToResponseImpl (/Users/johnhughes/Projects (Git)/codex/node_modules/.pnpm/[email protected]_@[email protected]_@[email protected][email protected][email protected][email protected]/node_modules/next/dist/server/base-server.js:1949:32)
web:dev:     at async DevServer.pipeImpl (/Users/johnhughes/Projects (Git)/codex/node_modules/.pnpm/[email protected]_@[email protected]_@[email protected][email protected][email protected][email protected]/node_modules/next/dist/server/base-server.js:916:25)
web:dev:     at async NextNodeServer.handleCatchallRenderRequest (/Users/johnhughes/Projects (Git)/codex/node_modules/.pnpm/[email protected]_@[email protected]_@[email protected][email protected][email protected][email protected]/node_modules/next/dist/server/next-server.js:272:17)
web:dev:     at async DevServer.handleRequestImpl (/Users/johnhughes/Projects (Git)/codex/node_modules/.pnpm/[email protected]_@[email protected]_@[email protected][email protected][email protected][email protected]/node_modules/next/dist/server/base-server.js:812:17)
web:dev:     at async /Users/johnhughes/Projects (Git)/codex/node_modules/.pnpm/[email protected]_@[email protected]_@[email protected][email protected][email protected][email protected]/node_modules/next/dist/server/dev/next-dev-server.js:339:20
web:dev:     at async Span.traceAsyncFn (/Users/johnhughes/Projects (Git)/codex/node_modules/.pnpm/[email protected]_@[email protected]_@[email protected][email protected][email protected][email protected]/node_modules/next/dist/trace/trace.js:154:20)
web:dev:     at async DevServer.handleRequest (/Users/johnhughes/Projects (Git)/codex/node_modules/.pnpm/[email protected]_@[email protected]_@[email protected][email protected][email protected][email protected]/node_modules/next/dist/server/dev/next-dev-server.js:336:24)
web:dev:     at async invokeRender (/Users/johnhughes/Projects (Git)/codex/node_modules/.pnpm/[email protected]_@[email protected]_@[email protected][email protected][email protected][email protected]/node_modules/next/dist/server/lib/router-server.js:173:21)
web:dev:     at async handleRequest (/Users/johnhughes/Projects (Git)/codex/node_modules/.pnpm/[email protected]_@[email protected]_@[email protected][email protected][email protected][email protected]/node_modules/next/dist/server/lib/router-server.js:350:24)
web:dev:     at async requestHandlerImpl (/Users/johnhughes/Projects (Git)/codex/node_modules/.pnpm/[email protected]_@[email protected]_@[email protected][email protected][email protected][email protected]/node_modules/next/dist/server/lib/router-server.js:374:13)
web:dev:     at async Server.requestListener (/Users/johnhughes/Projects (Git)/codex/node_modules/.pnpm/[email protected]_@[email protected]_@[email protected][email protected][email protected][email protected]/node_modules/next/dist/server/lib/start-server.js:141:13)
web:dev:  GET /api/test 500 in 253ms

Thank you for any help you can provide.

jj3ny, Sep 09 '24 01:09

Yeah, that's true, I'm sorry for the incomplete question.

My code is pretty much exactly the same. When running mammoth through the CLI, with pnpx mammoth file-sample_100kB.docx1 the output is okay.

Safee-ullah1, Sep 09 '24 02:09

> 4. CLI test
>
> I haven't tested with the CLI as I'm running this in a Next.js environment.

Could you try using the CLI? This will help to determine if the issue is with Mammoth itself or the way you're using it.

> My code is pretty much exactly the same. When running mammoth through the CLI, with pnpx mammoth file-sample_100kB.docx1 the output is okay.

That suggests that the issue is that the array buffer you're passing to Mammoth is incorrect, so I'd suggest verifying that it has the correct contents. I'm afraid I can't provide any help without further details.
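
One way to do that verification, as a rough sketch rather than anything Mammoth provides: a .docx file is a ZIP archive, so the buffer would normally start with the bytes 80, 75, 3, 4 ("PK\x03\x04") and contain an end-of-central-directory record ("PK\x05\x06") near the end. The helper name checkZipBuffer and the ZipCheck shape are hypothetical, made up for this example. It only catches empty, truncated, or non-ZIP buffers; it does not validate the compressed data itself, and it ignores ZIP64 archives.

```typescript
// Hypothetical helper (not part of Mammoth or JSZip): basic sanity checks on a
// buffer that is supposed to hold a .docx file, i.e. a ZIP archive.
interface ZipCheck {
  startsWithLocalHeader: boolean;    // first 4 bytes are "PK\x03\x04"
  hasEndOfCentralDirectory: boolean; // "PK\x05\x06" record found near the end
  entryCount: number | null;         // entry count stored in that record, if found
}

function checkZipBuffer(data: ArrayBuffer): ZipCheck {
  const bytes = new Uint8Array(data);
  const view = new DataView(data);

  const startsWithLocalHeader =
    bytes.length >= 4 &&
    bytes[0] === 0x50 && bytes[1] === 0x4b &&
    bytes[2] === 0x03 && bytes[3] === 0x04;

  // The end-of-central-directory record is at least 22 bytes long and may be
  // followed by a comment of up to 65535 bytes, so scan backwards for it.
  const lowest = Math.max(0, bytes.length - 22 - 0xffff);
  for (let i = bytes.length - 22; i >= lowest; i--) {
    if (view.getUint32(i, true) === 0x06054b50) {
      return {
        startsWithLocalHeader,
        hasEndOfCentralDirectory: true,
        // Total number of entries is a 16-bit field at offset 10 of the record.
        entryCount: view.getUint16(i + 10, true),
      };
    }
  }
  return { startsWithLocalHeader, hasEndOfCentralDirectory: false, entryCount: null };
}
```

Calling checkZipBuffer(arrayBuffer) just before handing the data to Mammoth, and logging the result, would show whether the download was cut short or altered. A buffer that passes both checks can still fail inside JSZip, in which case the buffer itself is less likely to be the culprit.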

mwilliamson, Sep 09 '24 09:09

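import mammoth from "mammoth";

// Download a DOCX file and return its raw text.
const processFile = async (url: string): Promise<string> => {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error("Network response was not ok");
  }
  const arrayBuffer = await response.arrayBuffer();
  const buffer = Buffer.from(arrayBuffer);
  // extractRawText resolves to a result object; the text is in `value`
  const result = await mammoth.extractRawText({ buffer });
  return result.value;
};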

It seems that this code produces the error when run inside a Vercel serverless function. I was getting this exact error in langchain and tried using mammoth directly to fix it, but ran into the same error. Another common factor is that both of us are downloading the file over the internet from bucket storage. I'm using S3 to store the files.

Safee-ullah1, Sep 10 '24 00:09

I have the same problem.

This gives the error:

"use client";
import { convertToHtml } from "mammoth"; // Only this changes
...

const html = (await convertToHtml({ arrayBuffer: buffer }, options)).value;

This works:

"use client";
import { convertToHtml } from "mammoth/mammoth.browser"; // Only this changes
...

const html = (await convertToHtml({ arrayBuffer: buffer }, options)).value;

I use Next.js 14 and the code runs in the browser.

3dteemu, Sep 11 '24 13:09

@3dteemu Thanks a ton for that tip. For my use case switching to client-side processing (rather than server-side, as I had been doing initially) using the changed import statement you suggested works well, so that fixed the issue. Many thanks!!

jj3ny, Sep 11 '24 18:09

I'm afraid without a minimal, complete example of how to reproduce the problem, there's not much I can do to investigate. So far as I can tell (and apologies if I've missed it), all of the examples posted so far are fragments of larger programs.

If you want to investigate yourselves (which is likely to be the best route for environments such as Next.js and Vercel that I'm unfamiliar with), then Mammoth is just passing the buffer to JSZip.loadAsync(), and then reading the entries using the zipFile.file(name).async("uint8array") method. Seeing if you can reproduce and then investigate the problem using JSZip directly is probably a good place to start.
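
As a starting point for that kind of investigation, here is a minimal sketch that loads an archive and decompresses every entry, recording which ones fail. To keep it free of imports, the loader is passed in as a parameter; the names probeZipEntries, LoadZip, ZipArchiveLike, ZipEntryLike and EntryReport are hypothetical, and the two small interfaces only describe the parts of a JSZip archive this sketch assumes (the files map, the dir flag, and async("uint8array")).

```typescript
// Minimal structural types for the parts of JSZip used here. The real JSZip
// objects are assumed to satisfy them; nothing is imported by this sketch.
interface ZipEntryLike {
  dir: boolean;
  async(type: "uint8array"): Promise<Uint8Array>;
}
interface ZipArchiveLike {
  files: { [name: string]: ZipEntryLike };
}
type LoadZip = (data: Uint8Array | ArrayBuffer) => Promise<ZipArchiveLike>;

interface EntryReport {
  name: string;
  size: number | null;  // decompressed size, or null if reading failed
  error: string | null; // error message, or null if reading succeeded
}

// Hypothetical helper: load the archive, then decompress each file entry in
// turn and record the outcome instead of stopping at the first failure.
async function probeZipEntries(
  loadZip: LoadZip,
  data: Uint8Array | ArrayBuffer
): Promise<EntryReport[]> {
  const zip = await loadZip(data);
  const reports: EntryReport[] = [];
  for (const name of Object.keys(zip.files)) {
    const entry = zip.files[name];
    if (!entry || entry.dir) continue; // skip directory entries
    try {
      const bytes = await entry.async("uint8array");
      reports.push({ name, size: bytes.length, error: null });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      reports.push({ name, size: null, error: message });
    }
  }
  return reports;
}
```

With JSZip installed, the call would look something like probeZipEntries((d) => JSZip.loadAsync(d), buffer). If the same "uncompressed data size mismatch" message shows up in a report, the problem reproduces without Mammoth involved, which narrows it down to JSZip, the bundler, or the data.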

mwilliamson, Sep 11 '24 19:09

I was about to create a minimal example that would reproduce this issue. While doing that I found out that (at least for me) the problem is with Turbopack.

This does not work:

next dev --turbo

This works:

next dev

Without Turbopack it is possible to import mammoth with

import mammoth from 'mammoth';

without using the browser version and everything works.

Since Turbopack is still in beta in Next.js, and I don't have the skills to debug it, I'll just use Webpack. For me this is solved for now.

3dteemu, Sep 12 '24 11:09

> This does not work:
>
> next dev --turbo
>
> This works:
>
> next dev

Thanks so much!

lesenelir, Sep 12 '24 15:09

Sounds like this is potentially a (fixed) bug in Turbopack. Since there aren't any reproduction steps for this issue, I don't have a way of investigating further, so I'm closing this.

mwilliamson, Dec 23 '24 07:12