PdfPig Get captions from Images

Get captions from Images

Open famda opened this issue 2 years ago • 5 comments

Hello,

I've been using this library for a while, and let me say: it's awesome.

Is there any way to get image labels from a pdf (maybe using the DocumentLayoutAnalysis)?

Thanks in advance.

Jan 24 '22 11:01 famda

Hi @famda, could you give more details about what you mean by labels?

Jan 24 '22 12:01 BobLd

Sorry about that. :)

So, the idea is to be able to relate an image to its caption in a document.

Something like this:

@BobLd

Jan 24 '22 12:01 famda

@famda thanks for the explanation.

The first warning is that what you're trying to achieve is far from straightforward...

Maybe a starting point would be to have a look here https://github.com/BobLd/DocumentLayoutAnalysis#chart-and-diagram

One tool that addresses your objective seems to be this one (I guess reading the research paper would help you understand the complexity of the task): http://pdffigures.allenai.org/ | https://github.com/allenai/pdffigures EDIT: they released an updated version of their library here: https://github.com/allenai/pdffigures2

Another option would be to use machine learning, where you would train your model to spot captions (you would then still have to link images and captions). An example of a model doing that (no captions though) is available here: https://github.com/BobLd/PublayNet-maskrcnn-mlnet

Jan 24 '22 12:01 BobLd

Thanks for the reply. I do understand that this is a difficult thing to achieve (I've been trying to get this working for some time now xD).

I'll have a look at their tool and check if I can find a solution in C#. ;)

Jan 25 '22 09:01 famda

Caption candidate

Something that might be a good start is to check whether a line is a caption candidate. This is how they do it:

// Words that might start captions
private val captionStartRegex = """^(Figure.|Figure|FIGURE|Table|TABLE||Fig.|Fig|FIG.|FIG)$""".r
// Finds caption number that might follow the given word, occasionally this number will be
// incorrectly chunked with the following word if ending with : or '.' so we allow following text
private val captionNumberRegex =
    """^([1-9][0-9]*.[1-9][0-9]*|[1-9][0-9]*|[IVX]+|[1-9I][0-9I]*|[A-D].[1-9][0-9]*)($|:|.)?""".r

https://github.com/allenai/pdffigures2/blob/0fea6750c1527ef53a6749f896c7b90a538803e7/src/main/scala/org/allenai/pdffigures2/CaptionDetector.scala#L106-L112
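As a rough idea of how this could look in C#, here is a minimal sketch that ports the two patterns to .NET regex syntax. It is not PdfPig or pdffigures2 code: `IsCaptionCandidate` is a hypothetical helper, and I'm assuming the dots in the original patterns are meant literally (so they are escaped here) and that the empty alternative in the first pattern can be dropped. You would feed it the first two words of a text line.

```csharp
using System;
using System.Text.RegularExpressions;

// Port of the two pdffigures2 patterns to .NET regex syntax.
// Assumption: dots are meant literally (escaped here), and the empty
// alternative ("||") of the original first pattern is dropped.
var captionStart = new Regex(@"^(Figure\.|Figure|FIGURE|Table|TABLE|Fig\.|Fig|FIG\.|FIG)$");
var captionNumber = new Regex(
    @"^([1-9][0-9]*\.[1-9][0-9]*|[1-9][0-9]*|[IVX]+|[1-9I][0-9I]*|[A-D]\.[1-9][0-9]*)($|:|\.)?");

// Hypothetical helper: true when the first two words of a line look like "Figure 3:".
bool IsCaptionCandidate(string firstWord, string secondWord) =>
    captionStart.IsMatch(firstWord) && captionNumber.IsMatch(secondWord);

Console.WriteLine(IsCaptionCandidate("Figure", "3:"));   // True
Console.WriteLine(IsCaptionCandidate("Table", "A.2"));   // True
Console.WriteLine(IsCaptionCandidate("The", "figure"));  // False
```

Note that the number pattern has no end anchor, so a second word like "3:Overview" would still pass, which matches the comment in the original about numbers being chunked with the following text.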

They then do some filtering, as per their paper:

Filters are only applied if they do not remove all phrases referring to a particular figure. Filters include (I): Select only phrases that end with a period. (II): Select only phrases that end with a semicolon. (III): Select only phrases that have bold font. (IV): Select only phrases that have italic font. (V): Select only phrases that have a different font size than the words that follow them. Filters are iteratively applied until no false positives are left.

We resolve this problem by adding a number of additional filters to the ones used in [5], such as (I): Select phrases that are all caps. (II): Select phrases that are abbreviated. (III): Select phrases that occupy a single line. (IV): Select phrases that do not use the most commonly used font in the document. (V): Select phrases that are left aligned to the text beneath them. The last filter serves as a general purpose filter for detecting indented paragraphs or bullet points that start by mentioning a figure.
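The "only apply a filter if no figure loses all of its candidates" rule could be sketched like this. This is only an illustration under my own assumptions: the candidate tuple shape, the `ApplyFilters` helper and the stopping rule (stop once every figure has exactly one candidate left) are hypothetical and not taken from pdffigures2.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: tries each filter in order and keeps its result only
// if every figure that had a candidate before still has one afterwards.
List<T> ApplyFilters<T>(List<T> candidates, Func<T, string> figureOf,
                        IEnumerable<Func<T, bool>> filters)
{
    var current = candidates;
    foreach (var keep in filters)
    {
        // Assumed stopping rule: one candidate per figure means nothing is left to resolve.
        if (current.GroupBy(figureOf).All(g => g.Count() == 1)) break;

        var filtered = current.Where(keep).ToList();
        var figuresBefore = current.Select(figureOf).Distinct().Count();
        var figuresAfter = filtered.Select(figureOf).Distinct().Count();

        // filtered is a subset of current, so equal counts mean no figure was lost.
        if (figuresAfter == figuresBefore) current = filtered;
    }
    return current;
}

var candidates = new List<(string Figure, string Text, bool Bold)>
{
    ("1", "Figure 1.", true),
    ("1", "Figure 1", false),   // in-text mention, a false positive
    ("2", "Figure 2.", false),
};
var filters = new List<Func<(string Figure, string Text, bool Bold), bool>>
{
    c => c.Bold,                // skipped: it would remove every candidate for figure 2
    c => c.Text.EndsWith('.'),  // applied: both figures keep a candidate
};

var kept = ApplyFilters(candidates, c => c.Figure, filters);
Console.WriteLine(string.Join(", ", kept.Select(c => c.Text)));
```

In this toy input the bold filter is rejected because figure 2 has no bold candidate, while the "ends with a period" filter removes the in-text mention of figure 1 and leaves one candidate per figure.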

Clustering

Cf. Figure 3 and Section 4.2.3, "Graphical Region Identification", in the research paper. Regarding the clustering part, I did some work on that a while ago:

/// <summary>
/// Algorithm to group elements for which axis-aligned rectangle representation intersect.
/// </summary>
/// <typeparam name="T">Images, Paths, Letter, Word, TextLine, etc.</typeparam>
/// <param name="elements">Array of elements to group.</param>
/// <param name="elementRectangle">The element's rectangle to use for clustering, e.g. the bounding box.
/// <para>Treated as axis-aligned when checking for intersection.</para></param>
/// <param name="tolerance">The tolerance level to use when checking if two elements intersect.</param>
public static IEnumerable<IReadOnlyList<T>> IntersectAxisAligned<T>(IReadOnlyList<T> elements,
	Func<T, PdfRectangle> elementRectangle, double tolerance = 0)
{
	if (elements.Count == 0)
	{
		return EmptyArray<IReadOnlyList<T>>.Instance;
	}


	bool intersectsWith(PdfRectangle bbox, PdfRectangle other, double tol)
	{
		return !((bbox.TopRight.X < other.BottomLeft.X - tol) || (bbox.BottomLeft.X > other.TopRight.X + tol) ||
				 (bbox.TopRight.Y < other.BottomLeft.Y - tol) || (bbox.BottomLeft.Y > other.TopRight.Y + tol));
	}


	List<(T[], PdfRectangle)> currentBoxes = elements.Zip(elements.Select(x => elementRectangle(x)), (a, b) => (new[] { a }, b.Normalise())).ToList();


	// Adapted from https://github.com/allenai/pdffigures2/blob/master/src/main/scala/org/allenai/pdffigures2/Box.scala
	var foundIntersectingBoxes = true;


	while (foundIntersectingBoxes)
	{
		foundIntersectingBoxes = false;


		// The box we are going to check to see if there are any intersecting boxes,
		// followed by any boxes that we have already checked
		var uncheckedS = new Stack<(T[], PdfRectangle)>(currentBoxes);
		var checkedS = new Stack<(T[], PdfRectangle)>(new[] { uncheckedS.Pop() });


		while (!foundIntersectingBoxes && uncheckedS.Count > 0)
		{
			var head = checkedS.Pop();
			var inters = uncheckedS.ToLookup(x => intersectsWith(x.Item2, head.Item2, tolerance));
			var intersects = inters[true].ToList();


			if (intersects.Count > 0)
			{
				intersects.Add(head);
				var newBox = (intersects.SelectMany(x => x.Item1).ToArray(),
							  new PdfRectangle(intersects.Min(b => b.Item2.BottomLeft.X),
											   intersects.Min(b => b.Item2.BottomLeft.Y),
											   intersects.Max(b => b.Item2.TopRight.X),
											   intersects.Max(b => b.Item2.TopRight.Y)));
				currentBoxes = inters[false].ToList(); // nonIntersects
				currentBoxes.Add(newBox);
				currentBoxes.AddRange(checkedS);
				foundIntersectingBoxes = true; // Exit this loop and re-enter the outer loop
			}
			else
			{
				checkedS.Push(head);
				checkedS.Push(uncheckedS.Pop());
			}
		}
	}


	return currentBoxes.Select(x => x.Item1);
}

https://github.com/BobLd/PdfPig/blob/014f7307f0364f1a694800b21a7729600e7ea477/src/UglyToad.PdfPig.DocumentLayoutAnalysis/Clustering.cs#L337-L402
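To get a feel for what the tolerance parameter does, here is a small standalone sketch of the same overlap test on plain tuples, so it runs without PdfPig. The tuple layout and the coordinates are made up for illustration; a tolerance that suits a real document is something you would have to tune.

```csharp
using System;

// Same overlap test as intersectsWith above, on plain tuples instead of
// PdfRectangle (hypothetical stand-in so the example runs without PdfPig).
bool Intersects((double Left, double Bottom, double Right, double Top) a,
                (double Left, double Bottom, double Right, double Top) b, double tol) =>
    !(a.Right < b.Left - tol || a.Left > b.Right + tol ||
      a.Top < b.Bottom - tol || a.Bottom > b.Top + tol);

// Made-up coordinates: the caption's top edge sits 8 units below the image's bottom edge.
var image = (Left: 100.0, Bottom: 300.0, Right: 400.0, Top: 500.0);
var caption = (Left: 100.0, Bottom: 280.0, Right: 400.0, Top: 292.0);

Console.WriteLine(Intersects(image, caption, 0));   // False: the boxes do not touch
Console.WriteLine(Intersects(image, caption, 10));  // True: the 8-unit gap is within tolerance
```

With a tolerance larger than the gap, the clustering above would put the image and the caption line in the same group, which could be one way to link them.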

Jan 25 '22 10:01 BobLd
