[Java] Extracting Merged-Cell Content from PDF Tables with Tabula

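I dug this hole, so I still have to fill it, though how smoothly it ends up filled is open to debate (please go easy on me...).

My skills are limited, so this is as far as I can take it. If any experts out there have a better implementation, please share it so we can learn from each other.

First, a caveat: the code below is for research and reference only. Check it against your own files before using it, because not every PDF table layout is suitable.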

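The difficulty here is that PDF is a visual format, not a semantic one.

It only records instructions such as "draw the text 'ABC' at coordinates (x, y)" and "draw a line from (x1, y1) to (x2, y2)". It has no notion of a "table", a "row", or a "merged cell". Tabula's SpreadsheetExtractionAlgorithm is a solid starting point for this kind of problem, but its result can be "irregular": the number of cells may differ from row to row. So this time I parse the table in a post-processing step. Tabula mostly just extracts the content, and the table structure is rebuilt afterwards.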
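To make "irregular" concrete, here is a minimal sketch using made-up rows rather than real Tabula output. `RaggedCheck` and `isRagged` are hypothetical names introduced only for this illustration. It shows why reading a column by index is unreliable once a merged cell leaves a row short.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Tabula: reports whether rows have unequal lengths. */
class RaggedCheck {

    static boolean isRagged(List<List<String>> rows) {
        if (rows.isEmpty()) {
            return false;
        }
        int expected = rows.get(0).size();
        for (List<String> row : rows) {
            if (row.size() != expected) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up data: "Model" spans two rows, so the second row has no cell for it.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Model", "A", "1"),
                Arrays.asList("B", "2"));
        System.out.println(isRagged(rows)); // prints true
        // Index 0 of the second row is "B", which really belongs to the second column.
        System.out.println(rows.get(1).get(0)); // prints B
    }
}
```

Because the index no longer identifies the column, the position of each cell has to come from its coordinates instead.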

This follows on from my previous article:

[Java] Extracting Table Data from PDF Files with Tabula

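The core idea is to compute the full bounding box of every cell: its top, left, bottom and right. Collecting the top and bottom coordinates of all cells lets us infer every "real" row boundary in the table. In the same way, collecting the left and right coordinates gives every "real" column boundary. Finally, a complete grid is built from those boundaries and the text blocks extracted by Tabula are "put" back into that grid.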
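The idea can be tried without any PDF at all. The sketch below is a simplified illustration, not the code used later in this post: `Box`, `GridSketch`, `toGrid` and `sample` are hypothetical names, the coordinates are invented, and it assumes edges match exactly, whereas real PDF coordinates need a tolerance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Stand-alone sketch; Box and GridSketch are illustrative names, not Tabula types. */
class GridSketch {

    /** A cell's bounding box plus its text. */
    static final class Box {
        final float top, left, bottom, right;
        final String text;

        Box(float top, float left, float bottom, float right, String text) {
            this.top = top;
            this.left = left;
            this.bottom = bottom;
            this.right = right;
            this.text = text;
        }
    }

    /** Rebuilds a rectangular grid, copying a merged cell's text into every slot it covers. */
    static String[][] toGrid(List<Box> boxes) {
        if (boxes.isEmpty()) {
            return new String[0][0];
        }
        TreeSet<Float> rowEdges = new TreeSet<>();
        TreeSet<Float> colEdges = new TreeSet<>();
        for (Box b : boxes) {
            rowEdges.add(b.top);
            rowEdges.add(b.bottom);
            colEdges.add(b.left);
            colEdges.add(b.right);
        }
        List<Float> rows = new ArrayList<>(rowEdges);
        List<Float> cols = new ArrayList<>(colEdges);

        // N + 1 sorted edges delimit N rows (or columns).
        String[][] grid = new String[rows.size() - 1][cols.size() - 1];
        for (String[] row : grid) {
            Arrays.fill(row, "");
        }
        for (Box b : boxes) {
            // Assumption: exact matches, so indexOf finds each edge.
            int r0 = rows.indexOf(b.top), r1 = rows.indexOf(b.bottom);
            int c0 = cols.indexOf(b.left), c1 = cols.indexOf(b.right);
            for (int r = r0; r < r1; r++) {
                for (int c = c0; c < c1; c++) {
                    grid[r][c] = b.text;
                }
            }
        }
        return grid;
    }

    /** Invented table: a header merged across two columns, a label merged across two rows. */
    static List<Box> sample() {
        return Arrays.asList(
                new Box(0, 0, 10, 100, "Header"),
                new Box(10, 0, 30, 50, "Model"),
                new Box(10, 50, 20, 100, "A"),
                new Box(20, 50, 30, 100, "B"));
    }

    public static void main(String[] args) {
        // prints [[Header, Header], [Model, A], [Model, B]]
        System.out.println(Arrays.deepToString(toGrid(sample())));
    }
}
```

Both merged cells end up repeated in every slot they cover, which is the shape the rest of this post aims for.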

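To make testing easy I used the table from the "模型细节" (Model Details) section of the DeepSeek website.

It is a fairly classic table: it has cells merged across columns as well as cells merged across rows, and its content is not overly complicated.

Here is the code I ran: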

package cn.paohe;

import java.awt.Point;
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.NavigableSet;
import java.util.Set;
import java.util.TreeSet;

import org.apache.pdfbox.pdmodel.PDDocument;

import technology.tabula.ObjectExtractor;
import technology.tabula.Page;
import technology.tabula.PageIterator;
import technology.tabula.Rectangle;
import technology.tabula.RectangularTextContainer;
import technology.tabula.Table;
import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;

public class MergedCellPdfExtractor {

    // Tolerance used when comparing floating-point coordinates
    private static final float COORDINATE_TOLERANCE = 2.0f;

    /**
     * To keep the sample simple, a cell (its text and its bounds) is wrapped in a nested class.
     */
    static class Cell extends Rectangle {
        public String text;
        public float top;
        public float left;
        // These two fields hide the float width/height inherited from Rectangle2D.Float,
        // so always read the geometry through the overridden getters below
        public double width;
        public double height;

        public Cell(float top, float left, double width, double height, String text) {
            this.top = top;
            this.left = left;
            this.width = width;
            this.height = height;
            this.text = text == null ? "" : text.trim();
        }

        @Override
        public float getTop() {
            return top;
        }

        @Override
        public float getLeft() {
            return left;
        }

        @Override
        public double getWidth() {
            return width;
        }

        @Override
        public double getHeight() {
            return height;
        }

        @Override
        public float getBottom() {
            return (float) (top + height);
        }

        @Override
        public float getRight() {
            return (float) (left + width);
        }

        @Override
        public double getX() {
            return left;
        }

        @Override
        public double getY() {
            return top;
        }

        // Corner points are not needed in this sample
        @Override
        public Point[] getPoints() {
            return new Point[0];
        }

        @Override
        public String toString() {
            return String.format("Cell[t=%.2f, l=%.2f, w=%.2f, h=%.2f, text='%s']", top, left, width, height, text);
        }
    }

    /**
     * Table extraction can report the same table more than once with slightly different
     * coordinates, so a fingerprint is defined for de-duplication.
     */
    static class TableFingerprint {
        private final float top;
        private final float left;
        private final int rowCount;
        private final int colCount;
        private final String contentHash;

        public TableFingerprint(Table table) {
            this.top = roundCoordinate(table.getTop());
            this.left = roundCoordinate(table.getLeft());
            this.rowCount = table.getRowCount();
            this.colCount = table.getColCount();
            this.contentHash = generateContentHash(table);
        }

        /**
         * Builds a content "hash" used to quickly compare two tables.
         *
         * Rule: the text of each cell is concatenated with "|" as the separator.
         * Generation stops once more than 10 cells have been appended.
         *
         * @param table the table to fingerprint
         * @return the concatenated text of the leading cells
         */
        private String generateContentHash(Table table) {
            StringBuilder sb = new StringBuilder();
            int cellCount = 0;
            for (List<RectangularTextContainer> row : table.getRows()) {
                for (RectangularTextContainer cell : row) {
                    sb.append(cell.getText()).append("|");
                    cellCount++;
                    if (cellCount > 10) {
                        break;
                    }
                }
                if (cellCount > 10) {
                    break;
                }
            }
            return sb.toString();
        }

        @Override
        public boolean equals(Object obj) {
            if (!(obj instanceof TableFingerprint)) {
                return false;
            }

            TableFingerprint other = (TableFingerprint) obj;

            // The top and left coordinates may differ by less than COORDINATE_TOLERANCE
            boolean topMatch = Math.abs(this.top - other.top) < COORDINATE_TOLERANCE;
            boolean leftMatch = Math.abs(this.left - other.left) < COORDINATE_TOLERANCE;

            // The row and column counts must be identical
            boolean rowMatch = this.rowCount == other.rowCount;
            boolean colMatch = this.colCount == other.colCount;

            // The content hashes must be equal
            boolean contentMatch = this.contentHash.equals(other.contentHash);

            return topMatch && leftMatch && rowMatch && colMatch && contentMatch;
        }

        // Only the content hash is used here, so fingerprints that are equal
        // (which requires equal content) always share a hash code
        @Override
        public int hashCode() {
            return contentHash.hashCode();
        }

        /**
         * Rounds a coordinate to one decimal place to reduce floating-point noise.
         *
         * @param coord the coordinate to round
         * @return the rounded coordinate
         */
        private static float roundCoordinate(float coord) {
            return Math.round(coord * 10) / 10.0f;
        }
    }

    /**
     * Parses every table in a PDF file.
     *
     * 1. Use ObjectExtractor to walk through the pages of the PDF.
     * 2. Use SpreadsheetExtractionAlgorithm to detect tables from ruling lines, skipping duplicates.
     * 3. Normalize each table and resolve merged cells.
     *
     * @param pdfFile the PDF file to parse
     * @return the normalized data; each List<List<String>> is one table
     * @throws IOException if the file cannot be read
     */
    public List<List<List<String>>> parseTables(File pdfFile) throws IOException {
        List<List<List<String>>> allNormalizedTables = new ArrayList<>();
        Set<TableFingerprint> seenTables = new HashSet<>();

        // try-with-resources closes the document and then the underlying stream
        try (InputStream bufferedStream = new BufferedInputStream(new FileInputStream(pdfFile));
                PDDocument pdDocument = PDDocument.load(bufferedStream)) {
            ObjectExtractor oe = new ObjectExtractor(pdDocument);
            PageIterator pi = oe.extract();

            while (pi.hasNext()) {
                Page page = pi.next();

                // Detect tables from ruling lines
                SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
                List<Table> tables = sea.extract(page);

                for (Table table : tables) {
                    // De-duplication: add() returns false if an equal fingerprint was seen before
                    TableFingerprint fingerprint = new TableFingerprint(table);
                    if (!seenTables.add(fingerprint)) {
                        System.out.println("Skipping duplicate table: top=" + fingerprint.top + ", left=" + fingerprint.left);
                        continue;
                    }

                    List<List<String>> normalized = normalizeTable(table);
                    if (!normalized.isEmpty()) {
                        allNormalizedTables.add(normalized);
                    }
                }
            }
        }
        return allNormalizedTables;
    }

    /**
     * Normalizes a table and resolves merged cells.
     *
     * @param table the raw table extracted by Tabula
     * @return the normalized table as a List<List<String>>
     */
    private List<List<String>> normalizeTable(Table table) {
        // 1. Collect every cell together with its coordinates
        List<Cell> allCells = new ArrayList<>();
        for (List<RectangularTextContainer> row : table.getRows()) {
            for (RectangularTextContainer tc : row) {
                // Skip zero-sized cells (for example padding placeholders): they carry no
                // geometry and would only add a bogus boundary at their coordinates
                if (tc.getWidth() <= 0 || tc.getHeight() <= 0) {
                    continue;
                }
                allCells.add(new Cell(tc.getTop(), tc.getLeft(), tc.getWidth(), tc.getHeight(), tc.getText()));
            }
        }

        if (allCells.isEmpty()) {
            return new ArrayList<>();
        }

        // 2. Collect every distinct row and column boundary (start and end edges)
        NavigableSet<Float> rowBoundaries = new TreeSet<>();
        NavigableSet<Float> colBoundaries = new TreeSet<>();

        for (Cell cell : allCells) {
            rowBoundaries.add(roundCoordinate(cell.getTop()));
            rowBoundaries.add(roundCoordinate(cell.getBottom()));
            colBoundaries.add(roundCoordinate(cell.getLeft()));
            colBoundaries.add(roundCoordinate(cell.getRight()));
        }

        // 3. N + 1 boundaries delimit N rows, so drop only the last one (bottom / right edge);
        // every remaining coordinate is then the start of exactly one row or column
        List<Float> rowCoords = new ArrayList<>(rowBoundaries);
        List<Float> colCoords = new ArrayList<>(colBoundaries);

        if (rowCoords.size() > 1) {
            rowCoords.remove(rowCoords.size() - 1);
        }
        if (colCoords.size() > 1) {
            colCoords.remove(colCoords.size() - 1);
        }

        // 4. Validate the grid; fall back to the raw rows if it is unusable
        if (rowCoords.isEmpty() || colCoords.isEmpty()) {
            return tableToListOfListOfStrings(table);
        }

        int numRows = rowCoords.size();
        int numCols = colCoords.size();
        // Slots start out as null, meaning "not covered by any cell yet"
        String[][] grid = new String[numRows][numCols];

        // 5. Put each cell's text into every grid slot it covers
        for (Cell cell : allCells) {
            int startRow = findCellStartIndex(rowCoords, cell.getTop());
            int startCol = findCellStartIndex(colCoords, cell.getLeft());
            int endRow = Math.max(startRow, findCellEndIndex(rowCoords, cell.getBottom()));
            int endCol = Math.max(startCol, findCellEndIndex(colCoords, cell.getRight()));

            for (int r = startRow; r <= endRow; r++) {
                for (int c = startCol; c <= endCol; c++) {
                    String existing = grid[r][c];
                    if (existing == null || existing.isEmpty()) {
                        grid[r][c] = cell.text;
                    } else if (!cell.text.isEmpty()) {
                        // Overlapping cells: append instead of overwriting
                        grid[r][c] = existing + " " + cell.text;
                    }
                }
            }
        }

        // 6. Fallback for slots that no cell covers: take the left neighbour first, then the one above
        for (int r = 0; r < numRows; r++) {
            for (int c = 0; c < numCols; c++) {
                if (grid[r][c] != null) {
                    continue;
                }
                if (c > 0 && grid[r][c - 1] != null && !grid[r][c - 1].isEmpty()) {
                    grid[r][c] = grid[r][c - 1];
                } else if (r > 0 && grid[r - 1][c] != null && !grid[r - 1][c].isEmpty()) {
                    grid[r][c] = grid[r - 1][c];
                } else {
                    grid[r][c] = "";
                }
            }
        }

        // 7. Convert the two-dimensional array into a List<List<String>>
        List<List<String>> normalizedTable = new ArrayList<>();
        for (int r = 0; r < numRows; r++) {
            List<String> normalizedRow = new ArrayList<>();
            for (int c = 0; c < numCols; c++) {
                normalizedRow.add(grid[r][c]);
            }
            normalizedTable.add(normalizedRow);
        }

        return normalizedTable;
    }

    /**
     * Rounds a coordinate to one decimal place to reduce floating-point noise.
     */
    private float roundCoordinate(float coord) {
        return Math.round(coord * 10) / 10.0f;
    }

    /**
     * Finds the grid index at which a cell starts.
     *
     * @param coords the start coordinates of all rows (or columns), ascending
     * @param value  the cell's top (or left) edge
     */
    private int findCellStartIndex(List<Float> coords, float value) {
        float roundedValue = roundCoordinate(value);

        for (int i = 0; i < coords.size(); i++) {
            // The first start line at or after the cell's edge (within tolerance) is its slot
            if (roundedValue <= coords.get(i) + COORDINATE_TOLERANCE) {
                return i;
            }
        }

        return coords.size() - 1;
    }

    /**
     * Finds the grid index at which a cell ends (inclusive).
     *
     * @param coords the start coordinates of all rows (or columns), ascending
     * @param value  the cell's bottom (or right) edge
     */
    private int findCellEndIndex(List<Float> coords, float value) {
        float roundedValue = roundCoordinate(value);

        for (int i = coords.size() - 1; i >= 0; i--) {
            // The last slot whose start line lies clearly before the cell's far edge
            if (roundedValue > coords.get(i) + COORDINATE_TOLERANCE) {
                return i;
            }
        }

        return 0;
    }

    /**
     * Converts a Tabula Table into a List<List<String>> without any normalization.
     *
     * @param table the Tabula Table object
     * @return the table content as a List<List<String>>
     */
    public List<List<String>> tableToListOfListOfStrings(Table table) {
        List<List<String>> list = new ArrayList<>();

        // Walk through every row of the table
        for (List<RectangularTextContainer> row : table.getRows()) {
            List<String> rowList = new ArrayList<>();

            // Add each cell's trimmed text exactly once
            for (RectangularTextContainer tc : row) {
                String cellText = tc.getText() == null ? "" : tc.getText().trim();
                rowList.add(cellText);
            }

            list.add(rowList);
        }
        return list;
    }

    public static void main(String[] args) {
        // Replace this with the path to your own PDF file
        String pdfPath = "/Users/yuanzhenhui/Desktop/测试用合并单元格解析.pdf";

        File pdfFile = new File(pdfPath);
        if (!pdfFile.exists()) {
            System.err.println("Error: test file not found: " + pdfPath);
            System.err.println("Please replace the path in main with a PDF file on your machine.");
            return;
        }

        MergedCellPdfExtractor extractor = new MergedCellPdfExtractor();
        try {
            System.out.println("Parsing: " + pdfPath);
            List<List<List<String>>> tables = extractor.parseTables(pdfFile);

            System.out.println("Done, found " + tables.size() + " table(s).");
            System.out.println("========================================");

            int tableNum = 1;
            for (List<List<String>> table : tables) {
                System.out.println("\nTable " + (tableNum++) + ":");
                System.out.println("Rows: " + table.size() + ", columns: " + (table.isEmpty() ? 0 : table.get(0).size()));
                System.out.println("----------------------------------------");

                for (List<String> row : table) {
                    System.out.print("|");
                    for (String cell : row) {
                        // Flatten line breaks and truncate long text so the columns stay aligned
                        String cellText = cell.replace("\n", " ").replace("\r", " ");
                        if (cellText.length() > 15) {
                            cellText = cellText.substring(0, 12) + "...";
                        }
                        System.out.printf(" %-15s |", cellText);
                    }
                    System.out.println();
                }
                System.out.println("----------------------------------------");
            }

        } catch (IOException e) {
            System.err.println("Failed to parse the PDF: " + e.getMessage());
            e.printStackTrace();
        }
    }
}

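The code should be self-explanatory. Since this is only an experiment I did not package it carefully, so please make do with it as is. For more complex tables I would suggest dropping this fill-in approach altogether and going straight to an OCR API from one of the big vendors.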
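For tables that are only slightly noisy, one thing that might be worth trying first: the listing rounds coordinates to one decimal place but compares them with a tolerance of 2.0, so two edges that are 0.3 apart are still recorded as separate boundaries. Below is a minimal sketch of collapsing such near-duplicates before the grid is built. `BoundarySnap` and `snap` are hypothetical names, the numbers are invented, and this step is not part of the code above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper: merges boundaries that are closer together than a tolerance. */
class BoundarySnap {

    /** Returns the coordinates in ascending order, keeping one representative per cluster. */
    static List<Float> snap(Iterable<Float> coords, float tolerance) {
        TreeSet<Float> sorted = new TreeSet<>();
        for (Float c : coords) {
            sorted.add(c);
        }
        List<Float> kept = new ArrayList<>();
        for (float c : sorted) {
            // Keep a coordinate only if it is clearly beyond the last one that was kept.
            if (kept.isEmpty() || c - kept.get(kept.size() - 1) > tolerance) {
                kept.add(c);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Float> edges = Arrays.asList(100.0f, 100.3f, 120.1f, 119.8f, 140.0f);
        // prints [100.0, 119.8, 140.0]
        System.out.println(snap(edges, 2.0f));
    }
}
```

The first coordinate of each cluster is kept as its representative. Averaging the cluster would be another option, and which one works better probably depends on the PDF.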

Oh, one thing I forgot to mention. The Maven dependencies are as follows:

<dependency>
  <groupId>technology.tabula</groupId>
  <artifactId>tabula</artifactId>
  <version>1.0.5</version>
</dependency>
<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <version>2.0.35</version>
</dependency>

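That's it, the hole is more or less filled. Upcoming posts will go back to artificial intelligence and blockchain, and you are welcome to keep following my blog.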
