Detecting File Encoding in Java (with zip Support)


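Foreword:

At work I recently ran into the following problem: files are uploaded as a zip package, and the information in the files inside it has to be imported.

The zip is read with the ZipFile and ZipEntry classes from Apache's ant.jar. These classes cannot tell which encoding each file inside the zip uses, so Chinese text came out garbled when the files were parsed. After some research I solved this with the third-party tool cpDetector, and I am recording the approach here.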

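For more sophisticated file-encoding detection, the open-source project cpdetector can be used.

Website: cpdetector.sourceforge

The library is small, only about 500 KB. cpDetector is based on statistical principles, so its result is not guaranteed to be correct. The rest of this post uses the library to detect the encoding of text files.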
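Why detection can only ever be a best guess is easy to see with a small experiment: the same bytes are often valid in more than one encoding, so a detector has to pick the most likely reading. The sketch below uses only the JDK; the class name AmbiguousBytesDemo and its two helper methods are made up for illustration and are not part of cpdetector.

```java
import java.nio.charset.StandardCharsets;

final class AmbiguousBytesDemo {

    /** Decodes the bytes as UTF-8. */
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Decodes the same bytes as ISO-8859-1, where every byte is a valid character. */
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9};
        // UTF-8 reads the two bytes as one character, U+00E9.
        System.out.println(decodeUtf8(bytes).length());   // 1
        // ISO-8859-1 reads them as two characters, U+00C3 and U+00A9.
        System.out.println(decodeLatin1(bytes).length()); // 2
    }
}
```

Neither decoding fails, so nothing in the bytes alone says which one the author meant. A statistical detector weighs such candidates against each other, and short inputs give it little to work with.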

Prerequisites

Required jars: cpdetector_1.0.10.jar, antlr-2.7.4.jar, chardet-1.0.jar, jargs-1.0.jar

- Source: cpdetector_1.0.10_binary.zip

- Related material: wwwblogs/king1302217/p/4003060.html

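Implementation

The problem I hit while working this out: nearly every example I found online operates directly on a File object, and none of them handles zip files. On top of that, Apache's ZipFile and the ZipEntry objects inside it cannot be addressed by URL, so the URL-based approach cannot be used.

Looking at the underlying implementation, I found that the following call, which takes an input stream, can be used instead:

**charset = detector.detectCodepage(bis, Integer.MAX_VALUE);// the key line for zip detection**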

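Note:

The InputStream obtained directly from a ZipEntry does not support the mark() method, but cpdetector relies on that method internally. It turns out the library already has special handling for similar cases: if a stream does not support mark(), it is enough to wrap it in a BufferedInputStream, which does.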
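A minimal sketch of that wrapping step, using only JDK classes. It assumes java.util.zip rather than the ant ZipFile used in this post, and the names MarkSupportDemo, ensureMarkSupported, peek, zipOf and firstEntry are hypothetical helpers. The peek method imitates what a detector does with mark() and reset(): read ahead, then rewind.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class MarkSupportDemo {

    /** Returns the stream unchanged if it supports mark(), otherwise a buffered wrapper. */
    static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Reads up to n bytes ahead, then rewinds so the next read starts at the same position. */
    static byte[] peek(InputStream in, int n) {
        try {
            in.mark(n);
            byte[] buf = new byte[n];
            int total = 0;
            while (total < n) {
                int read = in.read(buf, total, n - total);
                if (read < 0) {
                    break; // end of stream reached before n bytes
                }
                total += read;
            }
            in.reset();
            return Arrays.copyOf(buf, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a one-entry zip archive in memory (sample data for this sketch). */
    static byte[] zipOf(String name, byte[] content) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(out)) {
            zos.putNextEntry(new ZipEntry(name));
            zos.write(content);
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Opens the archive and positions the stream at its first entry. */
    static InputStream firstEntry(byte[] zip) {
        try {
            ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zip));
            zin.getNextEntry();
            return zin;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream raw = firstEntry(zipOf("a.txt", "hello".getBytes(StandardCharsets.UTF_8)));
        System.out.println(raw.markSupported());                             // false
        InputStream in = ensureMarkSupported(raw);
        System.out.println(in.markSupported());                              // true
        System.out.println(new String(peek(in, 3), StandardCharsets.UTF_8)); // hel
        System.out.println(new String(peek(in, 5), StandardCharsets.UTF_8)); // hello
    }
}
```

Closing the BufferedInputStream also closes the stream it wraps, which matters when that stream is a ZipInputStream shared across several entries.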

The complete code is as follows:

import java.io.BufferedInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import info.monitorenter.cpdetector.io.ASCIIDetector;
import info.monitorenter.cpdetector.io.CodepageDetectorProxy;
import info.monitorenter.cpdetector.io.JChardetFacade;
import info.monitorenter.cpdetector.io.ParsingDetector;
import info.monitorenter.cpdetector.io.UnicodeDetector;

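/**
 * 1. cpDetector ships with several common detector implementations whose instances are
 *    registered through add(): ParsingDetector, JChardetFacade, ASCIIDetector and UnicodeDetector.
 * 2. The detector follows a "first non-null result wins" rule: whichever registered detector
 *    returns a non-null result first decides the outcome.
 * 3. cpDetector is based on statistical principles, so the result is not guaranteed to be correct.
 */
public class FileCharsetDetector {

    private static final Logger logger = LoggerFactory.getLogger(FileCharsetDetector.class);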

    /**
     * Detects a file's encoding with the third-party open-source library cpdetector.
     *
     * @param is
     *            the input stream to inspect; it is closed before this method returns
     * @return the name of the detected charset, or "GBK" if nothing was detected
     */
    public static String getFileEncode(InputStream is) {
        // begin: this part is the key to handling files inside a zip
        BufferedInputStream bis = null;
        if (is instanceof BufferedInputStream) {
            bis = (BufferedInputStream) is;
        } else {
            bis = new BufferedInputStream(is);
        }
        // end

        CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();
        detector.add(new ParsingDetector(false));
        detector.add(UnicodeDetector.getInstance());
        detector.add(JChardetFacade.getInstance());// uses classes from chardet.jar internally
        detector.add(ASCIIDetector.getInstance());

        Charset charset = null;
        try {
            charset = detector.detectCodepage(bis, Integer.MAX_VALUE);// the key line for zip detection
        } catch (Exception e) {
            logger.error(e.getMessage(), e);
        } finally {
            if (bis != null) {
                try {
                    bis.close();
                } catch (IOException e) {
                    logger.error(e.getMessage(), e);
                }
            }
        }

        // default to GBK
        String charsetName = "GBK";
        if (charset != null) {
            if (charset.name().equals("US-ASCII")) {
                charsetName = "ISO_8859_1";
            } else {
                charsetName = charset.name();
            }
        }
        return charsetName;
    }

    /**
     * Detects the encoding of a file on disk, passing its URL to cpdetector.
     */
    public static String getFileEncode(File file) {
        CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();
        detector.add(new ParsingDetector(false));
        detector.add(JChardetFacade.getInstance());
        detector.add(ASCIIDetector.getInstance());

        Charset charset = null;
        try {
            charset = detector.detectCodepage(file.toURI().toURL());
        } catch (Exception e) {
            logger.error(e.getMessage(), e);
        }

        // default to GBK
        String charsetName = "GBK";
        if (charset != null) {
            if (charset.name().equals("US-ASCII")) {
                charsetName = "ISO_8859_1";
            } else {
                charsetName = charset.name();
            }
        }
        return charsetName;
    }

}
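To show how the stream-based method fits into a zip import, here is a hedged sketch that walks an archive and decodes each entry with whatever charset name a detector reports. It uses java.util.zip instead of the ant classes, and ZipTextReader, readAll, readEntry, zipOf and the detector parameter are all hypothetical; in a real project the detector argument would presumably be FileCharsetDetector::getFileEncode. Because getFileEncode closes the stream it receives, each entry is buffered in memory first and the detector gets its own ByteArrayInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipTextReader {

    /** Returns entry name -> decoded text for every file entry in the archive. */
    static Map<String, String> readAll(InputStream zipData, Function<InputStream, String> detector) {
        Map<String, String> texts = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(zipData)) {
            for (ZipEntry entry = zin.getNextEntry(); entry != null; entry = zin.getNextEntry()) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Buffer the entry so detection and decoding can each read it from the start.
                byte[] bytes = readEntry(zin);
                String charsetName = detector.apply(new ByteArrayInputStream(bytes));
                texts.put(entry.getName(), new String(bytes, Charset.forName(charsetName)));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return texts;
    }

    /** Reads the current entry to its end without closing the archive stream. */
    private static byte[] readEntry(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        for (int n = in.read(buf); n >= 0; n = in.read(buf)) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Builds a one-entry zip archive in memory (sample data for this sketch). */
    static byte[] zipOf(String name, byte[] content) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(out)) {
            zos.putNextEntry(new ZipEntry(name));
            zos.write(content);
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] zip = zipOf("note.txt", "hello".getBytes(StandardCharsets.UTF_16LE));
        // Stub detector that always answers UTF-16LE; swap in the real detector in practice.
        Map<String, String> texts = readAll(new ByteArrayInputStream(zip), in -> "UTF-16LE");
        System.out.println(texts); // {note.txt=hello}
    }
}
```

Buffering whole entries keeps the sketch short but costs memory proportional to the largest file; for big archives, reopening the entry for a second pass may be the better trade-off.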
