Posted by: 冷日    Posted at: 2008/3/18 4:19:29

[Repost] Regular Expressions [Featured]
--------------------------------------------------------------------------------
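Part 1:
-----------------
Regular expressions (REs) are often mistakenly regarded as an arcane language that only a few people understand. On the surface they do look like a jumble, and if you do not know the syntax, RE code is just a pile of junk characters to your eyes. In fact, regular expressions are quite simple and entirely understandable. After reading this article you will be familiar with their general syntax.
Support across platforms
Regular expressions were first proposed by the mathematician Stephen Kleene in 1956, growing out of his research on the formal description of languages. Regular expressions with a complete syntax were used for matching patterns of characters and were later applied in information technology. Since then they have gone through several stages of development, and the current standard has been ratified by ISO (the International Organization for Standardization) and adopted by the Open Group.
Regular expressions are not a dedicated language in their own right; they are a standard notation for finding and replacing text in a file or string. The standard defines two flavors: basic regular expressions (BRE) and extended regular expressions (ERE). ERE includes the functionality of BRE plus some additional concepts.
Many programs use regular expressions, including xsh, egrep, sed, vi, and other programs on UNIX platforms. They have also been adopted by many languages, such as HTML and XML, usually only as a subset of the full standard.
More common than you think
As regular expressions have been ported to cross-platform programming languages, their functionality has become more complete and their use more widespread. Internet search engines use them, e-mail programs use them, and even if you are not a UNIX programmer you can use regular expressions to simplify your programs and shorten your development time.
Regular expressions 101
Much regular-expression syntax will look familiar even if you have never studied it. Wildcards, for example, are one kind of RE construct: a repetition operator. Let's start with the most common basic syntax elements of the ERE standard. To give examples with a concrete purpose, I will use several different programs.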
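Part 2:
----------------------
Character matching
The key to a regular expression is specifying what you want to search for and match; without that, REs would be useless.
Every expression contains instructions describing what to look for, as shown in Table A.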
Table A: Character-matching regular expressions
Format of each entry:
---------------
Operator:
Explanation:
Example:
Result:
----------------
.
Match any one character
grep '.ord' sample.txt
Will match "ford", "lord", "2ord", etc. in the file sample.txt.
-----------------
[ ]
Match any one character listed between the brackets
grep '[cng]ord' sample.txt
Will match only "cord", "nord", and "gord"
---------------------
[^ ]
Match any one character not listed between the brackets
grep '[^cn]ord' sample.txt
Will match "lord", "2ord", etc. but not "cord" or "nord"
---------------------
[a-z]
Inside the brackets, a hyphen denotes a range of characters
grep '[a-zA-Z]ord' sample.txt
Will match "aord", "bord", "Aord", "Bord", etc.
grep '[^0-9]ord' sample.txt
Will match "Aord", "aord", etc. but not "2ord", etc.
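The rows of Table A can be tried out in Java as well. The following is a minimal sketch, not part of the original article: the class name CharacterMatchDemo, the helper grep, and the sample words are made up for illustration. The helper keeps every word that contains a match, which is roughly what grep does line by line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class CharacterMatchDemo {
    // Hypothetical helper: returns the words that contain a match for the regex,
    // similar to how grep prints the lines that contain a match.
    static List<String> grep(String regex, String... words) {
        Pattern p = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String w : words) {
            if (p.matcher(w).find()) {
                hits.add(w);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] sample = {"ford", "cord", "nord", "2ord", "Aord", "ord"};
        System.out.println(grep(".ord", sample));      // any one character before "ord"
        System.out.println(grep("[cng]ord", sample));  // c, n, or g before "ord"
        System.out.println(grep("[^cn]ord", sample));  // anything except c or n
        System.out.println(grep("[^0-9]ord", sample)); // anything except a digit
    }
}
```

Note that the bare word "ord" is never reported, because each of these patterns requires one character in front of it.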
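Repetition operators
Repetition operators, or quantifiers, describe how many times a particular element should be matched. They are often combined with the character-matching syntax to match runs of several characters; see Table B.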
Table B: Regular expression repetition operators
Format of each entry (the results assume that sample.txt holds one word per line):
---------------
Operator:
Explanation:
Example:
Result:
----------------
?
Match the preceding element zero or one time
egrep '.?erd' sample.txt
Will match "berd", "herd", etc. and "erd"
------------------
*
Match the preceding element zero or more times
egrep 'n.*rd' sample.txt
Will match "nerd", "nrd", "neard", etc.
-------------------
+
Match the preceding element one or more times
egrep '[n]+erd' sample.txt
Will match "nerd", "nnerd", etc., but not "erd"
--------------------
{n}
Match the preceding element exactly n times
egrep -x '[a-z]{2}erd' sample.txt
Will match "cherd", "blerd", etc. but not "nerd", "erd", "buzzerd", etc. (-x matches whole lines only; without it "buzzerd" would match because it contains "zzerd")
------------------------
{n,}
Match the preceding element at least n times
egrep -x '.{2,}erd' sample.txt
Will match "cherd" and "buzzerd", but not "nerd"
------------------------
{n,N}
Match the preceding element at least n times, but not more than N times
egrep 'n[e]{1,2}rd' sample.txt
Will match "nerd" and "neerd"
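The quantifiers in Table B behave the same way in java.util.regex. The sketch below is an illustration only (the class name RepetitionDemo, the helper whole, and the word list are assumptions). It uses Pattern.matches, which requires the whole word to match, so the results do not depend on what else might surround the word on a line.

```java
import java.util.regex.Pattern;

public class RepetitionDemo {
    // Hypothetical helper: true only if the entire word matches the regex.
    static boolean whole(String regex, String word) {
        return Pattern.matches(regex, word);
    }

    public static void main(String[] args) {
        String[] regexes = {".?erd", "n.*rd", "[n]+erd", "[a-z]{2}erd", ".{2,}erd", "n[e]{1,2}rd"};
        String[] words = {"erd", "nerd", "nnerd", "neerd", "nrd", "cherd", "buzzerd"};
        for (String regex : regexes) {
            StringBuilder line = new StringBuilder(regex).append(" ->");
            for (String word : words) {
                if (whole(regex, word)) {
                    line.append(' ').append(word);
                }
            }
            System.out.println(line);
        }
    }
}
```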
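Part 3:
----------------
Anchors
Anchors specify where in the text a pattern must match, as shown in Table C. They make it easy to find common combinations of characters. For the examples I use the vi line-editor command :s, which stands for substitute. The basic syntax of this command is:
s/pattern_to_match/pattern_to_substitute/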
Table C: Regular expression anchors
-------------
Operator
Explanation
Example
Result
---------------
^
Match at the beginning of a line
s/^/blah /
Inserts "blah " at the beginning of the line
---------------
$
Match at the end of a line
s/$/ blah/
Inserts " blah" at the end of the line
---------------
\<
Match at the beginning of a word
s/\</blah/
Inserts "blah" at the beginning of the word
egrep '\<blah' sample.txt
Matches "blahfield", etc.
------------------
\>
Match at the end of a word
s/\>/blah/
Inserts "blah" at the end of the word
egrep 'blah\>' sample.txt
Matches "soupblah", etc.
---------------
\b
Match at the beginning or end of a word
egrep '\bblah' sample.txt
Matches "blahcake"; egrep 'blah\b' sample.txt matches "countblah"
-----------------
\B
Match at a position that is not a word boundary
egrep '\Bblah' sample.txt
Matches "sublahper", etc.
-----------------
Note: \<, \>, \b, and \B are extensions offered by tools such as GNU grep; they are not part of the POSIX ERE standard.
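For comparison, here is a small Java sketch of the same anchors. It is an illustration under stated assumptions: the class name AnchorDemo and the helper found are invented, and only ^, $, \b, and \B are shown because java.util.regex has no \< or \> word anchors.

```java
import java.util.regex.Pattern;

public class AnchorDemo {
    // Hypothetical helper: true if the regex matches anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Counterparts of s/^/blah / and s/$/ blah/ in vi
        System.out.println("some text".replaceFirst("^", "blah "));
        System.out.println("some text".replaceFirst("$", " blah"));

        // \b matches at either edge of a word
        System.out.println(found("\\bblah", "blahcake"));   // "blah" starts the word
        System.out.println(found("\\bblah", "countblah"));  // "blah" does not start the word
        System.out.println(found("blah\\b", "countblah"));  // "blah" ends the word

        // \B matches only where there is no word boundary
        System.out.println(found("\\Bblah", "sublahper"));
        System.out.println(found("\\Bblah", "blahcake"));
    }
}
```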
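Alternation
Another convenience in REs is the alternation (or infix) operator, written with the | symbol. It is effectively an OR. The following command returns the lines in sample.txt that contain "nerd" or "merd":
egrep '(n|m)erd' sample.txt
Alternation is very powerful, especially when you are looking for different spellings of a word, but in this case the following example gives the same result:
egrep '[nm]erd' sample.txt
The real value of alternation shows when you combine it with the more advanced features of REs.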
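A short Java sketch may make the difference concrete. The class name AlternationDemo, the helper whole, and the color/colour example are assumptions added for illustration. For single characters, alternation and a bracket expression agree; for alternatives of different lengths, only alternation works.

```java
import java.util.regex.Pattern;

public class AlternationDemo {
    // Hypothetical helper: true only if the entire word matches the regex.
    static boolean whole(String regex, String word) {
        return Pattern.matches(regex, word);
    }

    public static void main(String[] args) {
        // Single-character alternatives: (n|m) and [nm] give the same answers
        for (String word : new String[] {"nerd", "merd", "herd"}) {
            System.out.println(word + ": " + whole("(n|m)erd", word) + " / " + whole("[nm]erd", word));
        }
        // Alternatives of different lengths cannot be written as a bracket expression
        for (String word : new String[] {"color", "colour", "colr"}) {
            System.out.println(word + ": " + whole("(color|colour)", word));
        }
    }
}
```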
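Part 4:
----------------
Reserved characters
The last important feature of REs is reserved characters (also called special characters). For example, suppose you want to find the literal strings "ne*rd" and "ni*rd". The pattern "n[ei]*rd" matches "neeeeerd" and "nieieierd", which is not what you are looking for. Because '*' (asterisk) is a reserved character, you must escape it with a backslash: "n[ei]\*rd". The reserved characters include:
^ (caret)
. (period)
[ (left bracket)
$ (dollar sign)
( (left parenthesis)
) (right parenthesis)
| (pipe)
* (asterisk)
+ (plus symbol)
? (question mark)
{ (left curly bracket, or left brace)
\ (backslash)
Once you include these characters in your search strings, REs undoubtedly become very hard to read. The following PHP eregi call, for example, is hard to read:
eregi("^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*$",$sendto)
As you can see, the intent is hard to grasp. But if you skip over the reserved characters, you will often misread what the code means.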
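The escaping rule can be checked in Java, where the backslash itself must be doubled inside a string literal. This is a sketch only: the class name EscapeDemo and the helper looksLikeEmail are made up, and Pattern.CASE_INSENSITIVE is used here as a stand-in for the case-insensitive matching that eregi performs.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // The pattern from the eregi example, written as a Java string literal.
    static final Pattern EMAIL = Pattern.compile(
        "^[_a-z0-9-]+(\\.[_a-z0-9-]+)*@[a-z0-9-]+(\\.[a-z0-9-]+)*$",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true if the text matches the e-mail pattern.
    static boolean looksLikeEmail(String text) {
        return EMAIL.matcher(text).find();
    }

    public static void main(String[] args) {
        // Unescaped *: repetition of [ei], so "neeeeerd" matches
        System.out.println(Pattern.matches("n[ei]*rd", "neeeeerd"));
        // Escaped *: a literal asterisk, so only strings like "ne*rd" match
        System.out.println(Pattern.matches("n[ei]\\*rd", "ne*rd"));
        System.out.println(Pattern.matches("n[ei]\\*rd", "neeeeerd"));
        // Pattern.quote escapes a whole string so it is matched literally
        System.out.println(Pattern.matches(Pattern.quote("ne*rd"), "ne*rd"));

        System.out.println(looksLikeEmail("Some.User@example.com"));
        System.out.println(looksLikeEmail("not an address"));
    }
}
```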
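Summary
In this article we lifted the veil of mystery from regular expressions and listed the common syntax of the ERE standard. For the Open Group's complete specification, see: Regular Expressions. You are welcome to post your questions or comments in the discussion area there.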

----------------------------------------
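Regular Expressions and the Java Programming Language
-----------------------------------------
Classes and methods
The following classes match character sequences against patterns specified by regular expressions.
The Pattern class
An instance of the Pattern class represents a regular expression that is specified in string form, in a syntax similar to that used by Perl.
A regular expression specified as a string must first be compiled into an instance of the Pattern class. The resulting pattern is used to create a Matcher object that matches arbitrary character sequences against the regular expression. Many matchers can share the same pattern because it is stateless.
The compile method compiles the given regular expression into a pattern, and the matcher method then creates a matcher that will match the given input against this pattern. The pattern method returns the regular expression from which the pattern was compiled.
The split method is a convenience method that splits the given input sequence around matches of this pattern. The following example demonstrates it: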

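/** Uses split to break up an input string delimited by commas, slashes, and/or whitespace. */
import java.util.regex.*;
public class Splitter {
	public static void main(String[] args) throws Exception {
		// Create a pattern that matches one or more delimiter characters
		Pattern p = Pattern.compile("[,/\\s]+");
		// Split the input around the delimiters
		String[] result = p.split("one,two, three four , five/six");
		for (int i = 0; i < result.length; i++) {
			System.out.println(result[i]);
		}
	}
}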

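The Matcher class
An instance of the Matcher class matches character sequences against a given pattern. Input is supplied to the matcher through the CharSequence interface, so that matching against characters from a wide variety of input sources is supported.
A matcher is created from a pattern by invoking the pattern's matcher method. Once created, a matcher can be used to perform three different kinds of match operations:
The matches method attempts to match the entire input sequence against the pattern.
The lookingAt method attempts to match the input sequence, starting at the beginning, against the pattern.
The find method scans the input sequence looking for the next subsequence that matches the pattern.
Each of these methods returns a boolean indicating success or failure. If the match succeeds, more information can be obtained by querying the state of the matcher.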
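The difference between the three operations is easiest to see side by side. The following sketch is illustrative: the class name MatchKindsDemo, its helper methods, and the sample inputs are assumptions. Each helper creates a fresh Matcher so that one call cannot influence the next.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchKindsDemo {
    private static final Pattern CAT = Pattern.compile("cat");

    // True only if the entire input is "cat".
    static boolean matchesAll(String input) {
        return CAT.matcher(input).matches();
    }

    // True if the input starts with "cat"; the rest may be anything.
    static boolean matchesStart(String input) {
        return CAT.matcher(input).lookingAt();
    }

    // Start index of the first "cat" anywhere in the input, or -1 if none.
    static int findFirst(String input) {
        Matcher m = CAT.matcher(input);
        return m.find() ? m.start() : -1; // start() queries the matcher's state
    }

    public static void main(String[] args) {
        for (String input : new String[] {"cat", "catalog", "one cat"}) {
            System.out.println(input
                + ": matches=" + matchesAll(input)
                + ", lookingAt=" + matchesStart(input)
                + ", find index=" + findFirst(input));
        }
    }
}
```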
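This class also defines methods for replacing matched subsequences with new strings whose contents can, if desired, be computed from the match result.
The appendReplacement method first appends all characters of the input from the current position up to the next match, and then appends the replacement value. The appendTail method appends the part of the input that follows the last match, through to the end.
For example, in the string blahcatblahcatblah, the first appendReplacement appends blahdog. The second appendReplacement appends blahdog, and then appendTail appends blah, producing: blahdogblahdogblah. See the example Simple word replacement.
The CharSequence interface
The CharSequence interface provides uniform, read-only access to many different kinds of character sequences. You supply the data to be searched from whatever source you have. String, StringBuffer, and CharBuffer implement CharSequence, so it is easy to get the data to be searched from them. If none of the available sources is suitable, you can write your own input source by implementing the CharSequence interface.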
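As a rough illustration of that uniform access, the sketch below runs one pattern over three different CharSequence implementations. The class name CharSequenceDemo and the helper countMatches are invented for this example.

```java
import java.nio.CharBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharSequenceDemo {
    // Hypothetical helper: counts the matches in any kind of character sequence.
    static int countMatches(Pattern p, CharSequence input) {
        Matcher m = p.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("cat");
        String text = "one cat, two cats in the yard";

        // The same method accepts a String, a StringBuffer, and a CharBuffer
        System.out.println(countMatches(p, text));
        System.out.println(countMatches(p, new StringBuffer(text)));
        System.out.println(countMatches(p, CharBuffer.wrap(text)));
    }
}
```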
Regex scenario examples
The following code samples demonstrate the use of the java.util.regex package in various common scenarios:
Simple word replacement

/** Replaces every occurrence of "cat" in the input with "dog". */
import java.util.regex.*;
public class Replacement {
	public static void main(String[] args) {
		// Create a pattern to match cat
		Pattern p = Pattern.compile("cat");
		// Create a matcher with an input string
		Matcher m = p.matcher("one cat, two cats in the yard");
		StringBuffer sb = new StringBuffer();
		boolean b = m.find();
		// Loop through and create a new String with the replacements
		while (b) {
			m.appendReplacement(sb, "dog");
			b = m.find();
		}
		// Add the last segment of input to the new String
		m.appendTail(sb);
		System.out.println(sb.toString());
	}
}

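E-mail validation
The following code is an example of how you might check whether some characters form an e-mail address. It is not a complete e-mail validation program that covers every possible case, but you can add to it as needed.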

/** Checks for invalid characters in e-mail addresses. */
import java.util.regex.*;
public class EmailValidation {
	public static void main(String[] args) throws Exception {
		String input = "@sun.com";
		// Checks for email addresses starting with
		// inappropriate symbols like dots or @ signs.
		Pattern p = Pattern.compile("^\\.|^\\@");
		Matcher m = p.matcher(input);
		if (m.find())
			System.err.println("Email addresses don't start with dots or @ signs.");
		// Checks for email addresses that start with
		// www. and prints a message if it does.
		p = Pattern.compile("^www\\.");
		m = p.matcher(input);
		if (m.find()) {
			System.out.println("Email addresses don't start with \"www.\", only web pages do.");
		}
		// Matches any run of characters that are not allowed in an address
		p = Pattern.compile("[^A-Za-z0-9\\.\\@_\\-~#]+");
		m = p.matcher(input);
		StringBuffer sb = new StringBuffer();
		boolean result = m.find();
		boolean deletedIllegalChars = false;
		while (result) {
			deletedIllegalChars = true;
			m.appendReplacement(sb, "");
			result = m.find();
		}
		// Add the last segment of input to the new String
		m.appendTail(sb);
		input = sb.toString();
		if (deletedIllegalChars) {
			System.out.println("It contained incorrect characters, such as spaces or commas.");
		}
	}
}

Removing control characters from a file

/* This class removes control characters from a named file. */
import java.util.regex.*;
import java.io.*;
public class Control {
	public static void main(String[] args) throws Exception {
		// Create file objects for the input and output file names
		File fin = new File("fileName1");
		File fout = new File("fileName2");
		// Open an input and an output stream
		FileInputStream fis = new FileInputStream(fin);
		FileOutputStream fos = new FileOutputStream(fout);
		BufferedReader in = new BufferedReader(new InputStreamReader(fis));
		BufferedWriter out = new BufferedWriter(new OutputStreamWriter(fos));
		// The pattern matches control characters
		// (\p{Cntrl} is the POSIX control-character class)
		Pattern p = Pattern.compile("\\p{Cntrl}");
		Matcher m = p.matcher("");
		String aLine = null;
		while ((aLine = in.readLine()) != null) {
			m.reset(aLine);
			// Replaces control characters with an empty string
			String result = m.replaceAll("");
			out.write(result);
			out.newLine();
		}
		in.close();
		out.close();
	}
}

Searching a file

/** Prints out the comments found in a .java file. */
import java.util.regex.*;
import java.io.*;
import java.nio.*;
import java.nio.charset.*;
import java.nio.channels.*;
public class CharBufferExample {
	public static void main(String[] args) throws Exception {
		// Create a pattern to match comments
		Pattern p = Pattern.compile("//.*$", Pattern.MULTILINE);
		// Get a Channel for the source file
		File f = new File("Replacement.java");
		FileInputStream fis = new FileInputStream(f);
		FileChannel fc = fis.getChannel();
		// Get a CharBuffer from the source file by mapping it read-only
		ByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
		Charset cs = Charset.forName("ISO-8859-1");
		CharsetDecoder cd = cs.newDecoder();
		CharBuffer cb = cd.decode(bb);
		// Run some matches
		Matcher m = p.matcher(cb);
		while (m.find())
			System.out.println("Found comment: " + m.group());
		fis.close();
	}
}

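Conclusion
Pattern matching in the Java programming language is now as flexible as in many other programming languages. Regular expressions can be used in applications to make sure data is correctly formatted before it is entered into a database or sent on to another part of the application, and they can also be used for a wide variety of administrative tasks. In short, you can use regular expressions anywhere in Java programming that calls for pattern matching.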

--------------------------------------------------------------------------------
[Repost] Regular Expressions in JDK 1.4, written by william chen (06/19/2002)
--------------------------------------------------------------------------------
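What are regular expressions?
They are a special notation for searching and replacing within files and strings. Because many system settings on Unix are stored in text files, administrators and programmers often need to search and replace, and so a special notation called regular expressions was developed.
For example, the simple expression "s/</lt;/g" converts every "<" character in a string to "lt;". JDK 1.4 provides a regular-expression package for everyone to use; for versions earlier than JDK 1.4, a package with similar functionality is available at http://jakarta.apache.org/oro.
The string of symbols just shown, "s/</lt;/g", is regular-expression syntax, so first get to know the syntax.
Regular-expression syntax for j2sdk1.4
In the tables below, the matched strings listed are the non-empty matches; a pattern ending in * can also match the empty string.
"." stands for any single character
Regex           Source string   Matched string
.               ab              a
..              abc             ab
"+" stands for one or more of the preceding element
"*" stands for zero or more of the preceding element
Regex           Source string   Matched string
.+              ab              ab
.*              abc             abc
"( )" grouping
Regex           Source string   Matched string
(ab)*           aabab           abab
Character classes
Regex           Source string   Matched string
[a-dA-D0-9]*    abczA0          abc, A0
[^a-d]*         abe0            e0
[a-d]*          abcdefgh        abcd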
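The matched strings for the grouping and character-class rows can be reproduced with Matcher.find. This is a sketch under stated assumptions: the class name FindAllDemo and the helper nonEmptyMatches are made up, and empty matches (which patterns ending in * also produce) are deliberately skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindAllDemo {
    // Hypothetical helper: collects every non-empty match of the regex in the input.
    static List<String> nonEmptyMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        while (m.find()) {
            if (!m.group().isEmpty()) { // skip the zero-length matches that * allows
                matches.add(m.group());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(nonEmptyMatches("(ab)*", "aabab"));
        System.out.println(nonEmptyMatches("[a-dA-D0-9]*", "abczA0"));
        System.out.println(nonEmptyMatches("[^a-d]*", "abe0"));
        System.out.println(nonEmptyMatches("[a-d]*", "abcdefgh"));
    }
}
```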
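Shorthand classes
\d is equivalent to [0-9]: a digit
\D is equivalent to [^0-9]: a non-digit
\s is equivalent to [ \t\n\x0B\f\r]: a whitespace character
\S is equivalent to [^ \t\n\x0B\f\r]: a non-whitespace character
\w is equivalent to [a-zA-Z_0-9]: a word character (letter, digit, or underscore)
\W is equivalent to [^a-zA-Z_0-9]: a non-word character
Beginning and end of a line
^ matches at the beginning of a line
$ matches at the end of a line
In Java, ^ and $ match only at the beginning and end of the whole input by default; compile the pattern with Pattern.MULTILINE to make them match at each line.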
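A brief Java sketch of the shorthand classes and of line anchoring follows. The class name ShorthandDemo, the helper leadingWords, and the sample strings are assumptions for illustration. It also shows that in java.util.regex the ^ anchor applies per line only when Pattern.MULTILINE is set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShorthandDemo {
    // Hypothetical helper: the word at the start of the input, or at the
    // start of every line when multiline is true.
    static List<String> leadingWords(String text, boolean multiline) {
        Pattern p = Pattern.compile("^\\w+", multiline ? Pattern.MULTILINE : 0);
        Matcher m = p.matcher(text);
        List<String> words = new ArrayList<>();
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // \D removes everything that is not a digit
        System.out.println("a1b22c333".replaceAll("\\D", ""));
        // \s+ splits on any run of whitespace
        System.out.println(Arrays.toString("one  two\tthree".split("\\s+")));

        String text = "first line\nsecond line";
        System.out.println(leadingWords(text, false)); // ^ matches at the start of the input only
        System.out.println(leadingWords(text, true));  // ^ matches at the start of each line
    }
}
```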
--------------------------------------------------------------------------------
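Classes in java.util.regex
Pattern: the compiled representation of a regular expression
Matcher: the engine that matches a Pattern against a character sequence and holds the match results
PatternSyntaxException: the exception thrown when a regular expression fails to compile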
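PatternSyntaxException is an unchecked exception, so it is only caught if you ask for it. The sketch below shows one way to test a pattern before using it; the class name SyntaxCheckDemo and the helper isValidRegex are made up for this example.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class SyntaxCheckDemo {
    // Hypothetical helper: true if the string compiles as a regular expression.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String regex : new String[] {"(ab)*", "(ab", "[a-d", "+"}) {
            try {
                Pattern.compile(regex);
                System.out.println(regex + " compiles");
            } catch (PatternSyntaxException e) {
                // getDescription and getIndex say what went wrong and where
                System.out.println(regex + " is invalid: " + e.getDescription()
                    + " (index " + e.getIndex() + ")");
            }
        }
    }
}
```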
Example 1: Replace every "<" character in a string with "lt;"
import java.io.*;
import java.util.regex.*;

public class Replace01 {
	/**
	 * Replaces every "<" character in each input line with "lt;".
	 */
	public static void replace01() {
		// BufferedReader lets us read line-by-line
		Reader r = new InputStreamReader(System.in);
		BufferedReader br = new BufferedReader(r);
		Pattern pattern = Pattern.compile("<"); // matches every '<' character
		try {
			while (true) {
				String line = br.readLine();
				// Null line means input is exhausted
				if (line == null)
					break;
				Matcher a = pattern.matcher(line);
				while (a.find()) {
					System.out.println("Found character: " + a.group());
				}
				// replaceAll resets the matcher, then replaces every match with lt;
				System.out.println(a.replaceAll("lt;"));
			}
		} catch (IOException ex) {
			ex.printStackTrace();
		}
	}

	public static void main(String[] args) {
		replace01();
	}
}
Example 2:
import java.io.*;
import java.util.regex.*;

public class Search01 {
	/**
	 * Works like StringTokenizer:
	 * splits each line on "," and reports which token is the longest.
	 */
	public static void search01() {
		// BufferedReader lets us read line-by-line
		Reader r = new InputStreamReader(System.in);
		BufferedReader br = new BufferedReader(r);
		Pattern pattern = Pattern.compile(",\\s*"); // a comma followed by optional whitespace
		try {
			while (true) {
				String line = br.readLine();
				// Null line means input is exhausted; check before splitting
				if (line == null)
					break;
				String[] words = pattern.split(line);
				// -1 means we haven't found a word yet
				int longest = -1;
				int longestLength = 0;
				for (int i = 0; i < words.length; ++i) {
					System.out.println("Token: " + words[i]);
					if (words[i].length() > longestLength) {
						longest = i;
						longestLength = words[i].length();
					}
				}
				// Skip lines that contain no non-empty token
				if (longest >= 0)
					System.out.println("Longest token: " + words[longest]);
			}
		} catch (IOException ex) {
			ex.printStackTrace();
		}
	}

	public static void main(String[] args) {
		search01();
	}
}
--------------------------------------------------------------------------------
Other regular-expression syntax
/^\s* # ignore whitespace at the start of each line
(M(s|r|rs)\.) # matches Ms., Mrs., and Mr. (titles)
--------------------------------------------------------------------------------
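The fragment above is written in a commented style where whitespace and # comments inside the pattern are ignored. Java offers a comparable mode through the Pattern.COMMENTS flag. The following is a sketch only, covering just the two lines shown: the class name CommentedPatternDemo and the helper title are assumptions, and the rest of the original expression is not reconstructed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentedPatternDemo {
    // With Pattern.COMMENTS, whitespace in the pattern is ignored and
    // everything from # to the end of the line is treated as a comment.
    static final Pattern TITLE = Pattern.compile(
          "^\\s*            # ignore whitespace at the start of the input\n"
        + "(M(s|r|rs)\\.)   # match Ms., Mr., or Mrs. (titles)\n",
        Pattern.COMMENTS);

    // Hypothetical helper: the title at the start of the text, or null if there is none.
    static String title(String text) {
        Matcher m = TITLE.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"  Mrs. Smith", "Mr. Brown", "Ms. Lee", "Dr. Jones"}) {
            System.out.println(name + " -> " + title(name));
        }
    }
}
```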
