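Data that already exists in a row-and-column format, or that can easily be converted into rows and columns so that it fits neatly into a database, is called structured data. Examples include CSV, TXT, and XLS files. These files use a delimiter and have either fixed or variable width, with missing values represented as blanks between delimiters. Sometimes, however, we get data whose lines are not fixed width, or data that is simply an HTML, image, or PDF file. Such data is called unstructured data. An HTML file can be handled by processing its HTML tags, but a Twitter feed or a plain-text document from a news feed may have neither delimiters nor tags to process. In such cases we use different built-in functions from various Python libraries to process the file.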
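Reading Data
In the example below, we take a text file and read it, separating out each line of the file. The output can then be split further into more lines and words. The source file is a text file containing a few paragraphs that describe the Python language.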
filename = r'path\input.txt'  # raw string so the backslash is not read as an escape
with open(filename) as fn:
    # Read the file one line at a time, numbering lines from 1
    for lncnt, ln in enumerate(fn, start=1):
        print("Line {}: {}".format(lncnt, ln.strip()))
When the above code is executed, it produces the following result.
Line 1: Python is an interpreted high-level programming language for general-purpose programming. Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace. It provides constructs that enable clear programming on both small and large scales.
Line 2: Python features a dynamic type system and automatic memory management. It supports multiple programming paradigms, including object-oriented, imperative, functional and procedural, and has a large and comprehensive standard library.
Line 3: Python interpreters are available for many operating systems. CPython, the reference implementation of Python, is open source software and has a community-based development model, as do nearly all of its variant implementations. CPython is managed by the non-profit Python Software Foundation.
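The text above mentions that each line can be split further into words. A minimal sketch of that step follows; it assumes whitespace-separated words and uses an in-memory `io.StringIO` object with made-up sample text in place of the real input file. The helper name `words_per_line` is hypothetical and not part of the original example.

```python
import io

# Hypothetical sample text standing in for the input file (assumption).
sample = io.StringIO(
    "Python is an interpreted high-level programming language.\n"
    "Python features a dynamic type system.\n"
)

def words_per_line(lines):
    """Return a list of word lists, one per non-empty line."""
    result = []
    for line in lines:
        stripped = line.strip()
        if stripped:
            # str.split() with no arguments splits on any run of whitespace
            result.append(stripped.split())
    return result

tokens = words_per_line(sample)
for lineno, words in enumerate(tokens, start=1):
    print("Line {}: {} words".format(lineno, len(words)))
```

Because a file object opened with `open()` is also an iterable of lines, the same helper should work if it is passed the file handle instead of the `StringIO` object.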
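Counting Word Frequency
The Counter class from the collections module can be used to count how often each word occurs in the file, as shown below.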
from collections import Counter
with open(r'pathinput2.txt') as f:
    # Split the whole file on whitespace and count each resulting word
    p = Counter(f.read().split())
    print(p)
When we execute the above code, it produces the following result.
Counter({'and': 3, 'Python': 3, 'that': 2, 'a': 2, 'programming': 2, 'code': 1, '1991,': 1, 'is': 1, 'programming.': 1, 'dynamic': 1, 'an': 1, 'design': 1, 'in': 1, 'high-level': 1, 'management.': 1, 'features': 1, 'readability,': 1, 'van': 1, 'both': 1, 'for': 1, 'Rossum': 1, 'system': 1, 'provides': 1, 'memory': 1, 'has': 1, 'type': 1, 'enable': 1, 'Created': 1, 'philosophy': 1, 'constructs': 1, 'emphasizes': 1, 'general-purpose': 1, 'notably': 1, 'released': 1, 'significant': 1, 'Guido': 1, 'using': 1, 'interpreted': 1, 'by': 1, 'on': 1, 'language': 1, 'whitespace.': 1, 'clear': 1, 'It': 1, 'large': 1, 'small': 1, 'automatic': 1, 'scales.': 1, 'first': 1})
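Note that in the output above, 'programming' and 'programming.' are counted as different words, because splitting on whitespace leaves punctuation attached. One possible way to merge such variants, shown here only as a sketch, is to lowercase each word and strip surrounding punctuation before counting. The function name `normalized_counts` and the sample sentence are assumptions made for illustration.

```python
import string
from collections import Counter

def normalized_counts(text):
    """Count words after lowercasing and stripping surrounding punctuation."""
    # str.strip(string.punctuation) removes punctuation only at the ends,
    # so hyphenated words such as "high-level" stay intact.
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return Counter(w for w in words if w)

# Hypothetical sample text (assumption), not the original input file.
sample = ("Python is used for programming. "
          "Programming in Python is fun, and Python is readable.")
counts = normalized_counts(sample)
print(counts.most_common(3))
```

Whether lowercasing is appropriate depends on the task; for example, it would merge a proper noun with a common word that is spelled the same way.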