基于python语言的大数据搜索引擎( 二 ) _python

主要分割
主要分割使用空格来分词，实际的分词逻辑中，还会有其它的分隔符。例如Splunk的缺省分割符包括以下这些，用户也可以定义自己的分割符。
] < > ( ) { } | ! ; , ' " * s & ? + %21 %26 %2526 %3B %7C %20 %2B %3D -- %2520 %5D %5B %3A %0A %2C %28 %29
def minor_segments(s): """ Perform minor segmenting on a string. This is like major segmenting, except it also captures from the start of the input to each break. """ minor_breaks = '_.' last = -1 results = set() for idx, ch in enumerate(s): if ch in minor_breaks: segment = s[last+1:idx] results.add(segment) segment = s[:idx] results.add(segment) last = idx segment = s[last+1:] results.add(segment) results.add(s) return results次要分割
次要分割和主要分割的逻辑类似，只是还会把从开始部分到当前分割的结果加入。例如“1.2.3.4”的次要分割会有1，2，3，4，1.2，1.2.3
def segments(event): """Simple wrapper around major_segments / minor_segments""" results = set() for major in major_segments(event): for minor in minor_segments(major): results.add(minor) return results分词的逻辑就是对文本先进行主要分割，对每一个主要分割在进行次要分割。然后把所有分出来的词返回。
我们看看这段 code是如何运行的：
for term in segments('src_ip = 1.2.3.4'): print termsrc1.21.2.3.4src_ip311.2.3ip2=4搜索
好了，有个分词和布隆过滤器这两个利器的支撑后，我们就可以来实现搜索的功能了。
上代码：
class Splunk(object): def __init__(self): self.bf = Bloomfilter(64) self.terms = {} # Dictionary of term to set of events self.events = [] def add_event(self, event): """Adds an event to this object""" # Generate a unique ID for the event, and save it event_id = len(self.events) self.events.append(event) # Add each term to the bloomfilter, and track the event by each term for term in segments(event): self.bf.add_value(term) if term not in self.terms: self.terms[term] = set() self.terms[term].add(event_id) def search(self, term): """Search for a single term, and yield all the events that contain it""" # In Splunk this runs in O(1), and is likely to be in filesystem cache (memory) if not self.bf.might_contain(term): return # In Splunk this probably runs in O(log N) where N is the number of terms in the tsidx if term not in self.terms: return for event_id in sorted(self.terms[term]): yield self.events[event_id]

Splunk代表一个拥有搜索功能的索引集合
每一个集合中包含一个布隆过滤器，一个倒排词表（字典），和一个存储所有事件的数组
当一个事件被加入到索引的时候，会做以下的逻辑
为每一个事件生成一个unqie id，这里就是序号
对事件进行分词，把每一个词加入到倒排词表，也就是每一个词对应的事件的id的映射结构，注意，一个词可能对应多个事件，所以倒排表的的值是一个Set 。倒排表是绝大部分搜索引擎的核心功能。
当一个词被搜索的时候，会做以下的逻辑
检查布隆过滤器，如果为假，直接返回
检查词表，如果被搜索单词不在词表中，直接返回
在倒排表中找到所有对应的事件id，然后返回事件的内容

我们运行下看看把：
s = Splunk()s.add_event('src_ip = 1.2.3.4')s.add_event('src_ip = 5.6.7.8')s.add_event('dst_ip = 1.2.3.4')for event in s.search('1.2.3.4'): print eventprint '-'for event in s.search('src_ip'): print eventprint '-'for event in s.search('ip'): print eventsrc_ip = 1.2.3.4dst_ip = 1.2.3.4-src_ip = 1.2.3.4src_ip = 5.6.7.8-src_ip = 1.2.3.4src_ip = 5.6.7.8dst_ip = 1.2.3.4是不是很赞！
更复杂的搜索
更进一步，在搜索过程中，我们想用And和Or来实现更复杂的搜索逻辑。
上代码：
class SplunkM(object): def __init__(self): self.bf = Bloomfilter(64) self.terms = {} # Dictionary of term to set of events self.events = [] def add_event(self, event): """Adds an event to this object""" # Generate a unique ID for the event, and save it event_id = len(self.events) self.events.append(event) # Add each term to the bloomfilter, and track the event by each term for term in segments(event): self.bf.add_value(term) if term not in self.terms: self.terms[term] = set() self.terms[term].add(event_id) def search_all(self, terms): """Search for an AND of all terms""" # Start with the universe of all events... results = set(range(len(self.events))) for term in terms: # If a term isn't present at all then we can stop looking if not self.bf.might_contain(term): return if term not in self.terms: return # Drop events that don't match from our results results = results.intersection(self.terms[term]) for event_id in sorted(results): yield self.events[event_id] def search_any(self, terms): """Search for an OR of all terms""" results = set() for term in terms: # If a term isn't present, we skip it, but don't stop if not self.bf.might_contain(term): continue if term not in self.terms: continue # Add these events to our results results = results.union(self.terms[term]) for event_id in sorted(results): yield self.events[event_id]