基于python语言的大数据搜索引擎( 二 )

主要分割
主要分割使用空格来分词,实际的分词逻辑中,还会有其它的分隔符 。例如Splunk的缺省分割符包括以下这些,用户也可以定义自己的分割符 。
] < > ( ) { } | ! ; , ' " * s & ? + %21 %26 %2526 %3B %7C %20 %2B %3D -- %2520 %5D %5B %3A %0A %2C %28 %29
def minor_segments(s): """ Perform minor segmenting on a string. This is like major segmenting, except it also captures from the start of the input to each break. """ minor_breaks = '_.' last = -1 results = set() for idx, ch in enumerate(s): if ch in minor_breaks: segment = s[last+1:idx] results.add(segment) segment = s[:idx] results.add(segment) last = idx segment = s[last+1:] results.add(segment) results.add(s) return results次要分割
次要分割和主要分割的逻辑类似,只是还会把从开始部分到当前分割的结果加入 。例如“1.2.3.4”的次要分割会有1,2,3,4,1.2,1.2.3
def segments(event): """Simple wrapper around major_segments / minor_segments""" results = set() for major in major_segments(event): for minor in minor_segments(major): results.add(minor) return results分词的逻辑就是对文本先进行主要分割,对每一个主要分割在进行次要分割 。然后把所有分出来的词返回 。
我们看看这段 code是如何运行的:
for term in segments('src_ip = 1.2.3.4'): print termsrc1.21.2.3.4src_ip311.2.3ip2=4搜索
好了,有个分词和布隆过滤器这两个利器的支撑后,我们就可以来实现搜索的功能了 。
上代码:
class Splunk(object): def __init__(self): self.bf = Bloomfilter(64) self.terms = {} # Dictionary of term to set of events self.events = [] def add_event(self, event): """Adds an event to this object""" # Generate a unique ID for the event, and save it event_id = len(self.events) self.events.append(event) # Add each term to the bloomfilter, and track the event by each term for term in segments(event): self.bf.add_value(term) if term not in self.terms: self.terms[term] = set() self.terms[term].add(event_id) def search(self, term): """Search for a single term, and yield all the events that contain it""" # In Splunk this runs in O(1), and is likely to be in filesystem cache (memory) if not self.bf.might_contain(term): return # In Splunk this probably runs in O(log N) where N is the number of terms in the tsidx if term not in self.terms: return for event_id in sorted(self.terms[term]): yield self.events[event_id]

  • Splunk代表一个拥有搜索功能的索引集合
  • 每一个集合中包含一个布隆过滤器,一个倒排词表(字典),和一个存储所有事件的数组
  • 当一个事件被加入到索引的时候,会做以下的逻辑
  • 为每一个事件生成一个unqie id,这里就是序号
  • 对事件进行分词,把每一个词加入到倒排词表,也就是每一个词对应的事件的id的映射结构,注意,一个词可能对应多个事件,所以倒排表的的值是一个Set 。倒排表是绝大部分搜索引擎的核心功能 。
  • 当一个词被搜索的时候,会做以下的逻辑
  • 检查布隆过滤器,如果为假,直接返回
  • 检查词表,如果被搜索单词不在词表中,直接返回
  • 在倒排表中找到所有对应的事件id,然后返回事件的内容
我们运行下看看把:
s = Splunk()s.add_event('src_ip = 1.2.3.4')s.add_event('src_ip = 5.6.7.8')s.add_event('dst_ip = 1.2.3.4')for event in s.search('1.2.3.4'): print eventprint '-'for event in s.search('src_ip'): print eventprint '-'for event in s.search('ip'): print eventsrc_ip = 1.2.3.4dst_ip = 1.2.3.4-src_ip = 1.2.3.4src_ip = 5.6.7.8-src_ip = 1.2.3.4src_ip = 5.6.7.8dst_ip = 1.2.3.4是不是很赞!
更复杂的搜索
更进一步,在搜索过程中,我们想用And和Or来实现更复杂的搜索逻辑 。
上代码:
class SplunkM(object): def __init__(self): self.bf = Bloomfilter(64) self.terms = {} # Dictionary of term to set of events self.events = [] def add_event(self, event): """Adds an event to this object""" # Generate a unique ID for the event, and save it event_id = len(self.events) self.events.append(event) # Add each term to the bloomfilter, and track the event by each term for term in segments(event): self.bf.add_value(term) if term not in self.terms: self.terms[term] = set() self.terms[term].add(event_id) def search_all(self, terms): """Search for an AND of all terms""" # Start with the universe of all events... results = set(range(len(self.events))) for term in terms: # If a term isn't present at all then we can stop looking if not self.bf.might_contain(term): return if term not in self.terms: return # Drop events that don't match from our results results = results.intersection(self.terms[term]) for event_id in sorted(results): yield self.events[event_id] def search_any(self, terms): """Search for an OR of all terms""" results = set() for term in terms: # If a term isn't present, we skip it, but don't stop if not self.bf.might_contain(term): continue if term not in self.terms: continue # Add these events to our results results = results.union(self.terms[term]) for event_id in sorted(results): yield self.events[event_id]


推荐阅读