One of the most powerful uses of open source tools in the social media age is aggregating large bodies of text from multiple sources around the world, at any time of day, across an array of languages, cultures, and nationalities. At its most basic level, lexical analysis can show the most-searched-for terms on Google on a given day or which keywords appeared most frequently. At a higher level, it can parse the meaning behind language and infer information about the people engaging in social media, including demographic characteristics such as age, social class, economic background, and education level.
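The most basic level described above, counting which keywords appear most often, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample posts, the stop-word list, and the function name `keyword_counts` are all hypothetical, and the simple `\w+` tokenizer will not segment languages written without spaces between words, such as Mandarin.

```python
import re
from collections import Counter

# Hypothetical sample posts; a real workflow would load collected text instead.
posts = [
    "Flooding closes the main bridge downtown",
    "Bridge closed again, flooding everywhere",
    "Is the bridge open yet?",
]

# A tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "is", "a", "an", "yet", "again"}

def keyword_counts(texts):
    """Count how often each non-stop-word token appears across all texts."""
    counts = Counter()
    for text in texts:
        # Lowercase, then split into runs of word characters.
        tokens = re.findall(r"\w+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts

# The two most frequent keywords across the sample posts.
top = keyword_counts(posts).most_common(2)
print(top)
```

Note that "closes" and "closed" are counted as separate keywords here; grouping such variants requires language-specific processing, which is one reason the higher-level analysis is harder than simple counting.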
Beyond the analytic tools themselves, advanced lexical analysis often depends on having a base corpus for reference. By corpus, in this context, we mean not simply a large collection of text but a comprehensive body of text that provides the basis for the descriptive analysis of a language. While there are well-established corpora available for some languages, including English, Mandarin, and Russian, many languages lack established corpora, and some lexical analytic tools cannot be employed until such corpora are created. Machine learning, which is discussed in greater detail later in this chapter, is already helping to overcome some of the language deficits in lexical analysis, and these methods will continue to improve over time.
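One way to see why a reference corpus matters is to compare how often a word appears in a sample against how often it appears in ordinary usage. The sketch below is a simplified illustration, not necessarily the method any particular tool uses: the two token lists are toy stand-ins for a real corpus and a real sample, and `relative_freq` and the smoothing `floor` are assumptions introduced for the example.

```python
from collections import Counter

def relative_freq(tokens):
    """Return each token's share of the total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Toy stand-in for a reference corpus of ordinary usage (hypothetical data).
reference_tokens = "the cat sat on the mat and the dog sat on the rug".split()
# Toy stand-in for the text under analysis (hypothetical data).
sample_tokens = "the protest on the bridge and the protest downtown".split()

ref = relative_freq(reference_tokens)
sample = relative_freq(sample_tokens)

# Words absent from the reference get a small floor frequency so the
# ratio stays defined; the choice of floor is an assumption.
floor = 1 / (len(reference_tokens) + 1)

# A ratio above 1 means the word is more frequent in the sample than in
# the reference; a ratio below 1 means it is less frequent.
ratios = {word: freq / ref.get(word, floor) for word, freq in sample.items()}
distinctive = sorted(ratios, key=ratios.get, reverse=True)
print(distinctive[0])
```

Without the reference, a common word such as "the" would rank as the top keyword in the sample simply because it is frequent everywhere. For a language with no established corpus there is no reliable baseline to divide by, which is the gap described above.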