## High-Dimensional Classification Essay

## Viewpoint: Quantum liquids move to a higher dimension

- Tameem Albash and Stephan Haas, Department of Physics and Astronomy, University of Southern California, Los Angeles, CA 90089-0484, USA

- *Physics* 4, 62

A spin model on a honeycomb lattice points to a much sought after type of quantum spin liquid: the Bose metal.

In a quantum spin liquid, the ground state neither has long-range magnetic order nor does it break any spatial symmetry. Quantum fluctuations—analogous to thermal fluctuations in a classical liquid—can destabilize ordered crystalline phases to produce this exotic phase of matter. Quantum spin liquids are of particular interest as they violate conventional wisdom about condensation of matter as one approaches zero temperature, e.g., to a superconducting phase. Since quantum fluctuations increase in strength as the dimensionality of the system is decreased, the search for quantum liquids has largely occurred in dimensions smaller than three.

There is ample evidence—quantum Hall edge states [1], for example—for quantum liquid behavior in one spatial dimension, and there are general arguments that show that extended gapless phases—where the energy gap between the ground state and the first excited state vanishes—can emerge in higher dimensions because of quantum criticality, i.e., in parameter regimes where two ordered phases compete to suppress each other (Fig. 1).

Of particular interest is the case of gapless spin liquids. Two classes of gapless spin liquids have been identified, which can be distinguished by the extent of their zero-energy hypersurfaces in reciprocal space [2]: (i) algebraic spin liquids, for which the elementary excitations are not described by free fermions or bosons, and spin correlations have a finite set of singularities at discrete values of momentum; and (ii) spin Bose metals, whose elementary excitations are described by free bosons, with spin correlations that are singular along an extended surface in momentum space. This “Bose surface” is reminiscent of the Fermi surface in conventional metals, except that it can be tuned by parameters of the Hamiltonian, even at fixed particle density. Despite their thorough classification [3], robust examples of extended quantum liquid phases in higher dimensions have been elusive. Much work has been done in constructing examples, such as ring-exchange models [4], but these do not appear to be easy to realize experimentally. This seems to suggest that nature tends to select ordered states in higher dimensions, even in the presence of strong quantum fluctuations.

Nonetheless, in a study reported in *Physical Review Letters* [5], Christopher Varney of the Joint Quantum Institute at the University of Maryland, College Park, and Georgetown University, Washington D.C., and his colleagues present a convincing example of a Bose metal phase in two dimensions. The authors use numerical diagonalization techniques to examine the zero-temperature phase diagram of the frustrated spin-1/2 model on a honeycomb lattice (see Fig. 2). Monitoring the ground-state fidelity metric, a useful theoretical tool borrowed from quantum information theory [6], and the energy of finite clusters as a function of the competing antiferromagnetic nearest-neighbor and next-nearest-neighbor couplings, the authors identify three quantum phase transitions that separate four phases. The ratio of the competing antiferromagnetic couplings corresponds to the frustration of the system, which here serves as the tuning parameter that governs the phase diagram (akin to the horizontal axis in Fig. 1). Furthermore, the behavior of the condensate fraction in each of the four regimes identifies three of these regimes as ordered phases, corresponding to quantum renormalized versions of classical spin configurations (i.e., where the spins are perturbed, but not destroyed, by quantum fluctuations), while the phase with a vanishing condensate fraction is consistent with a quantum spin liquid [7].

Studying the momentum distribution functions in the ordered phases provides experimentally verifiable evidence for spinon condensation at various points in the Brillouin zone. For example, at small frustration the condensate is a simple antiferromagnet, whereas at very large frustration it corresponds to the collinear spin configuration depicted in Fig. 2. As the spin-1/2 model can be mapped to an equivalent model of hard-core bosons, these ordered phases correspond to Bose condensates at distinct ordering wave vectors. As an interesting aside, Varney *et al.* noted that in some of the condensates more than one single-particle state is occupied.

The main result of Varney *et al.*’s study, however, is the finding of an extended quantum spin liquid phase. This regime features a Bose surface with a wave vector that is tunable by the frustration of the model. This particular feature in the momentum distribution function is a defining characteristic of a Bose metal, and would thus serve as a key observation in an experimental study.

This simple model, being less prone to the singlet formation that would tend to stabilize valence bond order, i.e., a spatially frozen dimer decoration of the underlying lattice, is arguably a better candidate for such quantum spin liquid behavior than the Heisenberg model. Consider, for example, the frustrated Heisenberg model on a square lattice, with competing antiferromagnetic interactions between nearest-neighbor and next-nearest-neighbor spins, a familiar system in condensed matter physics that has been a contender for exhibiting an extended quantum spin liquid regime [8]. The numerical evidence in favor of a gapless quantum spin liquid phase is much less clear cut in this case because of the presence of competing valence bond crystal order, which is difficult to study with available numerical techniques.

The ultimate experimental verification of this Bose metal phase in the spin-1/2 model on a honeycomb lattice will probably hinge on an implementation that uses cold atoms trapped in optical lattices, i.e., relying on a mapping of the spin system to an equivalent model of interacting hard-core bosons. In the past few years, this technology has become a reliable emulation tool for quantum many-body lattice systems with tunable interactions [9]. Bose metal phases are also expected to occur at the interface between superconducting and insulating phases in quasi-two-dimensional films [10]. In this case, disorder in the phase of the superconducting order parameter is expected to lead to a specific type of phase glass that coexists with a bosonic metallic channel. However, it is not yet clear whether there is any connection between the phase glass scenario and the microscopic model studied by Varney *et al.* Nonetheless, the condensed matter community should be excited to finally have access to a straightforward quantum many-body system that exhibits a Bose metal phase.

## References

1. B. I. Halperin, P. A. Lee, and N. Read, Phys. Rev. B **47**, 7312 (1993).
2. O. I. Motrunich and M. P. A. Fisher, Phys. Rev. B **75**, 235116 (2007); D. N. Sheng, O. I. Motrunich, and M. P. A. Fisher, Phys. Rev. B **79**, 205112 (2009); T. Senthil and P. A. Lee, Phys. Rev. B **71**, 174515 (2005).
3. X.-G. Wen, Phys. Rev. B **65**, 165113 (2002).
4. R. G. Melko, A. W. Sandvik, and D. J. Scalapino, Phys. Rev. B **69**, 100408 (2004); R. G. Melko and A. W. Sandvik, Phys. Rev. E **72**, 026702 (2005); A. Paramekanti, L. Balents, and M. P. A. Fisher, Phys. Rev. B **66**, 054526 (2002); L. Balents and A. Paramekanti, Phys. Rev. B **67**, 134427 (2003).
5. C. N. Varney, K. Sun, V. Galitski, and M. Rigol, Phys. Rev. Lett. **107**, 077201 (2011).
6. D. F. Abasto, A. Hamma, and P. Zanardi, Phys. Rev. A **78**, 010301 (2008).
7. S. Sachdev, *Quantum Phase Transitions* (Cambridge University Press, Cambridge, 2011).
8. E. Dagotto and A. Moreo, Phys. Rev. B **39**, 4744 (1989); L. Capriotti, F. Becca, A. Parola, and S. Sorella, Phys. Rev. Lett. **87**, 097201 (2001); Phys. Rev. B **67**, 212402 (2003).
9. I. Bloch, J. Dalibard, and W. Zwerger, Rev. Mod. Phys. **80**, 885 (2008); M. A. Cazalilla, R. Citro, T. Giamarchi, E. Orignac, and M. Rigol, arXiv:1101.5337 (to be published).
10. P. Phillips and D. Dalidovich, Science **302**, 243 (2003).

## About the Authors

Tameem Albash is a joint post-doc between the High Energy group and the Computational Condensed Matter Theory group at the University of Southern California. He received his Ph.D. from the University of Southern California in 2010 for work on applications of AdS/CFT to condensed matter systems.

Stephan Haas is a condensed matter theorist with research interests in quantum information theory, quantum magnetism, and unconventional superconductivity. His research group investigates microscopic models of interacting electronic systems, using numerical techniques such as Quantum Monte Carlo to study their phase diagrams, thermodynamics, and excitation spectra. He received his Ph.D. from Florida State University, followed by a postdoc at the Swiss Federal Institute of Technology. He is currently a professor at the University of Southern California.

## Subject Areas

Magnetism, Strongly Correlated Materials


^{1}Department of Computer and Radio Communications Engineering, Korea University, Anam-dong 5-ga, Seongbuk-gu, Seoul 136-713, Republic of Korea
^{2}Research Institute of Korean Studies, Korea University, Anam-dong 5-ga, Seongbuk-gu, Seoul 136-713, Republic of Korea

Copyright © 2015 Bong-Jun Yi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Current machine learning (ML) based automated essay scoring (AES) systems employ a large and varied set of features that have proven useful for improving AES performance. However, the high-dimensional feature space is not properly represented, because a large volume of features is extracted from limited training data. This problem leads to poor performance and increased training time for the system. In this paper, we experiment with and analyze the effects of feature optimization, including normalization, discretization, and feature selection techniques, for different ML algorithms, taking into consideration the size of the feature space and the performance of the AES. We show that appropriate feature optimization techniques can reduce the dimensionality of the feature space, thus contributing to efficient training and performance improvement of AES.

#### 1. Introduction

Generally, essay scoring is performed manually by skilled assessment experts. However, when essays are scored manually, there are several limitations. First, it is difficult to acquire consistent scoring results, because of human errors and biased preconceptions. Second, it requires a considerable amount of time and effort. Third, it is impractical for humans to provide a detailed analysis or individual feedback. Consequently, there is growing interest in a computerized system that can automatically assess essays, since the system could potentially assist or even replace human assessors.

So far, most AES approaches have tried to find an appropriate ML algorithm and have focused on finding useful features for training it [1–5]. In this paper, we attempt to use all of the features that have been proven useful for training AES systems, and we analyze the effects of integrating those features. However, if this vast number of features is used to train the system all together, the following problems may arise.

(1) The limited amount of training data does not properly represent the expanded high-dimensional feature space; thus, optimized training cannot be performed.
(2) The vast number of features must be considered at the same time; thus, training time increases.

Phenomenon (1) is referred to as the “curse of dimensionality.” Most features used for AES take integer or real values, and there are hundreds of features. Thus, a system using these features cannot help but fall into the “curse of dimensionality.” ML based systems must be trained on high-dimensional data by spending a tremendous amount of time on training. Therefore, it is essential that the feature space be reduced, so that optimal performance can be obtained and training can be made more efficient.

In this paper, we experiment with and analyze the effects of three different techniques to reduce the feature space: normalization, discretization, and feature selection. Normalization transforms the different ranges of each feature's values into a fixed range, thereby reducing the overall range of feature values. Discretization groups numeric feature values and converts the values belonging to a specific group into one corresponding integer value, thus reducing the number of distinct feature values. Feature selection reduces the feature space by keeping only features that are relevant to the answers and that discriminate well between different samples.

Unfortunately, normalization, discretization, and feature selection do not always have positive effects on application performance. An appropriate combination of feature optimization techniques must be selected for the target domain and ML algorithm. Our research shows that using the appropriate feature optimization techniques can reduce the dimensionality of the feature space and thus yield efficient training and performance improvement of AES.

The remainder of this paper is organized as follows.

(i) In Section 2, we discuss previous approaches to AES and the reduction of feature space.
(ii) Section 3 presents the ML based AES architecture, ML algorithms for AES, and diverse features for ML based AES.
(iii) In Section 4, feature optimization techniques (mainly normalization, discretization, and feature selection techniques) are described.
(iv) In Section 5, we discuss and analyze the experimental results of various feature optimization techniques.
(v) Finally, in Section 6, we conclude the paper.

#### 2. Related Works

Various studies on AES have taken place. Often, research based on the ML approach focuses on exploring novel features and learning methods to improve essay-scoring performance. Project Essay Grade [1], the first AES study, used multiple regression (MR) and achieved correlation values similar to those achieved by human assessors. The study used mostly count based features, including the number of words, phrases, parentheses, apostrophes, commas, periods, colons, and semicolons. Intelligent essay assessor (IEA) [2] assessed only the content of essays written by native English speakers, using latent semantic analysis (LSA). Words in essays were represented as vectors in a semantic space, whose dimension was reduced by using LSA. Ungraded essays were compared with graded essays using the cosine similarity measure, and the score of the most similar graded essay was assigned to the ungraded essay. BETSY [6] evaluated essays using a classifier based on Naive Bayes (NB), which employs specific words and phrases as features. The E-rater [7] system used new features extracted with natural language processing (NLP) techniques. In 2010 and 2012, a few studies used the support vector regression (SVR) or ranking SVM algorithms to effectively combine various features [3–5].

Previous AES studies avoided using too many features, because the variety of useful features was not widely known, and increasing the number of features would increase the training time of ML methods. In order to utilize a vast number of useful features and efficiently perform AES, appropriate feature optimization techniques, such as normalization, discretization, and feature selection, are required.

Several different approaches have been developed to reduce the feature space by using normalization and discretization techniques, thereby improving performance [8–11]. Shalabi et al. applied three normalization techniques (min-max, z-score, and decimal point) to image data in the preprocessing step of ML [8]; Jayalakshmi and Santhakumaran applied various normalization techniques to medical data concerning diabetes diagnosis [9]. Both studies experimented with and compared various normalization techniques and showed that performance improvement is possible by employing a normalization technique appropriate for the task.

Chmielewski and Grzymala-Busse’s study [10] and Dougherty et al.’s study [11] are representative comparisons of discretization techniques, which can reduce the number of distinct feature values. Chmielewski and Grzymala-Busse proposed a discretization technique based on entropy [10], and Dougherty et al. proposed a discretization technique that can be applied to the NB based supervised ML method [11]. Both studies compared their proposed approaches with previous discretization methods on various sets of experimental data, and their results showed that the proposed approaches significantly improved classification accuracy.

There have also been various approaches to reducing the dimensionality of high-dimensional data by selecting appropriate features from the vast number of possible features in many domains, using good feature selection techniques [12–15]. Yang and Pedersen experimented with and compared feature selection techniques, including document frequency, information gain, mutual information, chi-square, and term strength, in the domain of text categorization [12]. Kakkonen et al. succeeded in reducing the dimensionality of the feature space by using LSA, PLSA, and LDA techniques in the AES domain [13]. Yu and Liu improved performance in 10 domains, including medical, chemical, census, insurance company, and spoken letter data [14]. Kalousis et al. also achieved performance improvements in three domains (proteomics, genomics, and text mining) by reducing the dimensionality of high-dimensional data [15].

So far, we have introduced studies in various domains that have shown performance improvement by reducing dimensionality. In this paper, we introduce a new domain in which the dimensionality of features can be reduced by applying feature optimization techniques.

#### 3. Automated Essay Scoring Based on Machine Learning

According to previous studies, there are many features that have been found to be useful for AES. In this section, we provide diverse features for AES and a brief description of useful ML algorithms for AES.

##### 3.1. Features of AES

In this study, we include most features that have been proven useful for AES; features that we newly propose, extracted using advanced NLP techniques, are also used in conjunction.

The frequency, average, and ratio of characters, words, and sentences are used as basic features. Under the assumption that the distribution of parts of speech (POS) differs according to essay grade, we also use features relating to POS. The level of vocabulary usage is evaluated using external resources, including elementary and middle dictionaries and the dictionary for the graduate record examination (GRE), and is also used as a feature. Various features relating to n-grams are also used, extracted from a lexical language model and a POS based language model constructed from a large amount of external corpora. The naturalness of each sentence is measured by its perplexity and is also used as a feature. The numbers of compound nouns, noun phrases, named entities, discourse markers, and grammatical errors are calculated using advanced NLP techniques and are also used as features.
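As a concrete illustration, the simplest length- and ratio-type features described above can be computed in a few lines. This is a minimal sketch with hypothetical feature names, not the paper's exact extractor:

```python
import re

def basic_features(essay):
    """Length and ratio features for one essay (illustrative subset only)."""
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    words = re.findall(r"[A-Za-z']+", essay)
    return {
        "num_chars": len(essay),
        "num_words": len(words),
        "num_sentences": len(sentences),
        # averages, guarded against empty essays
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "avg_sent_len": len(words) / max(len(sentences), 1),
    }
```

A real extractor would add the POS, n-gram, perplexity, and NLP based features on top of such counts.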

The number of features used in our study exceeds 300; the features are grouped in Table 1. Representing an essay with such a large number of features, with a diverse range of feature values, means that the vector including all features yields high-dimensional data. Therefore, a process to reduce dimensions via feature optimization is mandatory in order to efficiently train on such a large number of features.

**Table 1:** The list of features used for training the AES model.

##### 3.2. Machine Learning Algorithms for AES

So far, there have been many attempts to utilize ML algorithms for AES. The ML algorithms for AES can be classified into two categories: regression and classification. Regression is used for predicting or estimating the corresponding target value given the feature values of the specific instance by analyzing the relationships between the feature values and the target value. Classification is used to determine the corresponding category of the specific instance.

The big difference between the two approaches is in the characteristics of the target values. The target values for regression are ordered and continuous, while the target values for classification are unordered and discretized. For AES, the regression approaches try to predict the continuous target score based on feature values that represent the characteristics of essays; the classification approaches try to identify the categorized score under the assumption that each essay belongs to a specific category.

In this study, we experiment with and compare two regression based ML algorithms and two classification based ML algorithms for AES, and we try to identify the appropriate ML algorithms for AES through different experiments.

In this section, we provide a brief description of the four ML algorithms applied in our experimentation for AES: the MR model [16], the maximum entropy (ME) model [16], the support vector machine (SVM) [17], and SVR [18]. We selected these ML algorithms for the following reasons.

(i) MR is the most widely used algorithm in AES research.
(ii) ME achieves good results in document classification using many features.
(iii) SVM is among the best algorithms for solving various classification problems.
(iv) SVR applies regression to SVM, and it is expected that SVR may have the benefits of both regression and classification.

###### 3.2.1. Multiple Regression Model

The MR model [16] is the oldest [1] and the most widely used approach among ML based AES studies. This model tries to find the weight vector w that satisfies the expression in (1), under the assumption that the relationship between the feature values and the target value is linear:

y = w_0 + w_1·x_1 + w_2·x_2 + ⋯ + w_n·x_n. (1)

Each instance of the training data represents an essay, and an essay is represented by feature values (e.g., the length of the essay and the number of advanced vocabulary items) and the target value (i.e., the score of the essay). The real feature values, denoted by x_1, …, x_n, can take different values, and x_0 always has the value 1, reflecting the constant weight w_0. The training process finds the optimal weights w that predict the essay scores most precisely over all of the training data. The sum-squared error technique is widely used for finding the optimal w.
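The sum-squared-error fit above is an ordinary least-squares problem. The sketch below (using NumPy, with made-up feature rows) shows one way it could be computed; the helper names are our own:

```python
import numpy as np

def fit_mr(feature_rows, scores):
    """Find w minimizing the sum-squared error; a constant column x_0 = 1 supplies w_0."""
    X = np.hstack([np.ones((len(feature_rows), 1)), np.asarray(feature_rows, dtype=float)])
    w, *_ = np.linalg.lstsq(X, np.asarray(scores, dtype=float), rcond=None)
    return w

def predict_mr(w, feature_row):
    """Score = w_0 + w_1*x_1 + ... + w_n*x_n."""
    return float(w[0] + np.dot(w[1:], feature_row))
```

For essays described by a single feature whose scores grow linearly with it, the fit recovers the line exactly.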

###### 3.2.2. Maximum Entropy Model

The ME model [16] is also referred to as the multinomial logistic regression model. The ME model is widely used in many applications as an alternative to the NB model, because it makes no independence assumption, unlike NB, which strongly assumes that the features are independent; NB was once applied to AES [6]. We chose the ME model for the AES experiment because it is a good classification algorithm that has shown high performance in various tasks, including text categorization, POS tagging, and named entity recognition [19–22].

The fundamental formula for the ME model is given in expression (2). The model computes the probability that a specific instance d (an essay) belongs to class c (a grade of the essay). The instance is assigned to the class with the highest calculated probability among all classes. The instance is represented by a combination of various features, denoted f_i, and each feature f_i always has a corresponding weight λ_i. The normalization factor Z(d) is needed to convert the exponential value into a true probability:

P(c | d) = (1 / Z(d)) · exp( Σ_i λ_i f_i(d) ). (2)

The final formula of the ME model is given in expression (3). The weight λ_i in (2) becomes λ_{c,i} in (3), because each feature has a different weight for each class. The function f_i is an indicator function, since only 0 and 1 are used as feature values in ME. The normalization constant in the denominator is calculated by summing the computation over every class. The feature values for the essay, such as its length and the number of advanced vocabulary items, are transformed into 0 or 1 by the indicator functions, multiplied by the weights, and then combined into a total value by the log-linear model. The probability of assigning each grade to the given essay is calculated separately, and the final grade is produced as the essay score based on those probability values:

P(c | d) = exp( Σ_i λ_{c,i} f_i(c, d) ) / Σ_{c′} exp( Σ_i λ_{c′,i} f_i(c′, d) ). (3)
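The per-class log-linear scoring and normalization described above can be sketched directly. This is an illustrative implementation with invented weight names, not the paper's trained model:

```python
import math

def me_probabilities(active_features, weights, classes):
    """P(c|d): per-class weights on binary indicator features, normalized over all classes."""
    scores = {
        c: math.exp(sum(weights[c].get(f, 0.0) for f in active_features))
        for c in classes
    }
    z = sum(scores.values())  # normalization constant: sum over every class
    return {c: s / z for c, s in scores.items()}
```

The predicted grade is simply the class with the highest returned probability.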

###### 3.2.3. Support Vector Machine

The SVM [17] is one of the most representative ML algorithms for classification. The SVM algorithm represents each instance as a dot in a high-dimensional space with one dimension per feature and finds the most appropriate hyperplane to properly separate the dots.

In Figure 1, two classes, represented by white dots and black dots, indicate the training instances in a two-dimensional space; there are only two features, x_1 and x_2. The hyperplane is determined by the weight vector w and the bias b and is represented by the following expression:

w · x + b = 0. (4)

In the training process, SVM tries to find the hyperplane for which the distance to the nearest instance (the margin) is maximized. The class of a new instance is determined by identifying which of the spaces separated by the hyperplane it falls into.
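Given a learned hyperplane, the decision rule and the margin computation are simple to state in code. This sketch uses invented w and b values; it illustrates the geometry, not SVM training itself:

```python
import math

def side_of_hyperplane(w, b, x):
    """Return +1 or -1 depending on which half-space of w.x + b = 0 the instance falls in."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s >= 0 else -1

def distance_to_hyperplane(w, b, x):
    """Geometric distance |w.x + b| / ||w||; training maximizes this for the nearest instances."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(s) / norm
```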

The task for AES is to classify an essay into one of six grades by utilizing more than 300 different kinds of features. Finding the hyperplanes that separate the six grades in this more-than-300-dimensional feature space requires much optimization time. Therefore, the feature space must be reduced through feature optimization techniques.

###### 3.2.4. Support Vector Regression

SVR [18] is the algorithm that applies SVM to regression. According to Li and Yan’s study, it is known as the best ML algorithm for AES [4]; thus, we chose SVR for our experimentation:

minimize (1/2)‖w‖² (5)

subject to |y_i − (w · x_i + b)| ≤ ε for every training instance i. (6)

In the training process, SVR tries to find the w in expression (5) that satisfies the conditions in expression (6): it seeks the hyperplane with the smallest ‖w‖ such that every instance lies within distance ε of the hyperplane.
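The ε-insensitive criterion underlying SVR can be written directly: predictions within ε of the true score incur no penalty. This is a sketch of the loss only (the function name and default ε are our own), not a full SVR solver:

```python
def epsilon_insensitive_loss(y_true, y_pred, eps=0.5):
    """Sum of errors exceeding eps; instances within eps of the hyperplane contribute zero."""
    return sum(max(0.0, abs(t - p) - eps) for t, p in zip(y_true, y_pred))
```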

##### 3.3. System Architecture of AES

Figure 2 shows the architecture of the AES system, based on ML methods. There are two work processes: learning and prediction. Labeled training data is input into the NLP module in the learning process, and unlabeled data is input into the NLP module in the prediction process. Identical processing is applied from natural language processing through feature conversion in both processes. First, essays are input into the NLP module, which includes the sentence breaker, tokenizer, lemmatizer, POS tagger, noun phrase extractor, and named entity extractor. The results from the NLP module are delivered to the feature extractor.

**Figure 2: **Automated essay scoring system architecture based on machine learning.

The feature extractor uses all of the natural-language-processed information to extract features that accurately represent an essay’s characteristics. More than 300 diverse features can be employed for training the AES model. The features can be classified into six categories, as shown in Table 1. There are features relating to length (e.g., the number of characters, words, and sentences), features relating to ratio (e.g., the proportion of a specified discourse marker or POS tag), and features constructed from NLP results or statistics. These features have been used in recent studies.

The feature optimizer selectively performs normalization, discretization, and feature selection for optimal performance. The labeled essays are input for training into the learner module, and the unlabeled essays are input for prediction into the predictor module. The learner module is used to create the training model, and the training model is used in the predictor module to calculate the final grade (score) of an essay.

In this paper, we focus on ML dependent feature optimization techniques for reducing dimensions of features, because the performance of AES is dependent on ML algorithms and feature optimization.

#### 4. Feature Optimization

##### 4.1. Normalization

In research based on ML, features with a variety of types and ranges are used. Feature values used in AES, such as the number of words and the ratio of words found in an elementary dictionary, also have different ranges. If we use these feature values directly in the AES system, learning may be suboptimal. Therefore, we must convert all feature values into a fixed range.

Although there are a number of normalization methods, we selected the two most commonly used: min-max and z-score normalization. Min-max normalization is generally known to achieve the best performance, according to related works [8, 9]:

x′ = (x − min(X)) / (max(X) − min(X)), (7)

x′ = (x − μ_X) / σ_X. (8)

In Formula (7), min-max normalization converts all feature values into a fixed range while preserving the relative spacing of the original values. In Formula (8), z-score normalization converts all feature values into z-scores of a normal distribution, without preserving that spacing. The symbol X denotes the set of all values of one feature, extracted from the training data; x is the one feature value to normalize, μ_X is the mean of X, and σ_X is its standard deviation.
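Both normalizations above are one-liners over the training-data statistics. A minimal sketch (function names are our own):

```python
def min_max_normalize(train_values, x):
    """Formula (7): map x into [0, 1] using the min and max observed in training data."""
    lo, hi = min(train_values), max(train_values)
    return (x - lo) / (hi - lo)

def z_score_normalize(train_values, x):
    """Formula (8): map x to its z-score under the training data's mean and std deviation."""
    n = len(train_values)
    mu = sum(train_values) / n
    sigma = (sum((v - mu) ** 2 for v in train_values) / n) ** 0.5
    return (x - mu) / sigma
```

Note that both use statistics from the training data only, so the same transform can be applied to unseen essays at prediction time.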

##### 4.2. Discretization

Discretizing feature values simplifies the data representation by converting continuous feature values that belong to a specific range into a single feature value. This makes the feature values more suitable for ML. The type of discretization method, the range of each discretized section, and the number of discretized sections all affect the performance of the ML system.

In our study, we used two simple discretization methods: discretization by instance number (DIN) and discretization by feature value (DFV). DIN assigns the same number of instances to each section, after sorting them by feature value. DFV converts all feature values that belong to a specific range into one feature value, after setting the range of feature values for each section. DIN is advantageous when the distribution of feature values is uniform; DFV is beneficial when the distribution of feature values is normal.
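The two schemes can be sketched as equal-frequency versus equal-width binning; k is the number of sections, and the function names are our own:

```python
def discretize_by_instance_number(values, k):
    """DIN: sort by value, then give each of the k sections the same number of instances."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    sections = [0] * len(values)
    for rank, i in enumerate(order):
        sections[i] = min(rank * k // len(values), k - 1)
    return sections

def discretize_by_feature_value(values, k):
    """DFV: split the value range into k equal-width sections."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against a constant feature
    return [min(int((v - lo) / width), k - 1) for v in values]
```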

##### 4.3. Feature Selection

Feature selection filters out noisy features and discovers the optimal feature set for ML. Even if we determine a feature set with appropriate intuition and assumptions, some features in the set may have a negative effect. Further, too many features can hinder or delay learning. In this work, we compare three feature selection methods: correlation (COR) [14], information gain (IG) [14], and minimal-redundancy-maximal-relevance (mRMR) [23]. Feature selection is performed based on the relevance between the feature values and the golden score: we select the top n features in order of that relevance. We use Pearson’s correlation coefficient and information gain to measure the relevance; these are popular measures for calculating the relevance between two sets.

Formula (9) is the correlation formula between the random variables X and Y. In this case, Y is the golden score and X is one feature; x_i and y_i are the values for each sample, and x̄ and ȳ are the means of X and Y:

r(X, Y) = Σ_i (x_i − x̄)(y_i − ȳ) / √( Σ_i (x_i − x̄)² · Σ_i (y_i − ȳ)² ). (9)
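Correlation based (COR) selection then amounts to ranking features by the absolute value of r against the golden scores and keeping the top n. A minimal sketch (data layout and helper names are our own):

```python
def pearson(xs, ys):
    """Formula (9): Pearson's correlation between one feature column and the golden scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def select_top_features(feature_columns, golden_scores, top_n):
    """COR selection: keep the top_n feature names by |correlation| with the golden score."""
    ranked = sorted(
        feature_columns,
        key=lambda name: abs(pearson(feature_columns[name], golden_scores)),
        reverse=True,
    )
    return ranked[:top_n]
```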
