Monthly Archives: October 2011
Summarizing Software versus Human Agent Précis Writing
To what extent do summaries generated in a natural test environment live up to product descriptions, and how do they compare when they are pitted against summaries written by human agents? And how do the four summarizing products tested compare among themselves? Do different summarizers come up with the same results when fed the same text? So, is it plain low-level algorithm-based “copying and pasting” techniques versus higher-order thinking skills in humans? In an analysis running to 100 pages, I set out to find out exactly that.
Throughout the tests, no traces of human-like intelligent capabilities have been found in machine-generated summaries
With regard to “intelligent” properties, summarizers do not live up to the promises made in product descriptions. Commercial computer software (summarizers) cannot produce summaries which are comparable to those produced by human agents. Throughout the tests, and not unexpectedly, no trace of human-like intelligent capabilities has been found in machine-generated summaries. The methods summarizing software uses are plain low-level algorithm-based “copying and pasting” techniques generating summaries in an automaton-like fashion. The summaries are not the results of a mental process; current summarization software is incapable of generalization, condensation or abstraction. Summarizers extract, copy out or filter out original sentences or fragments in the right sequence, but in a fashion that leaves them unconnected in content. Summarizing software cannot distinguish the essential from the inessential; it cannot abstract the essence of original texts and condense them into a restatement (onto a higher conceptual plane). Summarizing software lacks the properties of abstract thinking, of analysis and synthesis. It has no insight; it cannot interpret facts and grasp concepts, let alone the wider overtones of any given text, e.g. deliberately used humour, sarcasm and irony or biased tendencies. It cannot order and group facts and ideas, nor can it compare and contrast them or infer causes. For the time being, it can neither condense text into pithy restatements nor reproduce text as paraphrased abridgments, nor can it recast sentences at even the most elementary level.
Direct comparisons of the human brain functions used in précis writing with the algorithms of summarizing software get short shrift in academic papers, at least in those available on this subject. Readers interested in this subject, yet unaccustomed to reading about off-the-beaten-track topics, may find this text interesting.
Reporting structures are the most frequent structures used in any language, yet little emphasis is placed on this fact in education and (foreign) language training, as any textbook analysis will reveal. “Summarization is one of the most common acts of language behaviour. If we are asked to say what happened in a meeting, what someone has told us about another person or about an event, what a television programme was about, or what the latest news is from the Middle East, we are being asked or invited to express in condensed form the basic parts of an earlier spoken or written text or discourse.”1 Often, such summaries are shortened or constricted or abstracted onto another level. But they very often use the same verbs, verbal phrases etc. as reported speech.
In this test series, I compared summaries produced or created by human agents – also called abstracts, synopses, or précis – against extracts generated or copied out by various summarizing software or programs (software agents, also called computational agents). All human agent sample summaries have been taken from Cambridge Proficiency Examination practice books (UK English). My point of departure was the hype by software companies who all too frequently endow their summarizing software with human qualities, bordering on “personification”. Most of the product descriptions and reviews would have us believe that their computing power is fully comparable with human brain power. We are promised that these programs can determine what the text is about, extract the gist of any text, pinpoint the key concepts, and reduce long texts to their essentials. What is more, we are made to believe that they can analyze a text in natural language, “taking into account its structure and semantic relationships” and even get an in-depth understanding of the underlying idea.
I have taken up the challenge posed by these overblown statements, which often represent summarizers as “intelligent”, and pursued the question of whether the various commercial summarizing programs available are mere number crunchers whose algorithms simply extract or copy out sentences and fragments, or whether they possess some kind of artificial intelligence akin to that of humans. This important difference between abstract thinking in human agents and the automaton-like properties of summarizing software is looked into in some detail and supported by stringent test results confirming the superiority of the human brain over the unthinking, machine-like properties of summarizing software.
Is software “intelligent”? – And are human agents “truly” intelligent?
In academic papers, product descriptions for commercial summarization software, and generally in the field of AI, the term “intelligent agent” is frequently used in connection with software or software components. The degree to which present-day computer software, and summarization software in particular, is “truly intelligent” is seldom a principal object of investigation, be it for lack of supposed relevance or because investigative papers are not readily available to the public. Summarizing software, being artificial intelligence (AI) software, is said to be capable of generating complete summaries (extracts), which are sometimes misleadingly called précis, synopses or abstracts, all terms which rather describe human-agent-produced summaries. In this analysis I have addressed the issue of the often ambiguous hype surrounding summarizing software. All too frequently, it is openly or implicitly invested with a human-like intelligence.
By gauging its performance in tests in which summarization software competes directly with human-agent-produced summaries taken from textbooks preparing for the CPE (Cambridge Proficiency Examination), I have explored its computational competence and the current state of the supposed “intellectual” quality of the results generated, or the lack of both. Most present-day commercial summarizers, including those tested in this analysis, use the extraction method, in which the summarizer copies out (copies and pastes) key sentences from the original text. In contrast, the abstractive summarization method is based on natural language processing, meaning the software needs truly to “understand” and “interpret” the original text and reproduce its most important information in its own “words” in an abridged form. Present-day commercially available summarizing software using the abstraction method cannot do this satisfactorily, and if there are any non-commercial summarizers in operation, they are difficult to check on.
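To make the extraction method concrete, here is a minimal sketch of a frequency-based sentence extractor. It is not the algorithm of any product tested, whose internals are not published; the stop-word list, the scoring rule and the function names are all assumptions made for illustration. It shows why such output is “copied and pasted”: sentences are scored by counting words and the winners are reproduced verbatim, with no rewording at any point.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that", "on", "for"}


def split_sentences(text):
    """Naive splitter: breaks after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    """Lower-cased words with the stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def extract_summary(text, ratio=0.25):
    """Copy out the highest-scoring sentences, unchanged and in original order."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average document frequency of the sentence's content words.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    keep = max(1, round(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:keep])]
```

Nothing in this sketch interprets, condenses or paraphrases: every sentence returned is a verbatim copy of a sentence in the input, which is exactly the limitation of extraction described above.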
Users hardly ever get complete, connected and readable summaries
Product descriptions assure potential buyers that summarizers can determine what a text is about, pinpoint core messages and key concepts and thus reduce long texts to their essentials. One software company even wants to make users believe that its summarizer can analyze text in natural language, “taking into account its structure and semantic relationships” and even get “an in-depth understanding of the underlying idea”. Furthermore, readers are promised that they can spend considerably less time understanding the general meaning of any document or website by just reading machine-extracted summaries without missing key information. However, the tests have shown that the machine-reading-comprehension properties of summarization software lack accuracy, since users hardly ever get complete, connected and readable summaries.
None of the summarizers generated reliably consistent, complete and impeccable extracts to be used as first-stage drafts for human agent editing
According to software companies, summarizers are mainly used as a time-saving reading aid, a kind of complete executive summary which supposedly allows the reader to spend considerably less time understanding the general meaning of documents, getting familiar with their structure and reading without missing key information. In order to meet the highest of standards, they would have to deliver consistent results and generate faultless and complete extracts. However, as is often the case, theory does not square with practice at all, since the tests show that the summarizing software tested is incapable of generating acceptable summaries owing to a number of shortcomings outlined below. None of the summarizers tested was able to distinguish itself from the others in any conspicuous way, save in the number of irrelevant ideas generated, nor did any summarizer generate reliably consistent, complete and impeccable extracts which could be used as first-stage drafts for human agent editing.
Almost always, summarizers will extract the first sentence or the first two sentences because they have been programmed to do so since these are deemed to be lexically loaded and contain the essence of the text. If the first sentence contains conflicting, subordinate ideas or anecdotal content, it can make the summary less useful if not downright wrong with the summarizer extracting the negligible first sentence(s) at the expense of more salient ideas which may then not be extracted because of settings limiting the choice. This matter is aggravated when the first sentence is long and contains trivial subordinate “ideas”. Summarizers cannot recognise these irrelevant parts and do not leave them out as human agents would. The latter holds true for all compound sentences extracted.
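The effect of this lead-sentence rule can be illustrated with a small sketch. The function below is hypothetical and does not reproduce any tested product; the salience scores are made-up numbers standing in for whatever a summarizer computes. It shows how, under a tight limit on the number of sentences, an anecdotal opening sentence can displace the one sentence that carries the key idea.

```python
def lead_biased_extract(sentences, scores, budget):
    """Always keep the first sentence, then fill the remaining budget by score."""
    if not sentences or budget < 1:
        return []
    chosen = {0}  # the lead sentence is selected unconditionally
    rest = sorted(range(1, len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen.update(rest[: budget - 1])
    return [sentences[i] for i in sorted(chosen)]


sentences = [
    "When I was a child, my uncle once told me a story.",   # anecdotal opener
    "The study found that the drug halves recovery time.",  # the key idea
    "Further trials are planned.",
]
scores = [0.1, 0.9, 0.4]  # invented salience scores, for illustration only

# With room for a single sentence, the opener wins and the key idea is lost.
print(lead_biased_extract(sentences, scores, budget=1))
```

Raising the budget to two sentences recovers the key idea in this toy case, but the anecdotal opener still takes up a slot that a human agent would have given to something more relevant.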
Extracted or filtered out sentences lack cohesion and resemble bitty bits scattered across the pages
Filtered out or extracted sentences lack unity; they are disjointed and scattered in list form across the page or highlighted in colour in similar fashion. Most of them have the appearance of brute-force-copied-and-pasted text fragments, with the summaries generated by Microsoft Word’s summarizing function being the only exception. The latter compresses the sentences selected into impressive-looking paragraphs, thus making it seem that a lexical interrelationship between the “key” sentences selected is preserved. However, a more refined analysis shows that this may only be partially true. The fact that contextually unconnected sentences are placed one after the other under the false pretence that text cohesion is created or retained does not make a better summary or make reading any easier. In one case a “novel” but wrong or misrelated grammatical relation was established which was not present in the original and which substantially changed the meaning of the extract under investigation. It is safe to assume that this is no isolated incident, particularly when sentences begin with a pronoun.
Routinely, the majority of the supposed key sentences extracted are of minor importance or completely irrelevant. Different summarizing programs filter out different “key elements”, and in one case the most important idea was completely missed out or “overlooked” by all four summarizers tested. Looking at it from an end-user’s perspective, one can reasonably expect that all summarizing software copies out identical key sentences. Nonetheless, all too often there is too great a difference in the sentences extracted. In one randomly chosen case, there was only a 30% agreement on the text extracted (measured in number of words) between two summarizers. With longer texts, results were even more varied, which casts some doubt on the algorithm-based selection mechanism employed by different software makers.
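One way such a word-based agreement figure could be computed is sketched below. The exact formula is an assumption: here, agreement is the number of words in sentences selected by both summarizers, divided by the number of words in sentences selected by either. The function names are hypothetical, and both extracts are assumed to consist of sentences copied verbatim from the same original.

```python
def word_count(sentence):
    """Number of whitespace-separated words in a sentence."""
    return len(sentence.split())


def agreement(extract_a, extract_b):
    """Share of all extracted words that belong to sentences both extracts contain."""
    a, b = set(extract_a), set(extract_b)
    shared = sum(word_count(s) for s in a & b)
    total = sum(word_count(s) for s in a | b)
    return shared / total if total else 0.0


# Toy extracts: one sentence in common, one unique sentence each.
s1 = "one two three four"
s2 = "five six seven"
s3 = "eight nine ten"
print(agreement([s1, s2], [s1, s3]))  # 4 shared words out of 10 extracted
```

Two summarizers that copied out identical sentences would score 1.0 on this measure, and two with nothing in common would score 0.0.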
When the nature of the original text makes summarization software look good
There are two examples showing that the nature of the original text can make any summarization software look good. In one of the tests an acceptable level of computational sentence extraction was achieved. The other example was appended to an academic draft paper for easy verification. In the first case, the original contained a high number of equally relevant key ideas, so that the choice of sentences extracted did not matter. As a result the summary is balanced, and even human agents could have made subjectively tinged choices without seeming to have missed out key ideas. The second example, from the academic draft paper, is written entirely in reported speech and gives the appearance of being connected and relatively coherent. It can be deduced that reported speech, or reporting structures with different introductory verbs or phrases serving as semantic links which provide local text cohesion, is “summarization software friendly” in general. It may also distort test results accordingly, as these semantic links were written by human agents and duly filtered out or copied by summarizers. Thus it would be an example of extreme partiality to pass this copying process off as a software achievement. All things considered, I think that the test results give rise to speculation about whether acceptable sentences extracted are just fluke hits and, therefore, no final conclusions can be drawn.
There is also the issue of the optimum text length for machine-generated summaries. When done by human agents, a full précis is usually about one third of the original in length, as opposed to partial or incomplete summaries which concentrate only on certain thematic aspects. It is safe to assume that this is the optimal text length for full summary writing since it has stood the test of time. Perhaps the standard setting for machine-generated text abridgments should be raised from 25% to 35%. Together with the next generation of AI software, this is likely to render better-quality extracts and provide a better balance of key elements extracted, particularly in long texts (1000 or more words of original text length).
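The difference between these settings is easier to judge in absolute numbers. The short calculation below converts each ratio into a word budget for an original text of 1000 words; the function name is hypothetical and rounding to whole words is an assumption.

```python
def word_budget(original_words, ratio):
    """Summary length in words for a given compression ratio, rounded to whole words."""
    return round(original_words * ratio)


for label, ratio in [
    ("25% standard setting", 0.25),
    ("one-third précis", 1 / 3),
    ("35% suggested setting", 0.35),
]:
    print(f"{label}: {word_budget(1000, ratio)} words")
```

For a 1000-word original this gives 250, 333 and 350 words respectively, so the suggested 35% setting leaves room for roughly 100 more words than the 25% setting and sits slightly above the length of a traditional précis.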
Summarizing software as a first-stage drafter for human-agent précis writing – currently an act of faith and not quality editing
Summarizing software is also meant to be a kind of first-stage drafter for human-agent précis writing, providing a short-list of ideas, with the human agent smoothing them or bringing them into a more acceptable, i.e. coherent, format. At least, this has been predicted by some linguists. The commercial software tested is not suited for this purpose, and I do not know if there are more sophisticated summarizing programs other than those commercially available. If summarizers are ever used as fully functional first-stage drafters, the role of human agents would be confined to just connecting and polishing sentences extracted by summarizers into a readable, coherent format. In this case, human agents would serve as mere text editors without having to read the original text themselves. Anything else would be self-defeating and make summarizing software redundant. This also poses the question of whether human agents may use machine-extracted draft summaries in good faith as a basis to rely on to produce coherent abstracts – be they on a higher level of abstraction or just barely edited machine extracts – without reading the full text, which would then be an act of faith and not an act of reason. Present-day summarizing software is not up to being used as a first-stage drafter, and I am very much interested to learn if the next generation of summarizers will still operate on a lower order of “thinking”.
Disconnected and incomplete summaries – a new way of processing information?
On the subject of the scattered, disjointed and incoherent sentences in software-generated summaries discussed in this analysis, there are new, related phenomena to be observed in other areas. According to the linguist Raffaele Simone, “a new way of processing information” has developed, marked by the predominance of less complex over more complex structures. Incoherent machine-produced summaries with disjointed sentences are certainly less complex than coherent human agent abstracts. In a wider sense, bulleted lists and the limitations of MS PowerPoint and similar software are further cases in point for this new way of processing information. Software used for presentations is conspicuous for its limited writing space, which is hardly suitable for carrying more complex ideas. Further examples of this new way of processing and presenting information can be found in some UK tabloid online newspapers, with “uncluttered” single sentences displayed with generous spacing but without paragraphs. In education, a new kind of exercise in language teaching, which favours matching and arranging unconnected or isolated sentences, has to a large degree replaced longer and more difficult comprehension exercises. Such exercises thus constitute less complex structures, facilitating quick visual perception of easy-format alphabetical information “at a glance”.
A retrograde evolutionary step?
In what way these developments are to the detriment of higher-order thinking capabilities in human agents cannot be objectively established at this point in time, owing to the absence of any studies readily available on this subject, save the reports on the alarming decline in average intelligence among 18-year-olds (2008). This was verified by two reliable German sources. Moreover, it stands to reason that a shift from traditional summary writing involving higher-order thinking skills to accepting machine-made, disjointed extracted sentences in summary writing, and (unintentionally) dismissing the training of abstract thinking faculties in education as negligible, may be an evolutionary throwback starting any time soon. However, I should point out that in no way am I insinuating that some kind of deliberate behavioural conditioning is going on to adapt human agent mental capacities to the limited, number-crunching properties of software.
User acceptance – what is really known about what users think?
With summaries generated by software being as unsatisfactory as they are, it is surprising that there are no verifiable test results or critical reviews readily available. Little is known about what users really think about the quality of summarizing software. Perhaps people have different views about what a key idea is, or they are satisfied with partial and irrelevant extracts when they find what interests them. Or they may fill in the gaps left in machine-generated summaries from their own prior knowledge and experience, thus correcting faulty summaries or supplementing missing information while reading them without bothering about quality. Maybe users assume summaries to be good, or it is their unshakeable belief in computer experts and their software which makes them accept anything machine-produced because, having grown up in a largely uncritical environment, they do not know otherwise. Furthermore, it cannot be ruled out that some users may want to vent their dissatisfaction with deficient summarization software but lack the ability to find the weak spots and articulate their frustration accordingly. More discriminating users may be resigned to putting up with what they deem to be barely mediocre software-generated summaries because they have become accustomed to not expecting too much.
What compounds this issue is the fact that the human brain tends to attribute sense to any “input”, meaning that even downright wrong summaries can be interpreted as “intelligent” and well-founded because users assume that the computer is infallible, and hence summaries make sense to the person reading them. This fact was confirmed in a test-series of trick-lectures which did not make any sense at all. Yet, educated native speakers found the lectures “comprehensible” and “stimulating” and believed in the authority of “Dr Fox”, an actor hired for this purpose.
Software generated summaries are far from being “intelligent”; they are difficult to read with little text cohesion, disjointed sentences scattered across the page and too many irrelevant sentences extracted or copied out
The evaluation of the test results with regard to the intellectual properties ascribed to summarizing software was, of course, a foregone conclusion. The difference now is that I have shown in some detail the difference between how human agent summaries are created and how software summarizations are generated. The machine-generated summaries are far from being “intelligent”; they are difficult to read, with little text cohesion, disjointed sentences scattered across the page and too many irrelevant sentences extracted or copied out. Generating extracts the way it does at present, summarization software is dispensable, the main reason being that it completely lacks higher-order “thinking skills”, properties indispensable for recognizing key messages and conceptual ideas in a text. At present, summarization software cannot even be used to produce first-stage drafts for human agent editing. I think that users will have to wait for the next generation of AI software before summarizers can be fully relied on. Hopefully, the next generation of software will take data processing to a true level of natural language processing. Until such time, one had better use the advanced search function in search engines to pre-select topics of interest and rely on one’s own close reading and / or speed reading techniques.
1 SUMMARIZATION: SOME PROBLEMS AND METHODS, John Hutchins, University of East Anglia: [From: Meaning: the frontier of informatics. Informatics 9. Proceedings of a conference jointly sponsored by Aslib, the Aslib Informatics Group, and the Information Retrieval Specialist Group of the British Computer Society, King's College Cambridge, 26-27 March 1987; edited by Kevin P. Jones. (London: Aslib, 1987), p. 151-173.]