Chapter 1: Understanding Generative AI
The biggest issue related to education and AI is not the fact that students are using AI. Rather, the biggest issue is that both students and educators share the same basic misunderstanding of what AI is and is not, and what it can and cannot do. This book aims to address that fundamental misunderstanding and argues that education in the humanities is even more crucial in the era of generative AI, not less. We offer concrete suggestions for assignments that can be readily deployed in classrooms today. The assignments we propose emphasize and preserve the sacred core of true exploration that characterizes the humanities—things like the reasoning process, effective writing skills, and historical context. They remove the need for educators to police student AI use, and do so with no fundamental reduction or alteration of the skill set that students hone through those assignments.
This book offers a twofold message to educators in the humanities who have been in a state of increasing panic and despair since the appearance of ChatGPT. On the one hand, the skills of information literacy and critical thinking, of reading and writing well, are more valuable now than ever before. As you will see later in this book, not only will that not change, but it cannot change. AI as it exists now, and as it will foreseeably exist for a long time to come, is not going to replace you; in fact, it requires those very same skill sets, both in the creation and training of AI and in a human’s engagement with it.
On the other hand, this is no time for business as usual. New technology has always had a transformative impact on education, and AI is no different in this regard. The invention of writing, of papyrus, of television, of computers, of the Internet—each has contributed in positive ways even as it has also been lamented for what was being lost. In that sense, this book serves as a wake-up call (or a call to action) for educators across disciplines.
A number of books have appeared about AI and education since the release of ChatGPT. Surprisingly few of them offer concrete suggestions for assignments that will work in humanities courses in this context.[1] Some are replete with recommendations for handing aspects of teaching over to AI, from the questions students are asked to the grading of their work. It is striking that there are so many books and software developers trying to appeal to educators to do precisely what educators are trying to get students not to do—namely, have AI do for them that which is their own responsibility. You can even find books that have been created or co-created by ChatGPT on this subject. In this book you will find something very different.
A Quick Primer on LLMs
You may have already heard or read about what a large language model (LLM) is in your own investigations. Some people recognize the technology more readily by the name ChatGPT, though there are many other examples: Claude, Gemini, Bard, and DeepSeek, to name just a few. For writing convenience, we will refer to these collectively as LLMs or will focus on ChatGPT as the most familiar example, unless the context requires clarification or the example we are giving involves the use of a specific other LLM. Even if you haven’t heard of anything related to LLMs, the following will provide an accessible, reliable description of the technology itself. If you have used them, in what follows you’ll get an explanation of what is going on behind the scenes, what makes them tick.[2]
An LLM is one of many techniques used in the general area of AI. At its core, AI is an umbrella term for computer-based techniques that computer scientists use to solve very hard problems, in particular problems that take a very long time to solve correctly. For example, it is extremely hard to schedule all the flight routes for an airline in order to maximize profits. Such a schedule would have to account for crew availability, decide which routes to fly (hopefully popular ones), minimize fuel costs, build in downtime for aircraft maintenance, and juggle a hundred other parameters that feed into such a complex process.
If we wanted to get a correct answer to this scheduling problem for just a hundred airports, it would take SEVERAL HUNDRED MILLION YEARS to calculate.[3] By that time, all pilots, flight crew, and passengers would probably have made other arrangements. So, what do human beings do in this case? Well, we guess. Don’t get me wrong, we make informed and reasonable guesses based on our understanding of the problem at hand. But then we just tinker around until we get something workable. That solution might not be optimal for maximizing profits, but we will have a decent plan before we are all dead. An admirable goal.
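If you want a feel for why brute force is hopeless here, a back-of-the-envelope sketch is enough. It is much cruder than the estimate in the note above and purely illustrative: merely counting the possible orderings of one hundred airports already produces a number with more than 150 digits.

```python
# Purely illustrative: counting the possible orderings of 100 airports.
# This is not the real airline-scheduling computation, just a taste of
# how quickly the number of options explodes.
import math

orderings = math.factorial(100)              # 100! possible orderings
print(f"{orderings:.3e}")                    # roughly 9.3e+157

# Even checking a billion orderings per second on a billion machines:
seconds = orderings / (10**9 * 10**9)
years = seconds / (60 * 60 * 24 * 365)
print(f"{years:.3e} years")                  # still an absurdly long time
```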
Taking this example a bit further, our solution to this problem in the real world was to use hub airports to simplify scheduling. The idea is simple: Send everyone to a central place, then reroute everyone from there to their final destination. This plan is inefficient—after all, now each passenger occupies two seats on the way to their destination, the journey takes a lot longer, and the airline incurs extra costs in running two flight crews and paying twice the fuel costs, while also having to discount the fare to entice passengers in the first place! Nevertheless, it is a plan that works.
The use of hubs simplifies the problem to make it manageable, and thus computable. To get an AI to actually produce speech because it understands and thinks as a human does is a similarly hard problem that no one can currently solve. AI programmers thus found a workaround: give an AI so many examples of speech that it can imitate its patterns through an essentially mechanical process. Thus an LLM processes your prompt (often called an input or query) and generates a sequence of words that is “highly likely” to be a relevant response to the query. Sometimes, its guess is pretty good, and we get something we are impressed with. Other times, it’s wrong or nonsensical, and we are not so impressed.
Let’s dive a bit deeper into how an LLM generates speech. And since we’re writing a book about how to teach humanities courses, let’s use a human example. Imagine how a child learns to speak. Parents expose the child to conversation after conversation, pointing at objects, colors, and people to describe them. We pantomime verbs and use mimicry to explain how some words represent actions and others don’t. Eventually, parents give up in an exhausted heap, and invariably, it is then, and only then, that the child will speak their first word.[4]
Success! We laugh, we cry tears of joy, and we make scrapbooks with pictures of our child’s victory in communicating. The child will then experience more words, in more contexts, and learn from those cues. What they are doing is sophisticated pattern-matching: understanding which sounds represent words, concepts, and ideas, along with the critical information about how they fit together. Over time, the child will develop from single-word announcements to sentences. Even later, they will learn how to string sentences into brief exchanges and, eventually, full conversations. The human child will learn to reason and will ultimately become a thinking adult, together with all the creativity, ingenuity, and cleverness that humans can muster.
LLMs “learn” in much the same way, but a better term to describe this learning process is to say that an LLM is trained. We emphasize this distinction here because, unlike a child, an LLM does not have any understanding of the text patterns it produces—it merely parrots the data upon which it was trained.[5] Furthermore, we also want to make clear that, whenever possible, we will avoid personifying an LLM in this book because, also unlike a child, it has no capacity to think.[6] A child (eventually) conveys information in response to their thought process. An LLM regurgitates words that often appear together, and at least as far as today’s technology is concerned, that is all it can ever do. A human child has the inherent capacity to learn to use language to convey their own thoughts and ideas. An LLM has nothing equivalent to this.[7]
We train LLMs on vast repositories of text data: books, web pages, news articles, research papers, essays, poetry, legal briefs, et cetera. Anything that we’ve ever written down (and thereafter digitized) is fair game for an LLM training dataset. An LLM takes that large body of stuff and infers grammar, vocabulary, and text construction from these examples. It is creating an ad hoc set of rules or patterns by which it will generate text. Training a model in this way involves billions of examples, and the more data it has at its disposal, the better it can predict which word ought to come next in a sentence, or even which sentence or paragraph ought to come next in a longer text.
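To make the “predict the next word from examples” idea concrete, here is a drastically simplified toy sketch. It merely counts which word follows which in a tiny made-up training text and then generates words accordingly; a real LLM does something vastly more sophisticated with a neural network and trillions of words, but the spirit of “learn patterns from examples, then predict what comes next” is the same.

```python
# A toy sketch, not how any real LLM is built: "train" on a tiny text by
# counting which word follows which, then predict likely next words.
import random
from collections import defaultdict, Counter

training_text = (
    "the cat sat on the mat . the dog sat on the rug . "
    "the cat chased the dog ."
).split()

# Count which word follows which in the training examples.
next_word_counts = defaultdict(Counter)
for current, following in zip(training_text, training_text[1:]):
    next_word_counts[current][following] += 1

def predict_next(word):
    """Pick a plausible next word, weighted by how often it appeared."""
    counts = next_word_counts[word]
    words, weights = zip(*counts.items())
    return random.choices(words, weights=weights)[0]

# Generate a short string of words, starting from "the".
word, output = "the", ["the"]
for _ in range(6):
    word = predict_next(word)
    output.append(word)
print(" ".join(output))   # e.g., "the dog sat on the mat ."
```

Notice that nothing in this little program knows what a cat or a rug is; it only knows which words tended to follow which in its “training data.” Scale that idea up enormously, and you have the core of what an LLM does.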
To process these examples of language, LLMs use a computational architecture known as a neural network. Neural networks are inspired by the structure of the human brain; they are composed of neurons organized in layers (for convenience), and together, they can identify patterns in the training data. Neurons, at least in computer science, are simple mathematical functions, but those functions help us record patterns. Think of each neuron as representing or storing one piece of evidence about the training data. When you put all that evidence together, you might be able to make a more expansive conclusion about the whole dataset.
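If it helps to see just how simple a single “neuron” can be, here is one written out as a few lines of code. The inputs, weights, and bias are entirely made up for illustration: a neuron just combines its inputs according to weights and squashes the result into a score.

```python
# One artificial "neuron": a weighted sum squashed into a score between 0
# and 1. The inputs, weights, and bias below are made up for illustration.
import math

def neuron(inputs, weights, bias):
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-total))   # the "sigmoid" squashing function

# One tiny piece of "evidence" about an input pattern:
print(neuron([0.5, 0.1, 0.9], weights=[0.8, -0.3, 0.5], bias=0.1))
```

That is all a neuron is. The power comes from wiring millions or billions of these together and tuning their weights.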
Another helpful analogy might be to think of each layer of neurons as a filter through which language data passes. Each layer is designed to capture a different aspect of language, at a different level of detail. For example, an early layer might focus on detecting basic syntax, such as whether a word is a noun or verb. A later layer might recognize words that tend to appear together, like “peanut butter” or “taco bell.”[8] Ever deeper layers might determine the sentiment of a sentence (positive or negative) or the relationship between different concepts in a portion of text.
Ultimately, the more layers a network has, or the deeper the network is, the more refined it is likely to become in its pattern-finding mission. The depth of a network corresponds roughly to its computational complexity; in other words, the deeper the network, the longer it takes to compute an answer. The neurons in these layers are connected to one another in a weighted fashion, which means that some patterns matter more, whereas others matter less. In this way, the network “learns” about the relative importance of the patterns and can predict coherent word patterns with greater accuracy. Training a model boils down to updating these weights in response to the training data.
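What does “updating these weights” actually look like? Here is a minimal, entirely artificial sketch: one weight, nudged over and over in whatever direction reduces the prediction error. Real training does essentially this to billions of weights at once, using calculus to decide which way to nudge each one.

```python
# A minimal sketch of training as weight-updating. One made-up weight is
# nudged repeatedly so that "prediction" gets closer to "target."
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # (input, desired output)
weight = 0.0                                       # start with a bad guess
learning_rate = 0.05

for _ in range(100):                               # many passes over the data
    for x, target in examples:
        prediction = weight * x
        error = prediction - target
        weight -= learning_rate * error * x        # nudge toward less error

print(round(weight, 3))   # ends up very close to 2.0, the pattern in the data
```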
OK, time to breathe. We need to talk about transformers. (Unfortunately not the Transformers from comic books, cartoons, or movies.) Transformers are a relatively recent addition to neural networks that allow an LLM to emphasize or focus on different parts of a block of text (such as a sentence or a paragraph) and adjust the words’ weights in particularly meaningful ways, even when those words are far apart or out of order. In computer science, we call this concept self-attention, and it allows an LLM to keep track of the context of language much better than earlier approaches could. Think of this as the LLM being able to remember what happened in one paragraph while reading the next one, and you’ve got the right idea of what’s going on. Another way to describe this is to say that transformers allow us to group words or phrases based on their contextual distance from one another, regardless of their physical distance in the text.
For example, consider the sentence “Humpty Dumpty sat on a wall.” A transformer is more likely to recognize that the phrase “Humpty Dumpty” is closely related contextually to “sat.” Later, when poor Humpty falls, a transformer can infer that something has changed about the noun it identified earlier. In parallel with this observation, it might also recognize that “wall” has unusual importance in this sentence, rather than simply being a peripheral object in the scene.
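Here is a toy sketch of the scoring step at the heart of self-attention. The tiny word “vectors” below are invented purely for illustration (a real transformer learns much larger ones from data); the point is simply that each word’s vector is compared with every other word’s, and the comparisons are converted into weights that say where to focus.

```python
# A toy sketch of self-attention scoring. The tiny vectors are made up;
# real transformers learn much larger ones from training data.
import math

def softmax(scores):
    """Turn raw comparison scores into weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Compare two word vectors: bigger means 'more related.'"""
    return sum(a * b for a, b in zip(u, v))

words = ["Humpty", "Dumpty", "sat", "on", "a", "wall"]
vectors = {
    "Humpty": [1.0, 0.9, 0.1],
    "Dumpty": [1.0, 0.8, 0.2],
    "sat":    [0.2, 0.9, 0.7],
    "on":     [0.1, 0.2, 0.3],
    "a":      [0.0, 0.1, 0.1],
    "wall":   [0.1, 0.3, 0.9],
}

# How much should the word "sat" attend to each word in the sentence?
scores = [dot(vectors["sat"], vectors[w]) for w in words]
for w, weight in zip(words, softmax(scores)):
    print(f"{w:>7}: {weight:.2f}")
# "Humpty," "Dumpty," and "wall" get noticeably more attention than "a" or "on."
```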
So, we have the nuts and bolts: a big pile of neurons (or more precisely, in computer science terms, perceptrons) linked up together with weighted scores representing relative importance in a neural network, which is organized in conceptual layers to aid in progressive filtering of data to glean language constructs. And on top of that, we’ve got Decepticons, um, transformers that build fancy clouds of words that aid in connecting content to concept in powerful ways. But we have to train the neural network appropriately, which is time consuming and hard.
In particular, the training process will fail if it is not done at scale. In fact, without the massive computational time we spend training the underlying neural networks, these AI models would be, well, terrible. When we said we take a large body of "stuff," we were actually talking about trillions of words drawn from an enormous number of documents! That's where the "large" in large language models comes from. For example, the amount of filtered plain text on which GPT-3 was trained was 570 GB.[9] However, increasing the amount of data isn't by itself good enough to get a high-quality model that responds well. An LLM also needs to be trained on diverse datasets that contain information about a wide variety of topics, represented in stylistically different ways, and conveyed in many different languages, to name just a few examples.
Conversely, sometimes a training dataset needs to be highly specialized so that it can answer a particular kind of question. For example, you wouldn’t want an LLM that is answering diagnostic questions for a doctor to have information about nursery rhymes that talk about heads, shoulders, knees, and toes.[10] The decision about which data to put into a training set is itself complicated, and when you hear about bias in AI or machine learning, this sort of a priori decision-making may be part of the reason for that bias.
Our discussion about guessing and about the contextual limitations of generative AI ties back to why computer scientists are unsure whether, how, and when AGI (artificial general intelligence) will happen. AGI refers to the capacity of an AI system to perform generalized and diverse tasks in the real world credibly enough to rival that of human intelligence. Almost everyone agrees that we’re making rapid progress toward creating AGI, but whether we’re 10 percent, 20 percent, or 70 percent of the way there isn’t at all clear. A big part of the rapid evolution of AI so far is the combination of a key algorithmic idea (transformers) and access to massively parallel and distributed computer networks that essentially brute-forced their way through data to achieve generative AI.[11]
On the optimistic side (which may be triggering for some), generative AI has demonstrated increasingly sophisticated handling of language. Furthermore, we will continue to develop hardware, such as more powerful graphics processing units (GPUs) and specialized AI chips, that will enable more complex computations, deeper neural networks, and more meaningful transformer filtering techniques. The AGI optimist points to examples of generative AI writing poetry, songs, and essays, or engaging as a thought partner on scholarly research.
The AGI skeptic acknowledges that evolving the highly specialized systems we have today (known as narrow AI) to a more general domain would require a lot more innovation beyond the technology that we are using and depending on right now. Not even computer scientists know what makes “general understanding” tick, and assuming that some advanced AI system will suddenly begin doing it is at best optimistic, and at worst wildly unrealistic. To achieve it, we would have to develop technology to replicate human flexibility and resilience, creativity, and abstract problem-solving capability.[12] Furthermore, all of our current AI systems assume that whatever we say is automatically correct. Try asking ChatGPT a complicated enough question, and it will default to asking you what you think about it instead of giving you a clear answer.
Regardless of your stance on the viability of AGI, even in the generative AI world we have today, it is essential to understand and prepare for the future. In the narrow scope of this book (not entirely unlike narrow AI), we focus on how that preparation might take shape in the context of teaching, learning, and engagement in the humanities. We encourage similar exploration in other disciplines, and even in other areas of your life!
A Note on DeepSeek
DeepSeek is a relatively new entrant among generative AI offerings. Though the purpose of this book is not to provide a comprehensive survey of the entire landscape of AI models and tools, we do want to lay out a few key comparisons between this recent development and other models. DeepSeek is designed to answer queries more quickly than ChatGPT. It does this primarily by answering each query with only a portion of the data on which it was trained: it first pre-filters the query to identify the most relevant data and then answers using only that particular partition. This approach computes answers more quickly and also allows the model to be trained to answer narrower questions much more deeply. A narrower partition allows for a more deeply layered computation, which hopefully gives a better, more relevant result from DeepSeek.
In other words, you would not encounter references to the nursery rhyme “head, shoulders, knees, and toes” when using DeepSeek for the kind of medical diagnosis we discussed earlier. DeepSeek is better suited to problems that are technical and require mathematical depth, such as those often found in STEM disciplines. ChatGPT, in contrast, uses all of its data on every single query. It is far more consistent in its output but takes longer to provide its results. Its responses naturally draw on more documents and more data, which tends to suit the more interconnected humanities disciplines, which are not nearly as siloed as STEM fields often must be.
In short, DeepSeek is like a precision specialty tool, optimized for specific tasks and domains, whereas ChatGPT is a Swiss Army knife, designed for versatility and adaptability across a wide range of applications.
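A heavily simplified sketch of the “route the query to the most relevant partition” idea may help. In a real system the routing happens inside the neural network itself rather than through keyword lists, so treat the following purely as an illustration of the concept, with made-up specialties and keywords.

```python
# Purely illustrative: route a query to the most relevant "specialist"
# partition. Real models route internally, not with keyword lists.
specialists = {
    "medicine": ["symptom", "diagnosis", "dosage", "fever"],
    "math":     ["equation", "integral", "proof", "prime"],
    "law":      ["statute", "precedent", "liability", "contract"],
}

def route(query):
    """Pick the partition whose keywords overlap the query the most."""
    query_words = set(query.lower().split())
    overlap = {name: len(query_words & set(keywords))
               for name, keywords in specialists.items()}
    return max(overlap, key=overlap.get)

print(route("recommended dosage for a patient with fever"))   # medicine
```

The payoff is the one described above: a smaller, more focused slice of the model can be searched more deeply and more quickly.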
Finally, the biggest difference with DeepSeek is that its code base has been made available to everyone free of charge. Anyone can download it, train it on their own data, and use it on their own computer. (You do have to know how to write or deal with code to do so, though.) In some sense, its developers have democratized AI for everyone with no perceivable profit motive. Additionally, since the full code is available, your friendly neighborhood computer science colleague can verify whether there is anything to worry about in running the program.[13] (Spoiler alert: there do not appear to be any issues so far, but you may prefer one privacy policy over another!)
A Quick Primer on the Limitations of LLMs
It is important to recall where we started our journey with LLMs—with a series of educated guesses about how to provide a sensible answer to a very hard problem that would otherwise take forever to solve properly. LLMs function in exactly this way. Generating an answer to a query that sounds coherent, conversational, and even at times introspective is a very hard problem. Frankly, achieving even one of those goals is a very hard problem.
An LLM solves that problem not by understanding the query’s meaning but by comparing the cloud of words in the query to the same cloud of words among the texts in its training data to generate a response to the query in the same style. This process is going on in the AI system whether the intended output is text (as with ChatGPT), music, sound, images, video, or something else. The AI is a machine that has learned through sheer computing power how to play the game of language, much as was done to get a computer to learn to play chess. The result is very cool, but the machine is not all knowing. It’s fancy, but it’s not nuanced or sophisticated. It is literally a brute-force tool that has simply memorized everything we’ve ever said and organized it to facilitate responses to queries. In terms of the traditional fears associated with AI, the sky is not falling, and we do not need to prepare to bow to our supreme AI overlords. Let’s dig into why:
- An LLM is a complex statistical calculator. You plug in your math problem (i.e., your query), the LLM churns a bit, and then it gives you an "answer." But unlike a true calculator, which will give you the right answer every single time, an LLM cannot promise to do so. Why? First of all, language does not normally have only one right response to a question. Even when it comes to a yes or no question, an "aye" or "nay" is not entirely beyond the realm of linguistic possibility. There is nothing in the pattern of language usage that would allow an AI chatbot to identify and provide one correct response to any prompt. It is stringing together words based on probabilities, which is a bit like guessing (see the short sketch after this list). When we respond to another human being, we draw on our knowledge and use words to express our thoughts. An LLM has no thoughts and has no mechanism to identify anything in the texts it was trained on as knowledge or factual. We trained it to calculate a likely combination of words and output them. Because there are multiple possible responses to most combinations of words, if you ask it to respond again, it might offer the same answer or a different one. Your guess about what it would do is as good as ours.
- An LLM can tell you things that aren’t true. Why? Because an LLM doesn’t know anything about truth. It knows about the propensity of certain words to be strongly related to one another. It gives those words a shake and sprinkles them out in sentence form for you to explore. In this way, it can create “hallucinations,” or, as I like to call them, mistakes—references that do not exist, misrepresentations of key facts (such as about the existence of Atlantis), or contradictory statements because they use the same cloud of words.
- An LLM does a poor job of speaking to a minority concept or opinion. Because an LLM picks up patterns in training data, the patterns it picks up most strongly are those that are expressed often. In other words, it does not do well when there is an uncommon opinion or limited training data about a topic. (If you have noticed LLMs improving in some such areas, that is because they have been further trained with specific data that incorporate that minority pattern or content.)
- An LLM cannot do anything useful if its training data has a blind spot where your query happens to land.
- An LLM does not have the ability to apply common sense to its answer. An LLM may say that it uses common sense, but that is because it is emulating patterns of words in human speech in which human writers referred to common sense. It does not engage in reasoning, and it cannot answer queries that involve abstract reasoning or complex problem-solving (unless those topics were addressed as such in its training data). Similarly, if an LLM appears to be reasoning, it is actually mimicking patterns of speech produced by humans, who can reason.
- An LLM cannot retain its context indefinitely. Eventually, it will “forget” the earlier parts of an exchange and either repeat its previous answers or fail to make the connections that self-attention needs in order to relate concepts across long stretches of text.
- An LLM is extremely resource intensive, even when performing the tasks we currently ask it to perform. It is expensive to both train and maintain.
- An LLM can generate text that has a tone or style, but it does not possess the capacity to actually feel those things or generate text with a particular intentionality. It mimics human speech. Because the text on which it trained expresses human emotion, it combines words in ways that would express emotion if humans used it. In the same way, it uses “I” even though it has no self-awareness of the sort we associate with someone speaking about themselves.
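Here is the sketch promised in the first bullet above: a toy illustration of “stringing together words based on probabilities.” The probabilities are invented for illustration; the point is that the answer is sampled from a distribution rather than looked up, which is why the same prompt can yield a different response the next time you ask.

```python
# A toy illustration: the "answer" is sampled from a made-up probability
# distribution, so repeated runs of the same prompt can differ.
import random

next_word_probs = {"yes": 0.55, "no": 0.30, "aye": 0.10, "perhaps": 0.05}

def respond():
    words = list(next_word_probs)
    weights = list(next_word_probs.values())
    return random.choices(words, weights=weights)[0]

print([respond() for _ in range(5)])   # e.g., ['yes', 'no', 'yes', 'yes', 'aye']
```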
So, if LLMs (and AI in general) are so limited, why are we so apprehensive? What is the path forward from here?
Contextualizing the Relative Strengths of LLMs for the Humanities
You will undoubtedly hear comparisons made between the impact of ChatGPT on the humanities and the impact of the calculator on math and physics. Yet there is a crucial difference. As we pointed out earlier, a calculator that is functioning properly will give the right answer each and every time. An LLM may or may not do so, because it isn’t an information calculator or anything similar. It is a speech imitator. It is a machine learning system that was trained on large repositories of human communication gathered from the Internet up to the time of its training. It has mapped the relationships between words throughout this dataset, and as a result, it can put together strings of words that follow the same patterns, without simply reproducing the precise text of any of its sources.
You may encounter descriptions of an LLM as a “glorified autocomplete.” This is a somewhat useful analogy but has serious flaws. Autocomplete offers you words based on the frequency of connections to what you just typed. But there are too many possibilities: autocomplete is useful when you are writing the most conventional and predictable things, and equally frustrating when you need to express something specific to your situation. An LLM is doing a different task, although it is based on the same statistical frequency of relationships among words. Like autocomplete, an LLM has no understanding of words. It can, however, imitate the patterns among words in human speech, having been trained on an extensive set of examples of human text. Because an LLM does not need to guess what you want to say, and merely says something that will almost certainly make sense based on its training data, it performs much more impressively than autocomplete.
This is why an LLM has such a propensity to confidently tell you things that are not true, to provide you with sources that do not exist, and in other ways to fabricate (a better term than hallucinate) answers.[14] The reason some have made the mistake of thinking LLMs are information providers is that they do provide information, and more often than not it is accurate information. But they do not know they are doing that, because they do not know anything. There are pieces of information woven into the patterns of human speech on which they were trained, and that information often shows up in their outputs. When it comes to academic references, LLMs sometimes mix and match author names, titles and subtitles of books and articles, and other details, creating incorrect citations. They are not BS-ing, at least not in the sense that our students do. Unlike our students, LLMs simply don’t have any way of grasping that the pieces of a reference must always remain together for them to be useful.[15] LLMs aren’t trying to be useful, because they aren’t trying to do anything. When we acknowledge that distinction, we can appreciate just how well they do the task they were trained to do: they imitate human speech in a very consistent and realistic manner.
The reason LLMs do such a good job of responding to typical humanities assignments is that the Internet is full of the content that educators expect students to produce. In fact, some students’ past assignments may be there, but even if not, there is a range of treatments of basic introductory topics in every field. Drawing on these may help a student in the short term, while overreliance on them will inevitably get that student in trouble. Compare AI grammar tools like Grammarly or those built into your word processing software. Given how many students express themselves poorly, an AI tool of this sort will often make suggestions that are improvements. But if every such suggestion is implemented mechanically, hilarious flubs will also find their way into the result, because the suggestions are based on patterns and not on comprehension of what a human wants to say. So too with LLMs. They will provide a student with an accurate summary of a novel or a philosophical concept much of the time. Sometimes they won’t, and so there is a strong likelihood that, sooner or later, a student who relied on LLM-generated content will submit something so ridiculously disconnected from their course that it reveals what they have been up to. That probability, however, does not eliminate the need to adjust our teaching in response to this new technology.
One reason we need to teach students not to rely on AI in inappropriate ways is that their future success may depend on their ability to innovate. Because AI is based entirely on patterns found in past data, suggesting radical new directions is something that AI is not well suited to do.[16] The few exceptions that come to mind are of the sort that prove the rule. AI’s discovery of new proteins that do not exist in nature happened not because of any initiative on the part of an AI, and the Nobel Prize for this work was awarded not to the AlphaFold AI system but to the people who developed and used it. If the same people had designed the system but not used it, the system would never have discovered anything on its own. If the program had been run but humans had not looked at the output, the significance of the new proteins among the results would never have been noticed or appreciated. What happened is that human researchers tasked an AI with exploring all the possible permutations of certain molecules and examined the results. Such use of AI by human innovators is where the most exciting developments of tomorrow will be achieved. This is why students absolutely must understand what AI is and learn to use it appropriately, and at the same time why they must learn to cultivate their distinctly human abilities.
Recent work on LLMs has included a focus on what is called the Google-Proof Q&A Benchmark (GPQA). Experts in a variety of fields were asked to come up with questions that would be impossible to answer simply through Googling. As of November 2023, the latest LLM had achieved a level of accuracy only slightly above what an ordinary person who was not an expert in the relevant fields could achieve by Googling and guessing. Both the ordinary human and the AI got a failing grade, whereas human content-area experts did much better (though they still found the questions difficult).[17] Hopefully you can now understand (if you could not before reading this book) why the ability of an LLM to do well on a law or medical school exam does not mean it is ready to replace lawyers or doctors. What lawyers, doctors, and other professionals do day to day is not to answer questions of the kinds they faced on exams. The knowledge that exams test is important but not sufficient or comprehensive. The wisdom, the practical application of knowledge, that lawyers and doctors and other human professionals need on the job involves things that LLMs cannot do and are not even being tested on, because such a test is impossible. The closest thing is perhaps those occasions when an LLM was asked to generate a brief for a court case and the result was lawyers facing penalties for proffering nonexistent court cases as precedent. The issue, as we have repeatedly emphasized, was not that ChatGPT fabricated nonexistent court cases. That is perfectly in keeping with its function and operation and should surprise no one who understands the technology. The issue was people treating a chatbot as a reliable source of information.
Even if a way were found to make an LLM chatbot output only accurate information, a couple of key points would still need to be emphasized in this context. One is that LLMs are always being evaluated in terms of their ability to answer questions that humans have already answered. They are not generating new knowledge. We will always need humans to do that. The other point is the challenge that this book focuses on. If the only questions that an LLM cannot reliably answer are ones that are hard or impossible even for graduate students, then LLMs will be able to produce passable answers to typical questions posed to K-12 and undergraduate students. To teach students at these levels, we cannot simply keep asking them to provide the sorts of answers we always have, in the types of assignments we always have. Yet they need to learn in the same ways we have always sought to get them to learn. It is a conundrum, but not an intractable one. The majority of this book explores ways you can continue to foster the kind of student learning you have always sought to foster, and do so not only as effectively as before but more effectively.
Resisting the AI Fictions of Science Fiction
As the COVID-19 pandemic forced educators to teach online, a number of people noted the uncanny similarity to the imagery and technological capabilities of “the future” presented in old sci-fi comic strips and TV shows (The Jetsons, for example). In these depictions, “going to school” meant looking at a screen and participating in a virtual classroom. It is safe to say that we are there right now: students can indeed go to school online. In addition, we have computers that can hold a conversation with us. We depend on automated tasks in many of our everyday activities. Some of us are secretly glad that our brand new car can parallel park for us.
Living as we do in an AI-enabled world, it is natural to ask whether we have arrived at a version of the future envisaged by science fiction. Perhaps a better question is which version of the future we are now living in. Science fiction has presented a whole array of possible futures and parallel universes, both utopian and dystopian. Some have been wildly speculative; others have been rooted in what seemed plausible to scientists. Because a lot of the hype around AI is informed by science fiction, before returning to the real world, let’s spend a few paragraphs discussing what people may assume (or hope or fear) we have actually created or achieved in terms of anticipated future technologies.
For example, it might seem that we are on the cusp of creating Rosie, the robot maid from The Jetsons.[18] In terms of today’s technology, Rosie seems like a combination of ChatGPT, a Roomba, and Amazon’s virtual assistant, Alexa. One could certainly envisage a robot vacuum that can also hold conversations by way of a large language model. However, an LLM’s ability to respond in a computer chat or even verbally does not mean that if you put an LLM in a robot body, it would be able to function as a fully autonomous agent. An LLM has no idea what words mean, nor that they correspond to objects in a real world. Placing it in a robot body doesn’t instantly change that. Rosie does not appear to have these limitations (if her interactions with the Jetson family are anything to go by), although she very clearly has others.
An LLM may also seem similar to the computer on the Starship Enterprise in Star Trek: The Next Generation. Indeed, in some ways an LLM is more capable of conversation. The computers on Star Trek, from the original series through The Next Generation, featured a voice interface that could answer questions in a very constrained and mechanical manner. If the generative AI chatbot Claude were installed on the Enterprise, it would be far more conversational. On the other hand, there would need to be a human fact-checker onboard as well to verify its outputs, because an LLM has no mechanism that guarantees the text it generates will contain accurate information. It is possible, and perhaps even highly likely, for an LLM to make factual statements in response to questions about topics well represented in the human text it was trained on. There is, however, no guarantee, and the fact that it is first and foremost an imitator of human speech means that it may make something up and offer it confidently as the answer to your question. Even given this limitation, a chatbot that can interact by voice and provide mostly correct information much of the time might still be more useful than the alternative: a crew member always standing ready to run something like an Internet search on the Enterprise’s computer. In emergencies, you may need an answer that is probably correct, and need it fast.
Now, you are probably asking yourself why we can’t have a computer like the one in Star Trek. No one is saying that we can’t, only that LLMs are not what that show envisaged, nor indeed are they quite like anything that science fiction has envisaged. LLMs can interact with us, and in that sense they might seem like or even better than the AI in a lot of television and film. Nevertheless, LLMs have a key limitation, as we have already indicated, resulting from the fact that they merely imitate human speech. They are not minds that engage in human speech the way that humans do. Moreover, every instance of AI, be it generative (ChatGPT, Dall-E, Nightcafe, Sona, et cetera) or search based (whether Google, Bing, or a library catalog or database), is designed to perform one highly specific task. Our current AI technology cannot act as humans do for general or unspecified problems.
One might wonder, then, how far away we are from artificial general intelligence (AGI), which we defined earlier as “the capacity of an AI system to perform generalized and diverse tasks in the real world credibly enough to rival that of human intelligence.” Here, we emphasize that diverse is a key differentiator between what we currently have and what we fear (or aspire toward) from science fiction. A machine, tool, or AI designed to do one task very well (such as play chess) is fundamentally distinct from AI that can solve any general problem and, moreover, can do so in a high-fidelity manner. Generative AI technology might give a layperson pause or convince them we’ve achieved AGI, but as we have already explained, this impression quickly evaporates when you understand its underlying technology.
In fact, even computer scientists don’t know exactly when AGI might appear; estimates range from one to two years all the way up to a century or more. Some even believe that such a goal is impossible to achieve.[19] The results of a recent (unpublished) poll of active computer science researchers at a 2024 conference indicate that about one-third of respondents thought we are more than twenty years from achieving AGI, another third thought we would get there within twenty years, and the final third within ten years.[20] In other words, even for those who are building the technology, it is very difficult to anticipate the future. Interpret that data point as you like, keeping in mind that similar estimates have been offered in the past without those hopes and expectations coming close to fulfillment!
Part of the uncertainty stems from confusion about the definition of AGI itself. Defining the terminology around an as-yet unachieved goal is challenging. For example, some of the claims that we will have full AGI “soon” are not entirely what they seem. As any good humanities professor knows, one must evaluate sources of information critically. The technology news (or noise, as it often turns out to be) is no different. Elon Musk, for example, has claimed AGI is on the horizon, perhaps even as early as 2029, but that the AI won’t be “useful or interesting” without human involvement.[21] From a computer science perspective, this means that the AGI Musk referred to isn’t truly AGI! An AGI that needs to check its answers with a human doesn’t really satisfy the fantasy that we crave (or fear). Furthermore, market-driven, self-declared tech geniuses have ulterior motives when making statements about both the nature and the potential of AI. Take these claims with a grain of salt, just as computer science researchers do.
Nowadays, it is fascinating to read a book like Joseph E. Aoun’s Robot-Proof: Higher Education in the Age of Artificial Intelligence, which was written not that long before the release of ChatGPT.[22] From our perspective, it is clear that Aoun was right in highlighting precisely the things that make humans and human learning unique, even and perhaps especially in the era of AI. It is also clear that a book published less than a decade ago may not offer any concrete practical suggestions for dealing with current technology. Similarly, the book you are reading now may seem dated a decade from now, or it may be that all the concrete suggestions and specific recommendations are as applicable as ever. We cannot know which of those will be the case for certain. We nevertheless write it with absolute confidence that the principles and approach articulated in this book will remain true, because they are informed not just by the latest developments in computing or the humanities, but by a deep understanding of the technologies under discussion and of human learning. We also wrote this knowing that however transitory or durable any specific assignment types and activities may be, there is a need for this book now, and it will help educators for however long these observations prove useful. We expect them to be useful in the long term, but we are absolutely certain they are useful now. That is enough for us; we hope it is for you too.
- Priten Shah, AI and the Future of Education: Teaching in the Age of Artificial Intelligence (Wiley, 2023), offers a big-picture view of focus and approach but few specifics about assignment redesign. ↵
- Although the authors are technophiles with expertise and experience using the latest technology, they also both remember a time when watches made this sound. Hopefully this expression is still meaningful even to readers who have always only used their phone to check what time it is. ↵
- That figure is a gross underestimate: Even if we used every computer, phone, and abacus on Earth at once, the time required would exceed 10^127 years, which is 1 followed by 127 zeros. For reference, the number of atoms in the observable universe is “only” 10^82. ↵
- At least, that’s how it felt when my daughter chose to speak. My spouse and I were very tired. One might argue we still are, even though it has been many years since those days have passed. (This is one of the places where we considered specifying which of us was speaking, until the other author confirmed that he also had a similar experience and is equally tired.) ↵
- There is a helpful discussion of the differences between machine and human learning in the Santa Fe Institute podcast, “Episode 4: Babies vs. Machines.” https://www.santafe.edu/culture/podcasts/episode-4-babies-vs-machines ↵
- We tried and we failed, not because it is impossible to avoid anthropomorphizing LLMs, but because the only way to forego these metaphors and analogies entirely is to use highly technical computer science jargon in its place, which we assumed most readers don’t want. If you crave and/or understand this jargon, you probably don’t need this chapter and can skip ahead. ↵
- Emily M. Bender and Alex Hanna, The AI Con: How to Fight Big Tech’s Hype and Create the Future We Want (Bodley Head, 2025), pp. 28-31, provides a helpful explanation of why people mistakenly think LLMs must be sentient. In short, an LLM uses “I” and imitates the speech we associate with other people, and this creates the illusion that what is happening behind the scenes must be akin to what happens in human minds. ↵
- It’s clear that I must be hungry while writing this. It is also presumably clear from the lack of capitalization that this reference is to a sound-making device used to summon people when tacos are available, and not to any trademarked restaurant name or anything else of that sort. Legal disaster averted! ↵
- For the quantity of textual data used, see Tom B. Brown et al., “Language Models Are Few-Shot Learners,” arXiv.org, May 28, 2020, https://arxiv.org/abs/2005.14165, p. 8. We have already explained what GPT stands for. Adding numbers afterwards indicates successive models and updates. If the amount 570 GB isn’t meaningful to you, you have the authors’ permission (assuming you are reading a copy of this book that you own) to cross that out and write in “HUGE” instead. Also note that we have numerous references to papers on arXiv.org, where it is the practice for scientists to share working papers. This is the norm in the sciences, including computer science, but may be unfamiliar to humanities professors. It is the best place to find the absolute latest work being done on the study of artificial intelligence, among other things. ↵
- Quite the catchy song for the kiddos, but useless to a doctor. ↵
- Brute force is a technical term in computer science, not totally unrelated to its wider meaning. If you’re curious about the distinction, you can look it up! ↵
- AI and AI system are used interchangeably in this book, because artificial intelligence is a characteristic which some systems have. Or, you know, because you’d get bored if we didn’t change things up occasionally. ↵
- At the time this book went to print, you could download it here: https://github.com/deepseek-ai/DeepSeek-V3. You may still be able to, but these authors were not able to predict the future stability of this download link. ↵
- On this, as pertains to the potential of LLMs to write academic articles, see Edisa Lozić and Benjamin Štular, “Fluent but Not Factual: A Comparative Analysis of ChatGPT and Other AI Chatbots’ Proficiency and Originality in Scientific Writing for Humanities,” Future Internet 15, no. 10 (2023): 336, https://doi.org/10.3390/fi15100336. ↵
- I have a colleague in Philosophy who insists that BS, in its technical sense used in his field, is indeed the most appropriate term. If you are a philosopher you may want to discuss this with your students. See the use of the term in Arvind Narayanan and Sayash Kapoor, AI Snake Oil: What Artificial Intelligence Can Do, What It Can’t, and How to Tell the Difference (Princeton University Press, 2024), pp.139-141, 197. ↵
- Teppo Felin and Matthias Holweg, “AI Thinks Differently than People Do: Here’s Why That Matters,” Harvard Business Review, December 10, 2024, https://hbr.org/2024/12/ai-thinks-differently-than-people-do-heres-why-that-matters. ↵
- David Rein et al., “GPQA: A Graduate-Level Google-Proof Q&A Benchmark,” arXiv.org, November 20, 2023, https://arxiv.org/abs/2311.12022. ↵
- When the character first appeared her name was spelled Rosey. Yes, the authors care about Jetsons trivia. ↵
- We will use our authors’ privilege in this case to dodge clarifying exactly where our beliefs lie on the AGI arrival spectrum. In this way, we can both placate any upcoming AI overlords and simultaneously not seem foolish if it never comes to pass. It’s an intellectual insurance policy, if you will, and one that we got for a very reasonable price! ↵
- Tao Feng et al., “How Far Are We From AGI?” arXiv.org, May 16, 2024, https://arxiv.org/pdf/2405.10313v1. ↵
- The source is a tweet, still there at the time of writing. It is quoted in many places, including “Elon Musk Says AI Will Be Smarter than Any Human Next Year,” Economic Times, March 13, 2024, https://economictimes.indiatimes.com/tech/tech-bytes/elon-musk-says-ai-will-be-smarter-than-any-human-next-year/articleshow/108463055.cms?from=mdr. ↵
- Joseph E. Aoun, Robot-Proof: Higher Education in the Age of Artificial Intelligence (MIT Press, 2017). ↵