CC 315 Textbook

This is the textbook for CC 315.

Subsections of Homepage

Welcome!

This page is the main page for the introductory material.

Subsections of Introductory Material

CC 315 Syllabus

CC 315 - Data Structures & Algorithms II

Previous Versions

Instructor Contact Information

  • Instructor: Safia Malallah (safia AT ksu DOT edu) _I use she/her pronouns. If you need assistance, please send an email to safia @ ksu.edu, cc’ing cis115-help@KSUemailProd.onmicrosoft.com . Avoid using the Canvas email function, as emailing this way ensures that your message reaches all instructors for this course, guaranteeing a swifter response. You can expect a response by the end of the next business day; feel free to reach out again if you don’t receive a response within 24 business hours.

  • Office: DUE 2161

  • Virtual Office Hours: Schedule a meeting with me: https://calendly.com/safiamalallah Appointments held via Zoom .

Graduate Teaching Assistants

  • TBD

Prerequisites

  • CC 310 - Data Structures & Algorithms I

Course Overview

Advanced data structures and related algorithms. Formal software development methods and software engineering fundamentals. Introduction to requirements analysis processes that provide the specification of algorithmic requirements.

Course Description

This course introduces advanced data structures, such as trees, graphs, and heaps. Several new algorithms using these data structures are covered. Students also learn software development methods and software engineering fundamentals and use those skills to develop projects of increasing size and scope effectively.

Major Course Topics

  • Data Structures
    • Trees
    • Graphs
    • Heaps
  • Algorithms
    • Graph Searching
    • Shortest Path
    • Minimal Spanning Tree
  • Requirements Analysis
  • Application to Domain Areas

Course Structure

These courses are being taught 100% online, and each module is self-paced. There may be some bumps in the road as we refine the overall course structure. Students will work at their own pace through a set of modules, with approximately one module being due each week. Material will be provided in the form of recorded videos, online tutorials, links to online resources, and discussion prompts. Each module will include a coding project or assignment, many of which will be graded automatically through Codio. Assignments may also include portions which will be graded manually via Canvas or other tools.

The assigments for this course is delivered through Python programming language.

All activities are individual effort. ‘Group’ work is not permitted.

For a Better Score

Students are granted the opportunity to

  • Student can have the chance to redo One coding project for a better score.
    • Upon selecting the project for revision, students are required to send an email to the instructor.
    • A 24-hour window is provided for students to resubmit the improved work.
    • Requests for project redo will be accepted until week 15 of the course.
  • One late penalty will be dropped

Codio

You are expected to use the Codio Editor. It is deliberately feature poor to place the burden of programming syntax, vocabulary and logic flow on the student. DO NOT CUT AND PASTE into the editor while coding your projects.

Exception, you may cut and paste from Codio, so if you accidently delete the starter code, or want to modify a section of code you made in the tutorial. INCLUDE a comment stating from where you ‘sourced’ the pasted section.

from Codio tutorial 3.4.P.7 with open(sys.argv[1]) as input_file:

try:     

    reader = input_file.readlines()

except:

    reader = sys.stdin

Grading

In theory, each student begins the course with an A. As you submit work, you can either maintain your A (for good work) or chip away at it (for less adequate or incomplete work). In practice, each student starts with 0 points in the gradebook and works upward toward a final point total earned out of the possible number of points. In this course, each assignment constitutes a portion of the final grade, as detailed below:

  • 70% - Codio Programming Projects
  • 30% - Codio Tutorials and Quizzes

Up to 5% of the total grade in the class is available as extra credit. See the Extra Credit - Bug Bounty assignment for details.

Letter grades will be assigned following the standard scale:

  • 90% - 100% → A
  • 80% - 89.99% → B
  • 70% - 79.99% → C
  • 60% - 69.99% → D
  • 00% - 59.99% → F

How Project Grades are Assigned

The Projects you submit in Codio may have both an automatic and manual grading component.

Automatic Components: Codio automatically checks certain aspects of your code’s structure and functionality. The grade assigned by Codio is generally the ceiling of the score you can receive.

Manual Components: Once submitted your code may be reviewed and deductions take for:

Forbidden statements/Required statements: Some assignments may prohibit the use of certain built in functions and methods. Others may require the use of certain libraries and methods. Using prohibited statements and or skipping required statements will result in a ZERO for the project.

Cutting and Pasting into the Codio IDE (Project assignments): You are expected to do your work in Codio. Cutting and pasting from outside Codio is prohibited, may result in a zero and will trigger a closer plagiarism review.

Running from the terminal: Codio does not typically check for proper terminal input behavior. A project that does not compile and/or run from the terminal will result in a 25% reduction in your final score. The project should work for the sample inputs, and should not crash or print extraneous information.

Having extra methods/missing required methods: up to 10% deduction

Style Guide: see Pages

Late Work

Warning

Read this late work policy very carefully! If you are unsure how to interpret it, please contact the instructors via the help email. Not understanding the policy does not mean that it won’t apply to you!

Since this course is entirely online, students may work at any time and at their own pace through the modules. However, to keep everyone on track, there will be approximately one module due each week. Each graded item in the module will have a specific due date specified. All of the components of a module will be subject to the late policy if the module is submitted late. This penalty will be assessed via a single separate assignment entry in the gradebook, containing the sum of all grade reductions in the course for that student. Late assignments will not be accepted if they have been submitted more than 3 days after the due date, unless a special reason is provided and prior approval for an extension has been obtained before the due date.

For the purposes of recordkeeping, the submission time of the confirmation quiz in each module will be used to establish the completion time of the entire module in case of a discrepancy. This is because Codio may update submission times when assignments are regraded, but the quiz in Canvas should only be completed once.

However, even if a module is not submitted on time, it must still be completed before a student is allowed to begin the next module. So, students should take care not to get too far behind, as it may be very difficult to catch up.

Finally, all course work must be submitted on or before the last day of the semester in which the student is enrolled in the course in order for it to be graded on time.

If you have extenuating circumstances, please discuss them with the instructor as soon as they arise so other arrangements can be made. If you find that you are getting behind in the class, you are encouraged to speak to the instructor for options to make up missed work.

Incomplete Policy

Students should strive to complete this course in its entirety before the end of the semester in which they are enrolled. However, since retaking the course would be costly and repetitive for students, we would like to give students a chance to succeed with a little help rather than immediately fail students who are struggling.

If you are unable to complete the course in a timely manner, please contact the instructor to discuss an incomplete grade. Incomplete grades are given solely at the instructor’s discretion. See the official K-State Grading Policy for more information. In general, poor time management alone is not a sufficient reason for an incomplete grade.

Unless otherwise noted in writing on a signed Incomplete Agreement Form , the following stipulations apply to any incomplete grades given in Computational Core courses:

  1. Students may receive at most two incompletes in Computational Core courses throughout their time in the program
  2. Students will be given 6 calendar weeks from the end of the enrolled semester’s finals week to complete the course
  3. Any modules in a future CC course which depend on incomplete work will not be accessible until the previous course is finished
  4. For example, if a student is given an incomplete in CC 210, then all modules in CC 310 will be inaccessible until CC 210 is complete
  5. Students understand that access to instructor and GTA assistance may be limited after the end of an academic semester due to holidays and other obligations
  6. If a student fails to resolve an incomplete grade after 6 weeks, they will be assigned an ‘F’ in the course. In addition, they will be dropped from any other Computational Core courses which require the failed course as a prerequisite or corequisite.

Authorized Aid

All graded work is individual effort. You are authorized to use:

course’s materials, direct web-links from this course the appropriate languages documentation (https://docs.python.org/3/ or https://docs.oracle.com/javase/Links to an external site.) Email help received through 315 help email, CC - Instructors, GTAs Zoom/In-person help received from Instructors or GTA ACM help session (an on campus only resource) Most Tuesdays in EH 1116, 6:30PM. Tutors from the Academic Assistance Center or provided by K-State Athletics Use of on-line solutions whether for reference or code is prohibited. Use of previous semester’s answers, whether your own or another student’s is prohibited. Use of code-completion/suggestion tool’s, other than those we have installed in the Codio editor, is prohibitied.

Work Load

CC-315 is a three-hour course; most students find it more time consuming than CC-310.

Modules have been assigned so that the average amount of work (hours), based on past student performance, is somewhat consistent.

You may work ahead, but modules must be worked strictly in order. Also, if you are far ahead, it may take a bit longer to get help. We review the lessons every week so the support you get is not ’too advanced’; if you are pretty far ahead, we will need to ‘catch up’ to where you are to provide support.

Please note, modules are not equally weighted, some are worth far more than others.

To participate in this course, students must have access to a modern web browser and broadband internet connection. All course materials will be provided via Canvas and Codio. Modules may also contain links to external resources for additional information, such as programming language documentation.

Subject to Change

The details in this syllabus are not set in stone. Due to the flexible nature of this class, adjustments may need to be made as the semester progresses, though they will be kept to a minimum. If any changes occur, the changes will be posted on the Canvas page for this course and emailed to all students.

Standard Syllabus Statements

Info

The statements below are standard syllabus statements from K-State and our program. The latest versions are available online here .

Academic Honesty

Kansas State University has an Honor and Integrity System based on personal integrity, which is presumed to be sufficient assurance that, in academic matters, one’s work is performed honestly and without unauthorized assistance. Undergraduate and graduate students, by registration, acknowledge the jurisdiction of the Honor and Integrity System. The policies and procedures of the Honor and Integrity System apply to all full and part-time students enrolled in undergraduate and graduate courses on-campus, off-campus, and via distance learning. A component vital to the Honor and Integrity System is the inclusion of the Honor Pledge which applies to all assignments, examinations, or other course work undertaken by students. The Honor Pledge is implied, whether or not it is stated: “On my honor, as a student, I have neither given nor received unauthorized aid on this academic work.” A grade of XF can result from a breach of academic honesty. The F indicates failure in the course; the X indicates the reason is an Honor Pledge violation.

For this course, a violation of the Honor Pledge will result in sanctions such as a 0 on the assignment or an XF in the course, depending on severity. Actively seeking unauthorized aid, such as posting lab assignments on sites such as Chegg or StackOverflow, or asking another person to complete your work, even if unsuccessful, will result in an immediate XF in the course.

This course assumes that all your course work will be done by you. Use of AI text and code generators such as ChatGPT and GitHub Copilot in any submission for this course is strictly forbidden unless explicitly allowed by your instructor. Any unauthorized use of these tools without proper attribution is a violation of the K-State Honor Pledge .

We reserve the right to use various platforms that can perform automatic plagiarism detection by tracking changes made to files and comparing submitted projects against other students’ submissions and known solutions. That information may be used to determine if plagiarism has taken place.

Students with Disabilities

At K-State it is important that every student has access to course content and the means to demonstrate course mastery. Students with disabilities may benefit from services including accommodations provided by the Student Access Center. Disabilities can include physical, learning, executive functions, and mental health. You may register at the Student Access Center or to learn more contact:

Students already registered with the Student Access Center please request your Letters of Accommodation early in the semester to provide adequate time to arrange your approved academic accommodations. Once SAC approves your Letter of Accommodation it will be e-mailed to you, and your instructor(s) for this course. Please follow up with your instructor to discuss how best to implement the approved accommodations.

Expectations for Conduct

All student activities in the University, including this course, are governed by the Student Judicial Conduct Code as outlined in the Student Governing Association By Laws , Article V, Section 3, number 2. Students who engage in behavior that disrupts the learning environment may be asked to leave the class.

Mutual Respect and Inclusion in K-State Teaching & Learning Spaces

At K-State, faculty and staff are committed to creating and maintaining an inclusive and supportive learning environment for students from diverse backgrounds and perspectives. K-State courses, labs, and other virtual and physical learning spaces promote equitable opportunity to learn, participate, contribute, and succeed, regardless of age, race, color, ethnicity, nationality, genetic information, ancestry, disability, socioeconomic status, military or veteran status, immigration status, Indigenous identity, gender identity, gender expression, sexuality, religion, culture, as well as other social identities.

Faculty and staff are committed to promoting equity and believe the success of an inclusive learning environment relies on the participation, support, and understanding of all students. Students are encouraged to share their views and lived experiences as they relate to the course or their course experience, while recognizing they are doing so in a learning environment in which all are expected to engage with respect to honor the rights, safety, and dignity of others in keeping with the K-State Principles of Community .

If you feel uncomfortable because of comments or behavior encountered in this class, you may bring it to the attention of your instructor, advisors, and/or mentors. If you have questions about how to proceed with a confidential process to resolve concerns, please contact the Student Ombudsperson Office . Violations of the student code of conduct can be reported using the Code of Conduct Reporting Form . You can also report discrimination, harassment or sexual harassment , if needed.

Netiquette

Info

This is our personal policy and not a required syllabus statement from K-State. It has been adapted from this statement from K-State Global Campus, and theRecurse Center Manual . We have adapted their ideas to fit this course.

Online communication is inherently different than in-person communication. When speaking in person, many times we can take advantage of the context and body language of the person speaking to better understand what the speaker means, not just what is said. This information is not present when communicating online, so we must be much more careful about what we say and how we say it in order to get our meaning across.

Here are a few general rules to help us all communicate online in this course, especially while using tools such as Canvas or Discord:

  • Use a clear and meaningful subject line to announce your topic. Subject lines such as “Question” or “Problem” are not helpful. Subjects such as “Logic Question in Project 5, Part 1 in Java” or “Unexpected Exception when Opening Text File in Python” give plenty of information about your topic.
  • Use only one topic per message. If you have multiple topics, post multiple messages so each one can be discussed independently.
  • Be thorough, concise, and to the point. Ideally, each message should be a page or less.
  • Include exact error messages, code snippets, or screenshots, as well as any previous steps taken to fix the problem. It is much easier to solve a problem when the exact error message or screenshot is provided. If we know what you’ve tried so far, we can get to the root cause of the issue more quickly.
  • Consider carefully what you write before you post it. Once a message is posted, it becomes part of the permanent record of the course and can easily be found by others.
  • If you are lost, don’t know an answer, or don’t understand something, speak up! Email and Canvas both allow you to send a message privately to the instructors, so other students won’t see that you asked a question. Don’t be afraid to ask questions anytime, as you can choose to do so without any fear of being identified by your fellow students.
  • Class discussions are confidential. Do not share information from the course with anyone outside of the course without explicit permission.
  • Do not quote entire message chains; only include the relevant parts. When replying to a previous message, only quote the relevant lines in your response.
  • Do not use all caps. It makes it look like you are shouting. Use appropriate text markup (bold, italics, etc.) to highlight a point if needed.
  • No feigning surprise. If someone asks a question, saying things like “I can’t believe you don’t know that!” are not helpful, and only serve to make that person feel bad.
  • No “well-actually’s.” If someone makes a statement that is not entirely correct, resist the urge to offer a “well, actually…” correction, especially if it is not relevant to the discussion. If you can help solve their problem, feel free to provide correct information, but don’t post a correction just for the sake of being correct.
  • Do not correct someone’s grammar or spelling. Again, it is not helpful, and only serves to make that person feel bad. If there is a genuine mistake that may affect the meaning of the post, please contact the person privately or let the instructors know privately so it can be resolved.
  • Avoid subtle -isms and microaggressions. Avoid comments that could make others feel uncomfortable based on their personal identity. See the syllabus section on Diversity and Inclusion above for more information on this topic. If a comment makes you uncomfortable, please contact the instructor.
  • Avoid sarcasm, flaming, advertisements, lingo, trolling, doxxing, and other bad online habits. They have no place in an academic environment. Tasteful humor is fine, but sarcasm can be misunderstood.

As a participant in course discussions, you should also strive to honor the diversity of your classmates by adhering to the K-State Principles of Community .

Discrimination, Harassment, and Sexual Harassment

Kansas State University is committed to maintaining academic, housing, and work environments that are free of discrimination, harassment, and sexual harassment. Instructors support the University’s commitment by creating a safe learning environment during this course, free of conduct that would interfere with your academic opportunities. Instructors also have a duty to report any behavior they become aware of that potentially violates the University’s policy prohibiting discrimination, harassment, and sexual harassment, as outlined by PPM 3010 .

If a student is subjected to discrimination, harassment, or sexual harassment, they are encouraged to make a non-confidential report to the University’s Office for Institutional Equity (OIE) using the online reporting form . Incident disclosure is not required to receive resources at K-State. Reports that include domestic and dating violence, sexual assault, or stalking, should be considered for reporting by the complainant to the Kansas State University Police Department or the Riley County Police Department . Reports made to law enforcement are separate from reports made to OIE. A complainant can choose to report to one or both entities. Confidential support and advocacy can be found with the K-State Center for Advocacy, Response, and Education (CARE) . Confidential mental health services can be found with Lafene Counseling and Psychological Services (CAPS) . Academic support can be found with the Office of Student Life (OSL) . OSL is a non-confidential resource. OIE also provides a comprehensive list of resources on their website. If you have questions about non-confidential and confidential resources, please contact OIE at equity@ksu.edu or (785) 532–6220.

Academic Freedom Statement

Kansas State University is a community of students, faculty, and staff who work together to discover new knowledge, create new ideas, and share the results of their scholarly inquiry with the wider public. Although new ideas or research results may be controversial or challenge established views, the health and growth of any society requires frank intellectual exchange. Academic freedom protects this type of free exchange and is thus essential to any university’s mission.

Moreover, academic freedom supports collaborative work in the pursuit of truth and the dissemination of knowledge in an environment of inquiry, respectful debate, and professionalism. Academic freedom is not limited to the classroom or to scientific and scholarly research, but extends to the life of the university as well as to larger social and political questions. It is the right and responsibility of the university community to engage with such issues.

Campus Safety

Kansas State University is committed to providing a safe teaching and learning environment for student and faculty members. In order to enhance your safety in the unlikely case of a campus emergency make sure that you know where and how to quickly exit your classroom and how to follow any emergency directives. Current Campus Emergency Information is available at the University’s Advisory webpage.

Student Resources

K-State has many resources to help contribute to student success. These resources include accommodations for academics, paying for college, student life, health and safety, and others. Check out the Student Guide to Help and Resources: One Stop Shop for more information.

Student Academic Creations

Student academic creations are subject to Kansas State University and Kansas Board of Regents Intellectual Property Policies. For courses in which students will be creating intellectual property, the K-State policy can be found at University Handbook, Appendix R: Intellectual Property Policy and Institutional Procedures (part I.E.) . These policies address ownership and use of student academic creations.

Mental Health

Your mental health and good relationships are vital to your overall well-being. Symptoms of mental health issues may include excessive sadness or worry, thoughts of death or self-harm, inability to concentrate, lack of motivation, or substance abuse. Although problems can occur anytime for anyone, you should pay extra attention to your mental health if you are feeling academic or financial stress, discrimination, or have experienced a traumatic event, such as loss of a friend or family member, sexual assault or other physical or emotional abuse.

If you are struggling with these issues, do not wait to seek assistance.

For Kansas State Salina Campus:

For Global Campus/K-State Online:

  • K-State Online students have free access to mental health counseling with My SSP - 24/7 support via chat and phone.
  • The Office of Student Life can direct you to additional resources.

University Excused Absences

K-State has a University Excused Absence policy (Section F62) . Class absence(s) will be handled between the instructor and the student unless there are other university offices involved. For university excused absences, instructors shall provide the student the opportunity to make up missed assignments, activities, and/or attendance specific points that contribute to the course grade, unless they decide to excuse those missed assignments from the student’s course grade. Please see the policy for a complete list of university excused absences and how to obtain one. Students are encouraged to contact their instructor regarding their absences.

© The materials in this online course fall under the protection of all intellectual property, copyright and trademark laws of the U.S. The digital materials included here come with the legal permissions and releases of the copyright holders. These course materials should be used for educational purposes only; the contents should not be distributed electronically or otherwise beyond the confines of this online course. The URLs listed here do not suggest endorsement of either the site owners or the contents found at the sites. Likewise, mentioned brands (products and services) do not suggest endorsement. Students own copyright to what they create.

Subsections of CC 315 Syllabus

Welcome!

This page is the main page for the Strings and StringBuilders Section

Subsections of Strings and StringBuilders

Welcome!

This page is the main page for the Strings and StringBuilders chapter

Subsections of Strings and StringBuilders

Introduction

YouTube Video

Video Materials

In CC310 we covered various data structures: stacks, sets, lists, queues, and hash tables. When we looked at these structures, we considered how to access elements within the structures, how we would create our own implementation of the structure, and tasks that these structures would be fitting for as well as ill fitting. Throughout this course we will introduce and implement a variety of data structures as we did in CC310.

We begin this course with an often overlooked structure: strings. By the end of this chapter, we will understand how strings are data structures, how we access elements, what types of tasks are appropriate for strings, and how we can improve on strings in our code.

Motivation

In many data science positions, programmers often work with text-based data. Some examples of text-based data include biology for DNA sequencing, social media for sentiment classification, online libraries for citation networks, and many other types of businesses for data analytics. Currently, strings are often used for word embeddings, which determine how similar or dissimilar words are to one another. An industry leading software for this application is Tensorflow for Python, which generated the image below.

Word Embeddings Near ‘Python’ Word Embeddings Near ‘Python’

In an immense oversimplification, the process used for word embeddings is to read in a large amount of text and then use machine learning algorithms to determine similarity by using the words around each word. This impacts general users like us in search engines, streaming services, dating applications, and much more! For example, if you were to search Python topics in your search results may appear referring to the coding language, the reptile, the comedy troupe, and many more. When we use machine learning to determine word meanings, it is important that the data is first parsed correctly and stored in efficient ways so that we can access elements as needed. Understanding how to work with strings and problems that can arise with them is important to utilizing text successfully.

Reference: https://projector.tensorflow.org/

Subsections of Introduction

Theory

YouTube Video

Video Materials

What are Strings?

Strings are data structures which store an ordered set of characters. Recall that a character can be a: letter, number, symbol, punctuation mark, or white space. Strings can contain any number and any combination of these. As such, strings can be single characters, words, sentences, and even more.

How do we work with Strings?

Let’s refresh ourselves on how strings work, starting with the example string: s = "Go Cats!".

Character G o C a t s !
Index 0 1 2 3 4 5 6 7

As with other data structures, we can access elements within the string itself. Using the example above, s[0] would contain the character ‘G’, s[1] would contain the character ‘o’, s[2] would contain the character ’ ‘, and so on.

We can determine the size of a string by using length functions; in our example, the length of s would be 8. It is also important to note that when dealing with strings, null is not equivalent to “”. For string s = "", the length of s would be 0. However for string s = null, accessing the length of s would return an error that null has no length property.

We can also add to strings or append on a surface level; though it is not that simple behind the scenes. The String class is immutable. This means that changes will not happen directly to the string; when appending or inserting, code will create a new string to hold that value. More concisely, the state of an immutable object cannot change.

We cannot assign elements of the string directly and more broadly for any immutable object. For example, if we wanted the string from our example to be ‘Go Cat!!’, we cannot assign the element through s[6] = '!'. This would result in an item assignment error.

Info

Try it!

For an example, consider string s = ‘abc’. If we then state in code s = s + ‘123’, this will create a new place in memory for the new definition of s. We can verify this in code by using the id() function.

string s = 'abc'
id(s)
Output: 140240213073680 #may be different on your personal device

string s = s + '123'
id(s)
Output: 140239945470168 

While on the surface it appears that we are working with the same variable, our code will actually refer to a different one. There are many other immutable data types as well as mutable data types.

Info

Mutable vs. Immutable

On the topic of immutable, we can also discuss the converse: mutable objects. Being a mutable object means that the state of the object can change. In CC310, we often worked with arrays to implement various data structures. Arrays are mutable, so as we add, change, or remove elements from an array, the array changes its state to accommodate the change rather than creating a new location in memory.

Data Type Immutable?
Lists
Sets
Byte Arrays
Dictionaries
Strings
Ints
Floats
Booleans

Reference: http://people.cs.ksu.edu/~rhowell/DataStructures/strings/strings.html

Subsections of Theory

Memory Example

Consider the following block of pseudocode:

1. function APPENDER(NUMBER, BASE)
2.     RESULT = ""
3.     loop I from 1 to NUMBER
4.         RESULT = RESULT + BASE
5.         if I MOD 2 = 0
6.             RESULT = RESULT + " "
7.         else
8.             RESULT = RESULT + ", " 
9.     end loop
10.    return RESULT
11. end function

Lets step through the function call with APPENDER(4,'abc') and analyze the memory that the code takes.

Info

Recall that strings are reference variables. As such, string variables hold pointers to values and the value is stored in memory. For the following example, the HEAP refers to what is currently stored in memory and VARIABLE shows the current value of the variable RESULT.

  • Initialization: In line two, we initialize RESULT as an empty string. In the heap, we have the empty string at memory location 0x1. Thus, RESULT is holding the pointer 0x1. Initialize Initialize

  • I = 1: Now we have entered the loop and on line 4, we add more characters to our string. At this point, we would have entry 0x2 in our heap and our variable RESULT would have the pointer 0x2. Continuing through the code, line 5 determines if I is divisible by 2. In this iteration I = 1, so we take the else branch. We again add characters to our string, resulting in a new entry in 0x3 and our variable RESULT containing the pointer 0x3. In total, we have written 8 characters. We then increment I and move to the next iteration. After 1st loop After 1st loop

  • I = 2: We continue the loop and on line 4, we add more characters to our string. At this point, we would have entry 0x4 in our heap and our variable RESULT would have the pointer 0x4. Continuing through the code, line 5 determines if I is divisible by 2. In this iteration I = 2, so we take the if branch. We again add characters to our string, resulting in a new entry in 0x5 and our variable RESULT containing the pointer 0x5. In this iteration, we have written 17 characters. We then increment I and move to the next iteration of the loop. After 2nd loop After 2nd loop

  • I = 3: We continue the loop and on line 4, we add more characters to our string. At this point, we would have entry 0x6 in our heap and our variable RESULT would have the pointer 0x6. Continuing through the code, line 5 determines if I is divisible by 2. In this iteration I = 3, so we take the if branch. We again add characters to our string, resulting in a new entry in 0x7 and our variable RESULT containing the pointer 0x7. In this iteration, we have written 26 characters. We then increment I and thus I = 4 breaking out of the loop. After 3rd loop After 3rd loop

We can do some further analysis of the memory that is required for this particular block.

Iteration Memory Entries Total Character Copies
1 3 8
2 5 8 + 17 = 25
3 7 25 + 26 = 51
4 9 51 + 35 = 86
. . .
n 2n + 1 (9n2 + 7n)/2

You need not worry about creating the equations! Based on this generalization, if the user wanted to do 100K iterations, say for gene sequencing, there would be (2x100,000 - 1) = 200,001 memory entries and (9x100,0002 + 7x100,000)/2 = 45 billion character copies. This behavior is not exclusive to strings; this will occur for any immutable type.

While this example is contrived, it is not too far off the mark. Another caveat to this analysis is that, depending on our programming language, there will be a periodic ‘memory collection’; there wont be 200K memory addresses occupied at one time. Writing to memory in this way can be costly in terms of time, which in industry is money.

Limitations Java

YouTube Video

Video Materials

As a result of being immutable, strings can be cumbersome to work with in certain applications. When long strings or strings that we are continually appending to, such as in the memory example, we end up creating a lot of sizable copies.

Recall from the memory example the block of pseudocode.

1. function APPENDER(NUMBER, BASE)
2.     RESULT = ""
3.     loop I from 1 to NUMBER
4.         RESULT = RESULT + BASE
5.         if I MOD 2 = 0
6.             RESULT = RESULT + " "
7.         else
8.             RESULT = RESULT + ", " 
9.     end loop
10.    return RESULT
11. end function

In this example, what if we changed RESULT to a mutable type, say a list of strings for Python or a StringBuilder in Java. Once the loop is done, we can cast RESULT to a string. By changing just the one aspect of the code, we would make only one copy of RESULT and have far less character copies.

Info

Java specifically has a StringBuilder class which was created for this precise reason.

Consider the following, and note the slight changes on lines 2, 4, 6, 8 and the additional line 10.

1. function APPENDER_LIST(NUMBER, BASE)
2.     RESULT = []
3.     loop I from 1 to NUMBER
4.         RESULT.APPEND(BASE) 
5.         if I MOD 2 = 0
6.             RESULT.APPEND(" ") 
7.         else
8.             RESULT.APPEND(", ")  
9.     end loop
10.    RESULT = "".JOIN(RESULT)
11.    return RESULT
12. end function

Now consider APPENDER_LIST(4,‘abc’)

  • Initialization: We start by initializing the empty array. RESULT will hold the pointer 0x1. Initialize Initialize

  • I = 1: Now we have entered the loop and on line 4, we add more characters to our array. At this point, we would have only entry 0x1 in our heap and our variable RESULT would have the pointer 0x1. Continuing through the code, line 5 determines if I is divisible by 2. In this iteration I = 1, so we take the else branch. We again add characters to our array. In total, we have written 5 characters. We then increment I and move to the next iteration.
    After 1st Loop After 1st Loop

  • I = 2: We continue the loop and on line 4, we add more characters to our array. We still have just one entry in memory and our pointer is still 0x1. In this iteration, we have written 4 characters. We then increment I and move to the next iteration of the loop. After 2nd Loop After 2nd Loop

  • I = 3: We continue the loop and on line 4, we add more characters to our array. In this iteration, we have written 5 characters. We then increment I and thus I = 4 breaking out of the loop. After 3rd Loop After 3rd Loop

  • Post-Loop: Once the loop breaks, we join the array to create the final string. This creates a new place in memory and changes RESULT to contain the pointer 0x2. After Loop Breaks After Loop Breaks

We can do some further analysis of the memory that is required for this particular block.

Iteration Memory Entries Character Copies
1 2 8
2 2 17
3 2 26
4 2 35
. . .
n 2 9n - 1

Again, you need not worry about creating these equations for this course. To illustrate the improvement even more explicitly, let’s consider our previous example with 100K iterations. For APPENDER there were (2x100,000 - 1) = 200,001 memory entries and (9x100,0002 + 7x100,000)/2 = 45 billion character copies. For APPENDER_LIST we now have just 2 memory entries and (9x100,000 - 1) = 899,999 character copies. This dramatic improvement was a result of changing our data structure ever so slightly.

Reference: http://people.cs.ksu.edu/~rhowell/DataStructures/strings/stringbuilders.html

Subsections of Limitations Java

Limitations Python

YouTube Video

Video Materials

As a result of being immutable, strings can be cumbersome to work with in certain applications. When working with long strings or strings that we are continually appending to, such as in the memory example, we end up creating a lot of sizable copies.

Recall from the memory example the block of pseudocode.

1. function APPENDER(NUMBER, BASE)
2.     RESULT = ""
3.     loop I from 1 to NUMBER
4.         RESULT = RESULT + BASE
5.         if I MOD 2 = 0
6.             RESULT = RESULT + " "
7.         else
8.             RESULT = RESULT + ", " 
9.     end loop
10.    return RESULT
11. end function

In this example, what if we changed RESULT to a mutable type, such as a StringBuilder in Java. Once the loop is done, we can cast RESULT to a string. By changing just the one aspect of the code, we would make only one copy of RESULT and have far less character copies.

Info

Java specifically has a StringBuilder class which was created for this precise reason.

Consider the following, and note the slight changes on lines 2, 4, 6, 8 and the additional line 10.

1. function APPENDER_LIST(NUMBER, BASE)
2.     RESULT = []
3.     loop I from 1 to NUMBER
4.         RESULT.APPEND(BASE) 
5.         if I MOD 2 = 0
6.             RESULT.APPEND(" ") 
7.         else
8.             RESULT.APPEND(", ")  
9.     end loop
10.    RESULT = "".JOIN(RESULT)
11.    return RESULT
12. end function

Now consider APPENDER_LIST(4,‘abc’)

  • Initialization: We start by initializing the empty array. RESULT will hold the pointer 0x1. Initialize Initialize

  • I = 1: Now we have entered the loop and on line 4, we add more characters to our array. At this point, we would have only entry 0x1 in our heap and our variable RESULT would have the pointer 0x1. Continuing through the code, line 5 determines if I is divisible by 2. In this iteration I = 1, so we take the else branch. We again add characters to our array. In total, we have written 5 characters. We then increment I and move to the next iteration.
    After 1st Loop After 1st Loop

  • I = 2: We continue the loop and on line 4, we add more characters to our array. We still have just one entry in memory and our pointer is still 0x1. In this iteration, we have written 4 characters. We then increment I and move to the next iteration of the loop. After 2nd Loop After 2nd Loop

  • I = 3: We continue the loop and on line 4, we add more characters to our array. In this iteration, we have written 5 characters. We then increment I and thus I = 4 breaking out of the loop. After 3rd Loop After 3rd Loop

  • Post-Loop: Once the loop breaks, we join the array to create the final string. This creates a new place in memory and changes RESULT to contain the pointer 0x2. After Loop Breaks After Loop Breaks

We can do some further analysis of the memory that is required for this particular block.

Iteration Memory Entries Character Copies
1 2 8
2 2 17
3 2 26
4 2 35
. . .
n 2 9n - 1

Again, you need not worry about creating these equations for this course. To illustrate the improvement even more explicitly, let’s consider our previous example with 100K iterations. For APPENDER there were (2x100,000 - 1) = 200,001 memory entries and (9x100,0002 + 7x100,000)/2 = 45 billion character copies. For APPENDER_LIST we now have just 2 memory entries and (9x100,000 - 1) = 899,999 character copies. This dramatic improvement was a result of changing our data structure ever so slightly.

Reference: http://people.cs.ksu.edu/~rhowell/DataStructures/strings/stringbuilders.html

Subsections of Limitations Python

Summary

YouTube Video

Video Materials

To start this course, we have looked into strings. They are a very natural way to represent data, especially in real world applications. Often though, the datapoints can be very large and require multiple modifications. We also examined how strings work: element access, retrieving the size, and modifying them. We looked into some alternatives which included StringBuilders for Java and character arrays for Python.

To really understand this point, we have included a comparison. We have implemented the APPENDER and APPENDER_LIST functions in both Python and Java. For the Java implementation, we utilized StringBuilders.

1. function APPENDER(NUMBER, BASE)
2.     RESULT = ""
3.     loop I from 1 to NUMBER
4.         RESULT = RESULT + BASE
5.         if I MOD 2 = 0
6.             RESULT = RESULT + " "
7.         else
8.             RESULT = RESULT + ", " 
9.     end loop
10.    return RESULT
11. end function
1. function APPENDER_LIST(NUMBER, BASE)
2.     RESULT = []
3.     loop I from 1 to NUMBER
4.         RESULT.APPEND(BASE) 
5.         if I MOD 2 = 0
6.             RESULT.APPEND(" ") 
7.         else
8.             RESULT.APPEND(", ")  
9.     end loop
10.    RESULT = "".JOIN(RESULT)
11.    return RESULT
12. end function
Info

For the tests of 108 and 109 in Java, the string implementation took over 24 hours and the StringBuilder implementation ran out of memory. For these reasons, they are omitted from the figure.

Python Time Python Time Java Time Java Time

These figures compare Strings and lists for Python and Strings and StringBuilders for Java. The intention of these is not to compare Python and Java.

In both languages, we see that the string function and the respective alternative performed comparably until approximately 106 (1,000,000 characters). Again, these are somewhat contrived examples with the intention of understanding side effects of using strings.

As we have discussed, modern coding languages will have clean up protocols and memory management strategies. With the intention of this class in mind, we will not discuss the memory analysis in practice.

When modifying strings we need to be cognizant of how often we will be making changes and how large those changes will be. If we are just accessing particular elements or only doing a few modifications then using plain strings is a reasonable solution. However, if we are looking to build our own DNA sequence this is not a good way to go as strings are immutable.

Subsections of Summary

Welcome!

This page is the main page for the Trees Section. This section has chapters which cover:

  1. Basic Trees
  2. Recursive Trees
  3. Tries
  4. Binary Trees

Subsections of Trees

Welcome!

This page is the main page for Introduction to Trees

Subsections of Introduction to Trees

Introduction

YouTube Video

Video Materials

For the next data structure in the course, we will cover trees, which are used to show hierarchical data. Trees can have many shapes and sizes and there is a wide variety of data that can be organized using them. Real world data that is appropriate for trees can include: family trees, management structures, file systems, biological classifications, anatomical structures and much more.


We can look at an example of a tree and the relationships they can show. Consider this file tree; it has folders and files in folders. File Tree File Tree

If we wanted to access the file elm.txt, we would have to follow this file path: Foo/B/Q/E/elm.txt. We can similarly store the file structure as a tree like we below. As before, if we wanted to get to the file elm.txt we would navigate the tree in the order: Foo -> B -> Q -> E -> elm.txt. As mentioned before, trees can be used on very diverse data sets; they are not limited to file trees! File Tree as Tree File Tree as Tree

What are trees?

In the last module we talked about strings which are a linear data structure. To be explicit, this means that the elements in a string form a line of characters. A tree, by contrast, is a hierarchal structure which is utilized best in multidimensional data. Going back to our file tree example, folders are not limited to just one file, there can be multiple files contained in a single folder- thus making it multidimensional.

Consider the string “abc123”; this is a linear piece of data where there is exactly one character after another. We can use trees to show linear data as well. String as Tree String as Tree

While trees can be used for linear data, it would be excessive and inefficient to implement them for single strings. In an upcoming module, we will see how we can use trees to represent any number of strings! For example, this tree below contains 7 words: ‘a’, ‘an’, ‘and’, ‘ant’, ‘any’, ‘all’, and ‘alp’. Small Words Small Words

In the next sections, we will discuss the properties of a tree data structure and how we would design one ourselves. Once we have a good understanding of trees and the properties of trees, we will implement our own.

Subsections of Introduction

General Terms

YouTube Video

Video Materials

To get ourselves comfortable in working with trees, we will outline some standard vocabulary. Throughout this section, we will use the following tree as a guiding example for visualizing the definitions.
Blank Blank

Definitions

  • Node - the general term for a structure which contains an item, such as a character or even another data structure.
  • Edge - the connection between two nodes. In a tree, the edge will be pointing in a downward direction.

Nodes and Edges Nodes and Edges This tree has five edges and six nodes. There is no limit to the number of nodes in a tree. The only stipulation is that the tree is fully connected. This means that there cannot be disjoint portions of the tree. We will look at examples in the next section.

Info

A rule of thumb for discerning trees is this: if you imagine holding the tree up by the root and gravity took effect, then all edges must be pointing downward. If an edge is pointing upward, we will have a cycle within our structure so it will not be a tree.


  • Root - the topmost node of the tree

Root Root To be a tree, there must be exactly one root. Having multiple roots, will result in a cycle or a tree that is not fully connected. In short, a cycle is a loop in the tree.


  • Parent - a node with an edge that connects to another node further from the root. We can also define the root of a tree with respect to this definition; Root: a node with no parent.
  • Child - a node with an edge that connects to another node closer to the root.
    • In a general tree, the children of a node are an unordered set. There is not a fixed or defined order for generic trees. Parent and Child Parent and Child Parent and Child Parent and Child In a tree, child nodes must have exactly one parent node. If a child node has more than one parent, then a cycle will occur. If there is a node without a parent node, then this is necessarily the root node. There is no limit to the number of child nodes a parent node can have, but to be a parent node, the node must have at least one child node.

  • Leaf - a node with no children. Leaf Leaf This tree has four leaves. There is no limit to how many leaves can be in a tree.

  • Degree
    • Degree of a node - the number of children a node has. The degree of a leaf is zero.
    • Degree of a tree - the number of children the root of the tree has. Degree Degree The degree of the nodes are shown as the values on the nodes in this example. The degree of the tree is equal to the degree of the root. Thus, the degree for this tree is 2.

  • A tree is defined recursively. This means that each child of the root is the root of another tree and the children of those are roots to trees as well. Again, this is a recursive definition so it will continue to the leaves. The leaves are also trees with just a single node as the root.
    Subtrees Subtrees In our example tree, we have six trees. Each tree is outlined in a red dashed circle:
    • the main tree,
      • the tree off the left of the main root,
        • the tree off the left of this root,
        • the tree in the center of this root,
        • the tree off the right of this root, and
      • the tree off the right of the main root with a single node

Subsections of General Terms

What Makes a Tree a Tree

YouTube Video

Video Materials

Trees can come in many shapes and sizes. There are, however some constraints to making a valid tree.

  • A tree has a single root
  • A child has exactly one parent
  • A tree is fully connected (IE a single tree)
  • A tree has no cycles (IE no loops)

Valid Trees

Any combination of the following represents a valid tree:

  • A tree with a single node; just a root, Single Single

  • A tree where each node has a single child, or Linear Linear

  • A tree where nodes have many children. Linear Linear

Invalid Trees

Below are some examples of invalid trees.

  • A cycle Cycle Cycle Again, cycles are essentially loops that occur in our tree. In this example, we see that our leaf has two parents. One way to determine whether your data structure has a cycle is if there is more than one way to get from the root to any node.

  • A cycle Cycle Cycle Here we can see another cycle. In this case, the node immediately after the root has two parents, which is a clue that a cycle exists. Another test

  • Two Roots Two Roots Two Roots Trees must have a single root. In this instance, it may look like we have a tree with two roots. Working through this, we also see that the node in the center has two parents.

  • Two Trees Double Tree Double Tree This example would be considered two trees, not a tree with two parts. In this figure, we have two fully connected components. Since they are not connected to each other, this is not a single tree.

Subsections of What Makes a Tree a Tree

MyTree I

YouTube Video

Video Materials

Along with understanding how trees work, we want to also be able to implement a tree of our own. We will now outline key components of a tree class.


MyTree

Recall that trees are defined recursively so we can build them from the leaves up where each leaf is a tree itself. Each tree will have three properties: the item it contains as an object, its parent node of type MyTree, and its children as a list of MyTrees. Upon instantiation of a new MyTree, we will set the item value and initialize the parent node to None and the children to an empty list of type MyTree. UML UML

Suppose that we wanted to construct the following tree. Tree Tree We would start by initializing each node as a tree with no parent, no children, and the item in this instance would be the characters. Then we build it up level by level by add the appropriate children to the respective parent.

Info

Disclaimer: This implementation will not prevent all cycles. In the next module, we will introduce steps to prevent cycles and maintain true tree structures.


Finding a child

In this method, we will take a value as input and then check if that value is the item of a child of the current node. If the value is not the item for any of the node’s children then we should return none.

function FINDCHILD(VALUE)
    FOR CHILD in CHILDREN
        IF CHILD's ITEM is VALUE
            return  CHILD
    return NONE
end function

Getting children, item, parent, or degree

Each of these will be rather straight forward; children, item, and parent are all attributes of our node, so we can have a getter function that returns the respective values. The slightly more involved task will be getting the degree. Recall that the degree of a node is equal to the number of children. Thus, we can simply count the number of children and return this number for the degree.


Checking node type

We will have two functions to check the node type: one to determine if the node is a leaf and one to determine if it is a root. The definition of a leaf is a node that has no children. Thus, to check if a node is a leaf, we can simply check if the number of children is equal to zero. Similarly, since the definition of a root is a node with no parent, we can check that the parent attribute of the node is None.


Adding a child

YouTube Video

When we wish to add a child, we must fisrt make sure we are able to add the child.

  1. Check that the child is an instance of MyTree
  2. Make sure the child doesn’t already have a parent
  3. Make sure the child isn’t already a child of the parent

We will return true if the child was successfully added and false otherwise while raising the appropriate errors.

function ADDCHILD(CHILD)
    IF CHILD has PARENT
        throw exception
    IF CHILD is CHILD of PARENT
        return FALSE
    ELSE
        append CHILD to PARENT's children
        set CHILD's parent to PARENT
        return TRUE
end function

As an example, lets walk through the process of building the tree above:

  1. Instantiate MyTree a with item ‘A’
  2. Instantiate MyTree b with item ‘B’
  3. Instantiate MyTree c with item ‘C’
  4. Instantiate MyTree d with item ‘D’
  5. Instantiate MyTree e with item ‘E’
  6. Instantiate MyTree f with item ‘F’
  7. Instantiate MyTree g with item ‘G’
  8. Instantiate MyTree h with item ‘H’
  9. Instantiate MyTree i with item ‘I’
  10. Add child tree g to tree d
  11. Add child tree h to tree d
  12. Add child tree i to tree d
  13. Add child tree e to tree b
  14. Add child tree f to tree b

Once we have completed that, visually, we would have the tree above and in code we would have:

  • MyTree a with parent_node = None, item = ‘A’, children = {b,c,d}
  • MyTree b with parent_node = a, item = ‘B’, children = {e,f}
  • MyTree c with parent_node = a, item = ‘C’, children = { }
  • MyTree d with parent_node = a, item = ‘D’, children = {g,h,i}
  • MyTree e with parent_node = b, item = ‘E’, children = { }
  • MyTree f with parent_node = b, item = ‘F’, children = { }
  • MyTree g with parent_node = d, item = ‘G’, children = { }
  • MyTree h with parent_node = d, item = ‘H’, children = { }
  • MyTree i with parent_node = d, item = ‘I’, children = { }

Note: When adding a child we must currently be at the node we want to be the parent. Much like when you want to add a file to a folder, you must specify exactly where you want it. If you don’t, this could result in a wayward child.


Removing a child

YouTube Video

In the case of removing a child, we first need to check that the child we are attempting to remove is an instance of MyTree. We will return true if we successfully remove the child and false otherwise.

function REMOVECHILD(CHILD)
    IF CHILD in PARENT'S children
        REMOVE CHILD from PARENT's children
        SET CHILD's PARENT to NONE
        return TRUE
    ELSE
        return FALSE
end function

As with adding a child, we need to ensure that we are in the ‘right place’ when attempting to remove a child. When removing a child, we are not ’erasing’ it, we are just cutting the tie from parent to child and child to parent. Consider removing d from a. Visually, we would have two disjoint trees, shown below: Tree 2 Tree 2

In code, we would have:

  • MyTree a with parent_node = None, item = ‘A’, children = {b,c}
  • MyTree b with parent_node = a, item = ‘B’, children = {e,f}
  • MyTree c with parent_node = a, item = ‘C’, children = { }
  • MyTree d with parent_node = None, item = ‘D’, children = {g,h,i}
  • MyTree e with parent_node = b, item = ‘E’, children = { }
  • MyTree f with parent_node = b, item = ‘F’, children = { }
  • MyTree g with parent_node = d, item = ‘G’, children = { }
  • MyTree h with parent_node = d, item = ‘H’, children = { }
  • MyTree i with parent_node = d, item = ‘I’, children = { }

Subsections of MyTree I

Summary

In this module we have introduce vocabulary related to trees and what makes a tree a tree. To recap, we have introduced the following:

  • Child - a node with an edge that connects to another node closer to the root.
  • Degree
    • Degree of a node - the number of children a node has. The degree of a leaf is zero.
    • Degree of a tree - the number of children the root of the tree has.
  • Edge - connection between two nodes. In a tree, the edge will be pointing in a downward direction.
  • Leaf - a node with no children.
  • Node - the general term for a structure which contains an item, such as a character or even another data structure.
  • Parent - a node with an edge that connects to another node further from the root. We can also define the root of a tree with respect to this definition;
  • Root - the topmost node of the tree; a node with no parent.

Now we will work on creating our own implementation of a tree. These definitions will serve as a resource to us when we need refreshing on meanings; feel free to refer back to them as needed.

Welcome!

This page is the main page for Tree Traversal

Subsections of Tree Traversal

Introduction

In the last module, we covered the underlying vocabulary of trees and how we can implement our own tree. To recall, we covered: node, edge, root, leaf, parent, child, and degree.

For this module we will expand on trees and gain a better understanding of how powerful trees can be. As before, we will use the same tree throughout the module for a guiding visual example.

Family Tree Family Tree

Terms I

YouTube Video

Many of the terms used in trees relate to terms used in family trees. Having this in mind can help us to better understand some of the terminology involved with abstract trees. Here we have a sample family tree. Family Tree Family Tree

  • Ancestor - The ancestors of a node are those reached from child to parent relationships. We can think of this as our parents and our parent’s parents, and so on.
    • Let’s look at all of the ancestors of each of our nodes in the family tree.
      • Ava’s ancestors: Uzzi, Joe, Myra. This is because, Uzzi is the parent of Ava, Joe is the parent of Uzzi, and Myra is the parent of Joe. Try to work out the following and click the name to reveal the ancestors.
      Uma:Zia, Myra - **Zia** is the parent of Uma and **Myra** is the parent of Zia.
      Myra:None - Myra does not have a parent node.
      Raju: Myra - **Myra** is the parent of Raju.
      Bev: Uzzi, Joe, Myra - **Uzzi** is the parent of Bev, **Joe** is the parent of Uzzi, and **Myra** is the parent of Joe.
  • Descendant - The descendants of a node are those reached from parent to child relationships. We can think of this as our children and our children’s children and so on.
    • Let’s look at all of the descendants of each of our nodes in the family tree.
      • Ava’s descendants: None. Ava has no child nodes and thus, no descendants. Try to work out the following and click the name to reveal the descendants.
      Uma:Ang - **Ang** is the child of Uma
      Myra:Raju, Joe, Zia, Uzzi, Bert, Uma, Bev, Ava, Ang, Isla, Eoin - All of the nodes in a tree will be descendants of the root. To work it out: **Raju, Joe** and **Zia** are the children of Myra, **Uma** is the child of Zia, **Ang** is the child of Uma, and we can work the rest out for Joe's children.
      Raju:None - Raju has no child nodes.
      Bev:Isla, Eoin - **Isla** is the child of Bev and **Eoin** is the child of Isla.
  • Siblings - Nodes which share the same parent
    • We can think about the siblings of all of our nodes in the family tree.
      • Ava’s siblings: Bev - Uzzi is the parent node of Ava; Uzzi has two child nodes, Ava and Bev. Try to work out the following and click the name to reveal the siblings.
      Uma:None - Zia is the parent node of Uma; Zia has only one child node, Uma.
      Myra:None - Myra is the root and thus does not have a parent node resulting in no siblings.
      Raju:Joe, Zia - Myra is the parent node of Raju; Myra has three child nodes, **Joe**, **Zia**, and Raju
      Bev:Ava - Uzzi is the parent node of Bev; Uzzi has two child nodes, Bev and **Ava**.

Recursion Refresh

YouTube Video

Subtrees Subtrees

Info

A recursive program is broken into two parts:

  • A base case—a simple version of the problem that can be solved directly, and
  • A recursive case—a general solution to the problem that uses smaller versions of the problem to compute the solution to the larger problem.

In principle, the recursive case breaks the problem down into smaller portions until we reach the base case. Recursion presents itself in many ways when dealing with trees.

Trees are defined recursively with the base case being a single node. Then we recursively build the tree up. With this basis for our trees, we can define many properties using recursion rather effectively.

Terms II

YouTube Video

We can describe the sizes of trees and position of nodes using different terminology, like level, depth, and height.

Family Tree Family Tree

  • Level - The level of a node characterizes the distance between the node and the root. The root of the tree is considered level 1. As you move away from the tree, the level increases by one.
    • For our family tree example, what nodes are in the following levels? Think about the answer and then click corresponding arrow.
      Level 1:Myra - Level 1 is always the root
      Level 2:Raju, Joe, Zia - These are the nodes which are 1 edge away from the root.
      Level 3:Uzzi, Bert, Uma - These are the nodes which are 2 edges away from the root.
      Level 4:Bev, Ava, Ang - These are the nodes which are 3 edges away from the root.
      Level 5:Isla - This is the only node which is 4 edges away from the root.
      Level 6:Eoin - This is the only node which is 5 edges away from the root.
  • Depth - The depth of a node is its distance to the root. Thus, the root has depth zero. Level and depth are related in that: level = 1 + depth.
    • For our family tree example, what nodes have the following depths?
      Depth 0:Myra - The root will always be at depth 0.
      Depth 1:Raju, Joe, Zia - These are the nodes which are 1 edge away from the root.
      Depth 2:Uzzi, Bert, Uma - These are the nodes which are 2 edge away from the root.
      Depth 3:Bev, Ava, Ang - These are the nodes which are 3 edge away from the root.
      Depth 4:Isla - This is the only node which is 4 edges away from the root.
      Depth 5:Eoin - This is the only node which is 5 edges away from the root.
  • Height of a Node - The height of a node is the longest path to a leaf descendant. The height of a leaf is zero.
    • For our family tree example, what nodes have the following heights?
      Height 0:Raju, Eoin, Ava, Bert, Ang - The leaves always have height 0.
      Height 1:Isla, Uma - `Isla -> Eoin` and `Uma -> Ang`
      Height 2:Bev, Zia - `Bev -> Isla -> Eoin` and `Zia -> Uma -> Ang`
      Height 3:Uzzi - `Uzzi -> Bev -> Isla -> Eoin`
      Height 4:Joe - `Joe -> Uzzi -> Bev -> Isla -> Eoin`
      Height 5:Myra - `Myra -> Joe -> Uzzi -> Bev -> Isla -> Eoin`
  • Height of a Tree - The height of a tree is equal to the height of the root.
    • Our family tree would have height 5

Terms III

YouTube Video

When working with multidimensional data structures, we also need to consider how they would be stored in a linear manner. Remember, pieces of data in computers are linear sequences of binary digits. As a result, we need a standard way of storing trees as a linear structure.

Traversal Tree Traversal Tree

  • Path - a path is a sequence of nodes and edges, which connect a node with its descendant. We can look at some paths in the tree above:

    • From Q to O: QRO
    From `Q` to `Y`:`QWY`
    From `R` to `P`:`RP`
  • Traversal is a general term we use to describe going through a tree. The following traversals are defined recursively.

Preorder Traversal

  1. Access the root, record its value.
  2. Run the preorder traversal each of the children

YouTube Video

  • The Pre refers to the root, meaning the root goes before the children.
  • Remember: Root Children
  • For the above tree, the preorder traversal could result in: QWYUERIOPTA Preorder Traversal Preorder Traversal

Postorder Traversal

  1. Run the postorder traversal on each of the children
  2. Access the root, record its value

YouTube Video

  • The Post refers to the root, meaning the root goes after the children.
  • Remember: Children Root
  • For the above tree, the postorder traversal could result in: YUWEIOPRATQ Postorder Traversal Postorder Traversal

When we talk about traversals for general trees we have used the phrase ’the traversal could result in’. We would like to expand on why ‘could’ is used here. Each of these general trees are the same but their traversals could be different. The key concept in this is that for a general tree, the children are an unordered set of nodes; they do not have a defined or fixed order. The relationships that are fixed are the parent/child relationships. Traversal Tree Traversal Tree Traversal Tree1 Traversal Tree1 Traversal Tree2 Traversal Tree2

Tree Preorder Postorder
Tree 1 QWYUERIOPTA YUWEIOPRATQ
Tree 2 QETARIOPWUY EATIOPRUYWQ
Tree 3 QROPITAEWUY OPIRATEUYWQ

MyTree Recursive I

Again, we want to be able to implement a working version of a tree. From the last module, we had functions to add children, remove children, get attributes, and instantiate MyTree. We will now build upon that implementation to create a true tree.

Info

A recursive program is broken into two parts:

  • A base case—a simple version of the problem that can be solved directly, and
  • A recursive case—a general solution to the problem that uses smaller versions of the problem to compute the solution to the larger problem.

MyTree with recursion

Recall that in the previous module, we were not yet able to enforce the no cycle rule. We will now enforce this and add other tree functionality.

Info

Disclaimer: In the previous module we had a disclaimer that stated our implementation would not prevent cycles. The following functions and properties will implement recursion. Thus, we can maintain legal tree structures!

In the first module, we discussed how we can define trees recursively, meaning a tree consists of trees. We looked at the following example. Each red dashed line represented a distinct tree, thus we had five trees within the largest tree making six trees in total. Subtrees Subtrees

We will use our existing implementation from the first module. Now to make our tree recursive, we will include more getter functions as well as functions for traversals and defining node relationships.

UML UML


Get depth, height, size, and root

We can define each of these recursively.

YouTube Video

Get Depth
  • Depth - The depth of a node is its distance to the root. Thus, the root has depth zero.

We can define the depth of a node recursively:

  • Base case: we are at the root and the depth is zero
  • Recursive case: for any other node, the depth is 1 plus the depth of the parent
function GETDEPTH()
    if ROOT
        return 0
    else 
        return 1 + PARENT.GETDEPTH()
end function
Get Height
  • Height of a Node - The height of a node is the longest path to a leaf descendant. The height of a leaf is zero.

We can define the height of a node recursively:

  • Base case: we are at the leaf and the height is zero
  • Recursive case: for any other node, return 1 plus the maximum height of its children
function GETHEIGHT()
    if LEAF
        return 0
    else 
        MAX = 0
        for CHILD in CHILDREN
            CURR_HEIGHT = CHILD.GETHEIGHT()
            if CURR_HEIGHT > MAX
                MAX = CURR_HEIGHT
        return 1 + MAX
end function
Get Root
  • Root - the topmost node of the tree; a node with no parent.

We can define returning the root recursively:

  • Base case: we are at the root so return it
  • Recursive case: for any other node, return the root of the nodes parent
function GETROOT()
    if ISROOT()
        return this tree
    else
        return PARENT.GETROOT()
end function
Get Size

We define the size of a tree as the total number of children.

function GETSIZE()
    SIZE = 1
    for CHILD in CHILDREN
        SIZE += CHILD.GETSIZE()
    return SIZE
end function

Find a Value

To find a value within our tree, we will traverse down a branch as far as we can until we find the value. This will return the tree that has the value as the root.

function FIND(VALUE)
	if ITEM is VALUE
		return this node
	for CHILD in CHILDREN
		FOUND = CHILD.FIND(VALUE)
		if FOUND is not NONE
			return FOUND
	return NONE
end function

MyTree Recursive II

Determine relationships (Ancestor, Descendant, Sibling)

We can determine many relationships within the tree. For example, given a node is it an ancestor of another node, a descendant, or a sibling?

YouTube Video

Is Ancestor?

For this function, we are asking: is this node an ancestor of the current instance? In this implementation, we will start at our instance and work down through the tree trying to find the node in question. With that in mind, we can define this process recursively:

  • Base case: we are at the node in question, so return true OR we are at a leaf so return false.
  • Recursive case: run the method from each of the children of the node.
function ISANCESTOR(TREE)
    if at TREE
        return true
    else if at LEAF
        return false
    else 
        for CHILD in CHILDREN
            FOUND = CHILD.ISANCESTOR(TREE)
            if FOUND
                return true
        return false
end function
Is Descendant?

For this function, we are asking: is this node a descendant of the current instance? In this implementation, we will start at our instance and work up through the tree trying to find the node in question. With that in mind, we can define this process recursively:

  • Base case: we are at the node in question, so return true OR we are at the root so return false.
  • Recursive case: run the method from the parent of the node.
function ISDESCENDANT(TREE)
    if at TREE
        return true
    else if at ROOT
        return false
    else 
        return PARENT.ISDESCENDANT(TREE)
end function
Is Sibling?

For this function, we are asking: is this node a sibling of the current instance? To determine this, we can get the parent of the current instance and then get the parents children. Finally, we check if the node in question is in that set of children.

function ISSIBLING(TREE)
    if TREE in PARENT's CHILDREN
        return true
    else 
        return false
end function

Lowest common ancestor

In any tree, we can say that the root is a common ancestor to all of the nodes. We would like to get more information about the common ancestry of two nodes. For this function, we are asking: which node is the first place where this instance and the input node’s ancestries meet? Similar to our ISDESCENDANT, we will work our way up the tree to find the point where they meet

  • Base case: we are at our tree so return the tree OR we are at an ancestor of our tree so return the instance OR we are at the root so return nothing
  • Recursive case: run the method from the parent.
function LOWESTANCESTOR(TREE)
    if at TREE
        return TREE
    else if ISANCESTOR(TREE)
        return instance
    else if at ROOT
        return NONE
    else
        return PARENT.LOWESTANCESTOR(TREE)
end function

Path from the root

This function will generate the path which goes from the root to the current instance.

function PATHFROMROOT(PATH)
    if NOT ROOT
        PARENT.PATHFROMROOT(PATH)
    append ITEM to PATH
end function

MyTree Recursive III

Traversals

In this module we have talked about two traversals: preorder and postorder. Both of these are defined recursively and the prefix refers to the order of the root.

Preorder

In a preorder traversal, first we access the root and then run the preorder traversal on the children.

function PREORDER(RESULT)
    append ITEM to RESULT
    FOR CHILD in CHILDREN
           CHILD.PREORDER(RESULT)
end function
Postorder

In a postorder traversal, first we run the postorder traversal on the children then we access the root.

function POSTORDER(RESULT)
   FOR CHILD in CHILDREN
           CHILD.POSTORDER(RESULT)
   append ITEM to RESULT
end function

Summary

In this section, we discussed more terminology related to trees as well as tree traversals. To recap the new vocabulary:

  • Ancestor - The ancestors of a node are those reached from child to parent relationships. We can think of this as our parents and the parents of our parents, and so on.
  • Depth - The depth of a node is its distance to the root. Thus, the root has depth zero. Level and depth are related in that: level = 1 + depth.
  • Descendant - The descendants of a node are those reached from parent to child relationships. We can think of this as our children and our children’s children and so on.
  • Height of a Node - The height of a node is the longest path to a leaf descendant. The height of a leaf is zero.
  • Height of a Tree - The height of a tree is equal to the height of the root.
  • Level - The level of a node characterizes the distance the node is from the root. The root of the tree is considered level 1. As you move away from the tree, the level increases by one.
  • Path - a sequence of nodes and edges which connect a node with its descendant.
  • Siblings - Nodes which share the same parent
  • Traversal is a general term we use to describe going through a tree. The following traversals are defined recursively.
    • Preorder Traversal (Remember: Root Children):
      1. Access the root
      2. Run the preorder traversal on the children
    • Postorder Traversal (Remember: Children Root):
      1. Run the postorder traversal on the children
      2. Access the root.

Welcome!

This page is the main page for Tries

Subsections of Tries

Tries

YouTube Video

Recall that in the beginning of our discussions about trees, we looked at a small tree which contained seven strings as motivation for trees. This was a small example of a trie (pronounced ’try’) which is a type of tree that can represent sets of words. Trie Small Example Trie Small Example

Tries can be used for a variety of tasks ranging from leisurely games to accessibility applications. One example is ‘Boggle’ where players have a set of random letters and try to make as many words as possible. To code this game, we could create a vocabulary with a trie then traverse it to determine if players have played legal words. We can also use tries to provide better typing accessibility. Users could type a few letters of a word and our code could traverse the trie and suggest what letters or words they may be trying to enter.

A trie is a type of tree with some special characteristics. First it must follow the guidelines of being a tree:

  • There must be a single root,
  • Each child node has a single parent node,
  • It must be fully connected (no disjoint parts), and
  • There can be no cycles (no loops).

The special characteristics for tries are:

  • By starting at the root and traversing parent to children relationships we can build user-defined words, and
  • Each node has a boolean property to indicate if it is the end of a word.

In this course, we will display nodes with two circles as a convention to show which nodes are the end of words. Looking at this small trie as an example, we can determine which words are contained in our trie. Trie Small Example Trie Small Example We start at the root, which will typically be an empty string, and traverse to a double lined node. "" -> a -> l -> l. Thus, the word ‘all’ is contained in our trie. Words within our tries do not have to end at leaves. For example, we can traverse "" -> a for the word ‘a’. We say this trie ‘contains’ seven words: ‘a’, ‘an’, ‘and’, ‘ant’, ‘any’, ‘all’, and ‘alp’.

Trie Example

Let’s look at another example of a trie. Here we have a larger trie. Think about how many words are captured by the tree; click the tree to see how many!

![Trie Example](../../images/4/4Trie_Example.png) This tree contains **12** words: 'ate', 'an', 'and', 'ant', 'app', 'apple', 'cat', 'can', 'cup', 'by', 'be', and 'been'.

While the ‘a’, ‘at’, and ‘bee’ are words in the English language, they are not recognized by our trie. Depending on what the user intended, this could be by design. When we build our tries, users will input words that are valid for their vocabulary. Tries are not limited to the English language and can be created for any vocabulary.

MyTrie I

To implement our own trie, we will build off of MyTree that we built recursively. We will add an attribute to our tree to reinforce which nodes are words and which ones are not.

UML UML

Attributes

We have the existing attributes of MyTree: parent, children, and item. For MyTrie, we introduce the boolean attribute is_word to delineate if our trie is a word.

Adding a Word

YouTube Video

To add a word to our trie, we traverse through the trie letter by letter. We can define this recursively.

  • Base Case: length of the word is zero and it was already a word in our trie, return false (because we did not add the word) OR length of the word is zero and it is not already a word in our trie, set the boolean for the node at the end of the word to true and return true
  • Recursive case: split the string into the first character and the rest. Get the node of the first letter of the string; if it does not exist create it, then run the add word function on the remainder of the string.
function ADDWORD(WORD)
    if WORD length is 0
        if already a word
            return false
        else
            set is_word to true
            return true
    else
        FIRST = first character of WORD
        REMAIN = remainder of WORD
        CHILD = FINDCHILD(FIRST)
        if CHILD is NONE
            NODE = new MyTrie with item equal FIRST
            insert NODE into our existing trie 
            CHILD = NODE
        return CHILD.ADDWORD(REMAIN)
end function

Removing a Word

YouTube Video

Similar to adding a word, we traverse our trie letter by letter. Once we get to the end of the word set is_word to false. If the word ends at a leaf, we will remove the leaf (then if the second to last character is a leaf, we remove the leaf and so on). If the word does not end in a leaf, meaning another word uses that node, we will not remove the node.

  • Base Case: length of the word is zero and it was not a word in our trie, return false (because we did not remove the word) OR length of the word is zero and it is already a word in our trie, set the boolean for the node at the end of the word to false and return true
  • Recursive case: split the string into the first character and the rest. Get the node of the first letter of the string, if that node does not exist, return false (because we did not remove the word). If the node does exist run remove word on the child for the remainder of the word. After that, if the node’s is_word is false and it is a leaf, remove the node.
function REMOVEWORD(WORD)
    if WORD length is 0
        if already not a word
            return false
        else
            set is_word to false
            return true
    else
        FIRST = first character of WORD
        REMAIN = remainder of WORD
        CHILD = FINDCHILD(FIRST)
        if CHILD is NONE
            return false
        else
            RET = CHILD.REMOVEWORD(REMAIN)
            if CHILD is not a word AND CHILD is a leaf
                REMOVECHILD(CHILD)
            return RET
end function

Check if trie contains word

YouTube Video

Again, we will traverse the trie letter by letter. Once we get to the last letter, we can return that nodes is_word attribute. There is a chance that somewhere in our word, the letter is not a child of the previous node. If that is the case, then we return false.

function CONTAINSWORD(WORD)
    if WORD length is 0
        return `is_word`
    else
        FIRST = first character of WORD
        REMAIN = remainder of WORD
        CHILD = FINDCHILD(FIRST)
        if CHILD is NONE
            return false
        else
            return CHILD.CONTAINSWORD(REMAIN)
end function

MyTrie II

Getters

YouTube Video

Getting word count

For this function, we want to get the total number of words that are contained within our trie. We will fan out through all of the children and count all of the nodes that have their is_word attribute equal to true.

function WORDCOUNT()
    COUNT = 0
    if is_word
        COUNT = 1
    for CHILD in CHILDREN
        COUNT += CHILD.WORDCOUNT()
    return COUNT
end function

Get max word length

Next, we want to get the longest word contained in our trie. To do this, we will recurse each child and find the maximum length of the child.

  • Base Case: we are at a leaf and it is a word, return 0
  • Recursive Case: declare a maximum of -1 for a tracker and then for each child run the maximum word length function on it. If the value returned from the child is greater than our maximum tracker, set the tracker equal to the value. Once we have iterated all of the children, return the maximum tracker plus one.
function MAXWORD()
    if LEAF and is_word
        return 0
    else
        MAX = -1
        for CHILD in CHILDREN
            COUNT = CHILD.MAXWORD()
            if COUNT greater than MAX
                MAX = COUNT
        return MAX + 1
end function

Get completions

YouTube Video

This function will act as an auto-complete utility of sorts. A user will input a string of characters and we will return all of the possible words that are contained in our trie. This will happen in two phases. First, we traverse the trie to get to the end of the input string (lines 1-12). The second portion then gets all of the words that are contained after that point in our trie (lines 14-21).

function COMPLETIONS(WORD)
1.    if WORD length greater than 0
2.        FIRST = first character of WORD
3.        REMAIN = remainder of WORD
4.        CHILD = FINDCHILD(FIRST)
5.        if CHILD is none
6.            return []
7.        else
8.            COMPLETES = CHILD.COMPLETIONS(REMAIN)
9.            OUTPUT = []
10.            for COM in COMPLETES
11.                append CHILD.item + COM to OUTPUT
12.            return OUTPUT
13.    else
14.        OUTPUT = []
15.        if is_word
16.            append ITEM to OUTPUT
17.        for CHILD in CHILDREN
18.            COMPLETES = CHILD.COMPLETIONS("")
19.            for COM in COMPLETES
20.                append CHILD.item + COM to OUTPUT
21.        reutrn OUTPUT
end function

Welcome!

This page is the main page for Binary Trees

Subsections of Binary Trees

Binary Tree

YouTube Video

Binary Tree Binary Tree

A binary tree is a type of tree with some special conditions. First, it must follow the guidelines of being a tree:

  • There must be single root,
  • each child node must have a single parent node,
  • it must be fully connected (no disjoint parts), and
  • there can be no cycles (no loops).

The special conditions that we impose on binary trees are the following:

  • Each node has at most 2 children (nodes can have 0, 1, or 2 children), and
  • unlike general trees, the children in a binary tree are not an unordered set. The children must be ordered such that:
    • all of the descendants in the left tree are less than the parent’s value, and
    • all of the descendants in the right tree are greater than the parent’s value

To reinforce these concepts, we will look at examples of binary trees and examples that are not binary trees.

Binary Tree Examples

YouTube Video


Valid Binary Trees

Single Node Single Node This is a valid binary tree. We have a single node, the root, with no children. As with general trees, binary trees are built recursively. Thus, each node and its child(ren) are trees themselves.

Unbalanced Unbalanced This is also a valid binary tree. All of the left children are less than their parent. The node with item ‘10’ is also in the correct position as it is less than 12, 13, and 14 but greater than 9.

Balanced Balanced We have the same nodes but our root is now 12 whereas before it was 14. This is also a valid binary tree.

Alphabet Binary Tree Alphabet Binary Tree Here we have an example of a binary tree with alphabetical items. As long as we have items which have a predefined order, we can organize them using a binary tree.


Invalid Binary Trees

Alphabet Non-Binary Tree Alphabet Non-Binary Tree We may be inclined to say that this is a binary tree: each node has 0, 1, or 2 children and amongst children and parent nodes, the left child is smaller than the parent and the right child is greater than the parent. However, in binary trees, all of the nodes in the left tree must be smaller than the root and all of the nodes in the right tree must be larger than the root. In this tree, D is out of place. Node D is less than node T but it is also less than node Q. Thus, node D must be on the right of node Q.

Too Many Children Too Many Children In this case, we do not have a binary tree. This does fit all of the criteria for being a tree but not the criteria for a binary tree. Nodes in binary trees can have at most 2 children. Node 30 has three children.

Traversals

YouTube Video

In the first module we discussed two types of traversals: preorder and postorder. Within that discussion, we noted that for general trees, the preorder and postorder traversal may not be unique. This was due to the fact that children nodes are an unordered set.

Info

We are now working with binary trees which have a defined child order. As a result, the preorder and postorder traversals will be unique! These means that for a binary tree when we do a preorder traversal there is exactly one string that is possible. The same applies for postorder traversals as well.

Recall that these were defined as such:

  • Preorder Traversal (Remember: Root Children):
    1. Access the root
    2. Run the preorder traversal on the children
  • Postorder Traversal (Remember: Children Root):
    1. Run the postorder traversal on the children
    2. Access the root.

Now for binary trees, we can modify their definitions to be more explicit:

  • Preorder Traversal (Remember: Root Left Right):
    1. Access the root
    2. Run the preorder traversal on the left child
    3. Run the preorder traversal on the right child
  • Postorder Traversal (Remember: Left Right Root):
    1. Run the postorder traversal on the left child
    2. Run the postorder traversal on the right child
    3. Access the root.

Let’s practice traversals on the following binary tree. Traversal Tree Traversal Tree

Preorder Traversal

Preorder Preorder

Postorder Traversal

Postorder Postorder

In-Order Traversal

YouTube Video

Since we have fixed order on the children, we can introduce another type of traversal: in-order traversal.

  • In-order Traversal:

    1. Run the in-order traversal on the left child
    2. Access the root, write its value
    3. Run the in-order traversal on the right child
    • Remember: Left Root Right

In-order In-order

MyBinaryTree

YouTube Video

Our implementation of binary trees will inherit from our MyTree implementation as binary trees are types of trees. Thus, MyBinaryTree will have the functionality of MyTree in addition to the following.

UML UML

Attributes

The binary tree has two attributes

  • Left Child: an instance of MyBinaryTree, the item should be less than the item of the parent.
  • Right Child: an instance of MyBinaryTree, the item should be greater than the item of the parent.

Miscellaneous Functions

  • Get Size

    • Will override the MyTree size function. If the tree is empty then we return zero. If the tree is not empty then call the MyTree size function.
  • Is Empty

    • Will return true if the node we have called the function from is empty and false if otherwise.
  • To Sorted List

    • Will get all of the nodes items and sort them
function TOSORTEDLIST()
    LIST = []
    if there`s LEFTCHILD
        LIST = LIST + LEFTCHILD.TOSORTEDLIST
    LIST = LIST + ITEM
    if there`s RIGHTCHILD
        LIST = LIST + RIGHTCHILD.TOSORTEDLIST
    return LIST

Inserting Children

YouTube Video

When inserting children to a binary tree, we must take some special considerations. All of the node items in the left tree must be less than the parent node item and all of the node items in the right tree must be greater than the parent node item.

The general procedure for adding a child is the following: Binary Tree Flowchart Binary Tree Flowchart

Suppose that we have the following tree and we want to add a node with item ‘85’. Click the binary tree to see the resulting tree.

![Tree To Add To](../../images/4/4Binary_Add.png) ![Tree Adding](../../images/4/4Binary_AddChild.gif)
function INSERT(VALUE)
    if node is empty:
        set nodes item to value
    else:
        if node.ITEM is VALUE
            return false
        else if node.ITEM > VALUE 
            LC = node`s left child
            if LC is NONE
                CHILD = new BINARYTREE with root.ITEM equal VALUE
                add CHILD to nodes children
                set node.LEFTCHILD equal to CHILD
                return true
            else
                return LC.INSERT(VALUE)
        else
            RC = node`s right child
            if RC is NONE
                CHILD = new BINARYTREE with root.ITEM equal VALUE
                add CHILD to nodes children
                set node.RIGHTCHILD equal to CHILD
                return true
            else
                return RC.INSERT(VALUE)
end function

Removing Children

YouTube Video

Removing children is not as straightforward as inserting them. The general procedure for removing a child is to replace that nodes value with its smallest right descendant. First we will traverse the binary tree until we find the node with the value we are trying to remove (lines 18-32 below). Then we have three separate cases, discussed in detail below.

Removing a Leaf

Removing a leaf is the most straightforward. We remove the value from the node and then sever the connection between parent and child. (lines 5-7 below)

Suppose we have this binary tree and we want to remove value 5. What do you think the resulting binary tree will look like? Click the binary tree to see the result.

![Tree to Remove Leaf](../../images/4/4Bin_Remove.png) ![Result of Remove Leaf](../../images/4/4Bin_Remove2.png)

Removing a Node without Right Child

When we remove a value from a node that does not have a right child, we cannot replace the value with the smallest right child. In this instance we will instead replace the value with the smallest left child then prune the tree to clean it up. Once we replace the value, we must switch the node’s left child to be the right child in order to maintain proper binary tree structure. (lines 8-13 below)

Suppose we have this binary tree and we want to remove value 4. What do you think the resulting binary tree will look like? Click the binary tree to see the result.

![Tree to Remove w/o RightChild](../../images/4/4Bin_Remove2.png) ![Result of Remove w/o RightChild](../../images/4/4Bin_Remove3.png)

Removing a Node with Right Child

When we remove a value from a node that has a right child, we can replace the value with the nodes smallest right child. (Lines 14-17 Below)

Suppose we have this binary tree and we want to remove value 10. What do you think the resulting binary tree will look like? Click the binary tree to see the result.

![Tree to Remove with RightChild](../../images/4/4Bin_Remove.png) ![Result of Remove with RightChild](../../images/4/4Bin_Remove1.png)

Complete Pseudocode

1. function REMOVE(VALUE)
2.    if node is empty:
3.        error
4.    if node.ITEM is VALUE
5.        if node is a leaf
6.            set node.ITEM to none
7.            return TRUE
8.        else if node has no right child
9.            node.ITEM = LEFTCHILD.REMOVESMALLEST()
10.           prune left-side
11.           store left child in right child
12.           set left child to none    
13.           return TRUE
14.        else
15.            node.ITEM = RIGHTCHILD.REMOVESMALLEST()
16.            prune right-side
17.            return TRUE
18.    else
19.        if node.ITEM > VALUE
20.            if node has LEFTCHILD
21.                SUCCESS = LEFTCHILD.REMOVE(VALUE)
22.                prune left-side
23.                return SUCCESS
24.            else
25.                return FALSE
26.        else
27.            if node has RIGHTCHILD
28.                SUCCESS = RIGHTCHILD.REMOVE(VALUE)
29.                prune right-side
30.                return SUCCESS
31.            else
32.                return FALSE
33. end function

Extras for Removal

We use the pruning functions to severe the tie between parent and child nodes.

function PRUNERIGHT()
    if RIGHTCHILD has no value
        REMOVECHILD(RIGHTCHILD)
        set this nodes RIGHTCHILD former RIGHTCHILDs RIGHTCHILD
        if RIGHTCHLID is not none
            ADDCHILD(RIGHTCHILD)
end function
function PRUNELEFT()
    if LEFTCHILD has no value
        REMOVECHILD(LEFTCHILD)
        set this nodes LEFTCHILD former LEFTCHILDs RIGHTCHILD
        if LEFTCHILD is not none
            ADDCHILD(LEFTCHILD)
end function

We use the remove smallest function to retrieve the smallest value in the binary tree which will replace our value.

function REMOVESMALLEST()
    if node has left child
        REPLACEMENT = LEFTCHILD.REMOVESMALLEST
        prune left-side
        return REPLACEMENT
    else
        REPLACEMENT = node.ITEM
        if node has right child
            node.ITEM = RIGHTCHILD.REMOVESMALLEST()
            prune right-side
        else
            node.ITEM = NONE
        return REPLACEMENT
end function

Balance

Unbalanced Unbalanced While this is a valid binary tree, it is not balanced. Let’s look at the following tree.

Balanced Balanced We have the same nodes but our root is now 12 whereas before it was 14. This is a valid binary tree. We call this a balanced binary tree. A balanced binary tree looks visually even amongst the left and right trees in terms of number of nodes.

Note: Balancing is not necessary for a valid binary tree. It is, however, important in terms of time efficiency to have a balanced tree. For example, the number of actions when inserting an element is about the same as the number of levels in the tree. If we tried to add the value 11 into the unbalanced tree, we would traverse 5 nodes. If we tried to add the value 11 in to the balanced tree, we would traverse just 3 nodes.

We believe that balancing binary trees is out of the scope of this course. If you are interested in how we might balance a tree, feel free to check out these videos by Dr. Joshua Weese.

YouTube Video

YouTube Video

Iii-graphs

Subsections of Iii-graphs

Welcome!

This page is the main page for Graphs: Matrix Representation

Subsections of Graphs: Matrix Representation

Introduction

YouTube Video

The next data structure we will introduce is a graph.

Graph Example 1 Graph Example 1

Graphs are multidimensional data structures that can represent many different types of data. We can use graphs to represent electronic circuits and wiring, transportation routes, and networks such as the Internet or social groups.

A popular and fun use of graphs is to build connections between people such as Facebook friends or even connections between performers. One example is the parlor game Six Degrees of Kevin Bacon . Players attempt to connect Kevin Bacon to other performers through movie roles in six people or less.

Six Degrees of Kevin Bacon Six Degrees of Kevin Bacon

For example, Laurence Fishburne and Kevin Bacon are directly connected via ‘Mystic River’. Keanu Reeves and Kevin Bacon have never performed in the same film, but Keanu Reeves and Laurence Fishburne are connected via ‘The Matrix’. Thus, Keanu and Kevin are connected via Laurence.

In this module we will discuss graphs in more detail and build our own implementation of graphs.

Terms I

YouTube Video

We will discuss some of the basic terminology associated with graphs. Some of this vocabulary should feel familiar from the trees section; trees are a specific type of graph!

  • Nodes: Node is the general term for a structure which contains an item.
    • Size: The size of a graph is the number of nodes.
    • Capacity: The capacity of a graph is the maximum number of nodes.
Info

Nodes can be, but are not limited to the following examples: - physical locations (IE Manhattan, Topeka, Salina), - computer components (IE CPU, GPU, RAM), or - people (IE Kevin Bacon, Laurence Fishburne, Emma Stone)

  • Edges: Edges are the connection between two nodes. Depending on the data, edges can represent physical distance, films, cost, and much more.
    • Adjacent: Node A and node B are said to be adjacent if there is an edge from node A to node B.
    • Neighbors: The neighbors of a node are nodes which are adjacent to the node.
Info

Edges can be, but are not limited to: - physical distances, like the distance between cities or wiring between computer components, - cost, like bus fares, and - films, like the Six Degrees of Kevin Bacon example

  • Cycles: A cycle is a path where the first and last node are the only repeated nodes. More explicitly, this means that we start at node A and are able to end up back at node A.

Example

For example, we can translate the Amtrak Train Station Connections into a graph where the edges represent direct train station connections.

Amtrak Train Graph Amtrak Train Graph ^[Generated using the Amtrak system map from 2018. This graph does not include all stations or connections.]

Within this context, we could say that Little Rock and Fort Worth are adjacent. The neighbors of San Antonio are Fort Worth, Los Angeles, and New Orleans. The Amtrak Train Graph has multiple cycles. One of these is Kansas City -> St. Louis -> Chicago -> Kansas City.

Graph Features

YouTube Video

While trees are a type of graph, graphs can have more functionality than trees. For example, recall that to be a single tree, there could be no disconnected pieces.

  • Connectedness: Graphs do not require being fully connected. There can be disconnected portions within a graph. For example, the following graph shows all of the students in a sophomore biology class. There is an edge between two student nodes if they are Facebook friends.

Friends Group Friends Group

Graphs can also have loops. In a tree, this would be like a node being its own parent, which is not an allowable condition.

  • Loops: Loops are edges which connect a node to itself. These can be useful in depicting graphs that show control flow in programming. In this example, node A is connected to node B and node A is connected to itself.

With a Loop With a Loop

Weighted Graphs

YouTube Video

A weighted graph is a graph which has weights associated with the edges. These weights quantify the relationships, so they can represent dollars, minutes, miles, and many other factors which our data may depend upon.

Weights are not limited to physical quantities; they can also be our own defined similarity in text, product types, and anything for which we can create a similarity measure for. Let’s look at concrete weights using the Amtrak example.

We are able to expand the Amtrak graph from the previous page to include approximate distances in miles between cities.

Amtrak Train Graph with Weights Amtrak Train Graph with Weights ^[Generated using the Amtrak system map from 2018. This graph does not include all stations or connections. Distance was calculated approximately ‘as the crow flies’.]

Now that we have weights defined on our edges, we can compare paths in a different way. When we discussed trees, we just looked at the number of edges it took to get to another node. We can also determine the shortest path between nodes with respect to distance. If we wanted to travel from San Antonio to Kansas City, we may be tempted to travel San Antoinio -> Los Angeles -> Albuquerque -> Kansas City as it has the fewest stops. This trip would take us 2,531 miles (1201+640+690). With the edge weights in mind, a much better route would be San Antonio -> Fort Worth -> Little Rock -> St. Louis -> Kansas City with a total of 1,089 miles(238+320+293+238) traveled.

Directed Graphs

YouTube Video

A directed graph is a graph that has a direction associated with each edge. For example, trees are a directed graph. The edge orientation will imply a fixed direction that we can move about nodes. As with trees, the flat end of the arrow will represent the origin and the arrowhead will represent the destination. If an edge has no arrowheads, then it is assumed that we can traverse both directions.

In the following graph, we have an example distribution network where each store ends up with 5 units in its possession. For example, nine units go from the distribution center to Store A. The distribution center will never receive product from stores as it has no incoming edges.

Sample Distribution Network Sample Distribution Network

Unlike trees, directed graphs can have nodes with multiple incoming edges. We can see an example of this at Store B. The distribution center and Store A both send units to Store B.

Info

In directed graphs, we must be cautious on how we define adjacent. For the following, we would say that the source is adjacent to the target. However, the target is not adjacent to the source.

Source and Target Example Source and Target Example

Formally, node A and node B are said to be adjacent if there is an edge from node A to node B.

When discussing directed graphs, we must also talk about undirected graphs. An undirected graph is a graph in which none of the edges have an orientation. If there is at least one directed edge, then it is considered a directed graph.

  • Undirected Edge: An undirected edge is an edge which has no defined orientation (IE no arrowheads) which implies that we can traverse in either direction. If node A and node B are connected via an undirected edge then we say node A is adjacent to node B and node B is adjacent to node A.
Info

For the following undirected edge, we would say that the source is adjacent to the target and the target is adjacent to the source.

Source and Target Example Source and Target Example

Info

Graph types and appearances can vary wildly. We are not limited to just weighted/unweighted or directed/undirected. We can also have combinations of weighted and directed.

Example

In the following graph, we have an example of a weighted and directed map. This map represents a zoo train where each node represents a station and each edge is a part of the track. Zoo guests can get on and off wherever they desire.

This graph is weighted as guests must pay the associated fee for each part of the track. Our example train also has a one way direction in most cases. The exception to this is the entrance/exit to the aquarium, this part of the track can go either direction.

In this graph, we also have a couple of loops. This would allow for zoo-guests to ride the train around an expansive exhibit such as the elephants or giraffes.

One possible way to tour the zoo for a guest starting at the entrance could be: aquarium, primates, big cats, antelope, giraffes, loop around the giraffes, elephants, aquarium, then exit. Their total payment for just the train would be $14.

Zoo Train Map Zoo Train Map

Matrix Representation

YouTube Video

The first way that we can represent graphs is as matrices. In a matrix representation of a graph, we will have an array with all the nodes and a matrix to depict the edges. The matrix that depicts the edges is called the adjacency matrix.

To build the adjacency matrix, we go through the nodes and edges. If there is an edge with weight w going from i to j, then we put w in the (i,j) spot in our adjacency matrix. If there is no edge from i to j then we put infinity in the spot (i,j). Let’s look at some examples.

Info

An edge that starts at source and ends at target will result in an entry at (source,target) in the adjacency matrix.

Source and Target Example Source and Target Example

Info

For an unweighted graph, we treat the weights as 1 for all edges in our adjacency matrix.

For an undirected edge between nodes i and j, we put an edge from i to j and an edge from j to i.

Example 1

Suppose that we have the following graph:

Matrix Representation Example Matrix Representation Example

Across the top of the following, we have the array of nodes. This give us the index at which each node is located. For example, node A is in spot 1, node B is in spot 2, node C is in spot 3 and so on.

Below that we have the adjacency matrix. For the directed edge with weight 2 that goes from node B to node C, we have the value 2 at (2,3) in the adjacency matrix as B has index 2 and C has index 3. For the directed edge with weight 4 that goes from node A to node F, we have the value 4 at (1,6) in the adjacency matrix as A has index 1 and F has index 6.

Since there is no edge that connects from node A to node B, we have infinity in (1,2).

Matrix Matrix

Example 2

Now suppose we have this graph. We now have some loops present. Matrix Representation Example 2 Matrix Representation Example 2

For example, we have a loop on node E with weight 12 so we will put the value 12 in spot (5,5) as E has index 5.

Matrix 2 Matrix 2

Example 3

Now suppose we have this graph which is undirected and unweighted. Matrix Representation Example 3 Matrix Representation Example 3

Since this graph is unweighted, we will treat all edges as though they have weight equal to one. Since this graph is undirected, each edge will essentially show up twice.

For example, for the edge that connects nodes A and B, we will have an entry in our adjacency matrix at (1,2) and (2,1).

Matrix 3 Matrix 3

UML

YouTube Video

Matrix Graph UML Matrix Graph UML

Attributes

  • nodes: This will keep track of the nodes which are in our graph as well as the node values. The nodes can have any type of value such as numbers, characters, and even other data structures.
  • edges: This will keep track of the edges which are in our graph.
  • size: This will keep track of the number of nodes that are active in our graph.

Upon initialization, we will initialize nodes to be an empty array of size capacity, edges to be an empty two-dimensional array with dimensions capacity by capacity and size to be zero as we start with no actual nodes.

Getters

  • get nodes: returns a list of the nodes with their respective indexes
function GETNODES()
    LIST = []
    for NODE in NODES
        if NODE has a VALUE
            append (VALUE, INDEX) to LIST
    return LIST
  • get edges: returns a list of the edges in the format (source, target, weight)
function GETEDGES()
    LIST = []
    for ROW in EDGES
        for COL in ROW
            VALUE = entry at (ROW,COL)
            if VALUE is not infinity
                    append (ROW,COL,VALUE) to LIST
    return LIST
  • get node: returns the node with the given index. If the index is within the possible range, then we return the value of that node.

  • find node: returns the index of the given node. We iterate through our nodes and if we find that value, then we return the index. Otherwise, return -1.

  • get edge: returns the weight of the edge between the given indexes of the source node and target node. If one or both of the indexes are out of range, then we should return infinity.

  • get capacity: returns the maximum number of nodes we are allowed to have. Upon initialization, we will have a fixed number of possible nodes in our node array. We can simply return the size of this array.

  • get size: returns the size attribute.

  • get number of edges: returns the number of edges currently in the graph. We will iterate through our edges and return the number of entries that were not infinity.

  • get neighbors: returns the neighbors of the given node. We will access our row adjacency matrix that corresponds to the node and return the indexes and values of those entries which are not infinity.

function GETNEIGHBORS(IDX)
    if IDX in range of NODES length
        LIST = []
        ROW = the IDX-th row of EDGES
        for J in range  0 to ROW length
            VALUE = J-th entry of ROW
            if VALUE is not infinity
                    append (J,VALUE) to LIST
        return LIST

Node and Edge Functions

YouTube Video

  • add node: will add a node to the graph with the given value if our graph still has room. Procedurally, we will try to put the node in the first empty place we find. To do this, we start with IDX equal to negative one then loop through all of the indexes of the graphs nodes attribute. At each index, we check if that entry is equal to the value we are trying to add. This will check if the value is already in our graph. If there is nothing in that entry and the IDX variable is still negative one, then we will set IDX equal to that index. We continue looping through the nodes attribute until we reach the end. It is possible that there is more than one open space in the nodes attribute. Thus, by checking if IDX is still negative one we can make sure to put value in the first empty spot. Once we finish going through nodes we check to see if we ever found an open spot. If IDX is still negative one, this would indicate that there was no room. Otherwise, we put value into nodes at spot IDX and increment the size.
function ADDNODE(VALUE)
    IDX = -1
    for NODE in NODES
        if NODE is VALUE 
            return NODE's index
        if NODE has no entry and IDX is -1
            IDX = NODE's index
    if IDX is not -1
        add VALUE to NODES at position IDX
        increment SIZE
    return IDX
  • remove node: will remove a node to the graph with the given value if our graph has the node. We will set the node to be empty and remove any edges that may be attached to it.
function REMOVENODE(IDX)
    if IDX is in the range of our indexes 
        if NODES at position IDX is not empty
            set NODES at IDX to be empty
            decrement SIZE by one
            for J in node indexes 
                set EDGES (J,IDX) equal to infinity 
                set EDGES (IDX,J) equal to infinity
            return true
        else
            return false
    else
        return false 
  • add edge: will add an edge with the given weight which goes from the source node to the target node
function ADDEDGE(SOURCE, TARGET, WEIGHT)
    if SOURCE and TARGET are both in the range of our node indexes
        set EDGES(SOURCE, TARGET) equal to WEIGHT
        return true 
    else
        return false
  • remove edge: will remove the edge which goes from the source node to the target node
function REMOVEEDGE(SOURCE, TARGET)
    if SOURCE and TARGET are both in the range of our node indexes
        if EDGES(SOURCE, TARGET) is not equal to infinity
            set EDGES(SOURCE, TARGET) equal to infinity
            return true 
        else
            return false
    else
        return false
  • add undirected edge: will add two edges with the given weight between the two given nodes
function ADDUNDIRECTEDEDGE(NODE1, NODE2, WEIGHT)
    RES = ADDEDGE(NODE1, NODE2, WEIGHT)
    RES = RES and ADDEDGE(NODE2, NODE1, WEIGHT)
    return RES
  • remove undirected edge: will remove two edges between the two given nodes. We can utilize the remove edge function on ‘NODE1’ to ‘NODE2’ and then on ‘NODE2’ to ‘NODE1’.
function REMOVEUNDIRECTEDEDGE(NODE1, NODE2)
    RES = REMOVEEDGE(NODE1, NODE2)
    RES = RES and REMOVEEDGE(NODE2, NODE1)
    return RES

Summary

In this module, we have introduced the graph data structure. We also looked at how we would implement a graph using a matrix representation. We introduced the following new concepts in this module:

  • Directed Graphs: A directed graph is a graph that has a direction associated with each edge. The flat end of the arrow will represent the origin and the arrowhead will represent the destination. If an edge has no arrowheads, then it is assumed that we can traverse both directions.

  • Edges: Edges are the connection between two nodes. Depending on the data, edges can represent physical distance, films, cost, and much more.

    • Adjacent: Node A and node B are said to be adjacent if there is an edge from node A to node B.
    • Neighbors: The neighbors of a node are nodes which are adjacent to the node.
    • Undirected Edge: An undirected edge is an edge which has no defined orientation (IE no arrowheads). If node A and node B are connected via an undirected edge then we say node A is adjacent to node B and node B is adjacent to node A.
  • Loops: Loops are edges which connect a node to itself.

  • Nodes: Node is the general term for a structure which contains an item.

    • Size: The size of a graph is the number of nodes.
    • Capacity: The capacity of a graph is the maximum number of nodes.
  • Weighted Graphs: A weighted graph is a graph which has weights associated with the edges. These weights will quantify the relationships so they can represent dollars, minutes, miles, and many other factors which our data may depend on.

In the next module, we will look at a list implementation of graphs and when we might use one implementation over the other.

Welcome!

This page is the main page for Graphs: List Representation

Subsections of Graphs: List Representation

Introduction

YouTube Video

In the previous module, we introduced graphs and a matrix-based implementation. For this module, we will continue working with graphs and change our implementation to lists.

Why Another Implementation?

When using graphs, a lot of situational variation can occur. Some graphs can have a few nodes with many edges, many nodes with few edges, and so on. When we use the matrix implementation, we initialize a matrix with the number of columns and rows equal to the number of nodes. For example, if we have a graph with 20 nodes, our adjacency matrix would have 20 rows and 20 columns, resulting in 400 potential entries.

First let’s look at the implementation and then we will discuss when one may be better than the other.

List Representation

YouTube Video

In the matrix representation, we had an array of the node items. In the list representation, we will have an array of node objects. Each node object will keep track of the node item, the node index, and the outgoing edges.

Set Up Set Up

The item can be any object and the index will be a value within our capacity. The edges will be a list of pairs where the first entry is the index of the target node and the second entry is the weight of the edge.

Since each node will track its neighbors, it is important that we are consistent in our indexing of nodes. If our nodes were to get out of order, then our edges would as well.

Example 1

Consider the following graph which we saw in the matrix representation.

Example 1 Example 1

The following list of nodes depicts the graph above. We can see that each node object has the item and index.

If we look closer at the edges of the node with item A and index 1, we see that the set of edges is equal to [(4, 3.0), (6, 4.0)]. This corresponds to the fact that there are two edges with the source as node 1. The first ordered pair, (4, 3.0), means that there is an edge with source node 1 (A) and target node 4 (D) that has weight 3. We can confirm that in our graph we do have an edge from A to D with weight 3.

List Representation for Example 1 List Representation for Example 1

Example 2

The following includes a couple of examples of loops within our graph.

Example 2 Example 2

We have loops on nodes D, E, and F in our graph. Recall that a loop is an edge where the source and target are the same. For example, we have an edge with source D and target D that has weight 12. We see this in our list representation in the node object with item D and index 4, where we have the entry (4,12.0) in the edges.

List Representation for Example 2 List Representation for Example 2

Dense VS Sparse

YouTube Video

When considering which implementation to use, we need to consider the connectivity in our graph. The terms that we use to describe the connectedness are dense and sparse.

  • Dense Graph: A dense graph is a graph in which there is a large number of edges. Typically in a dense graph, the number of edges is close to the maximum number of edges.
  • Sparse Graph: A sparse graph is a graph in which there is a small number of edges. In this case the number of edges is considerably less than the maximum number of edges.
Info

Intuitively, we can think of dense and sparse in terms of populations. For example, if 100 people lived in a city block, we can consider that to be densely populated. If 100 people lived in 100 square miles we can consider that to be sparsely populated.

Let’s look at some motivating examples to get an idea of how the different structures will handle these cases.

Dense

The following is a dense graph. In this case, our graph does have the maximum number of edges. This means that every node is connected to every other node including itself.

Dense Graph Dense Graph Dense Graph as Matrix Dense Graph as Matrix Dense Graph as List Dense Graph as List

Sparse

The following is a sparse graph.

Sparse Graph Sparse Graph Sparse Graph as Matrix Sparse Graph as Matrix Sparse Graph as List Sparse Graph as List

List or Matrix?

For dense graphs, the matrix representation will have better qualities as we are already setting aside space for the maximum number of edges. Sparse graphs are better represented in the list representation.

When we initialize the matrix implementation, we initialize the nodes attribute to have dimension equal to the capacity of the graph. The edges attribute is initialized to be a square matrix with dimension equal to capacity by capacity. Thus, if we have a sparse matrix, we are representing a lot of non-existent edges.

When we initialize the list implementation, we just have the nodes attribute which has dimension equal to the capacity and each node tracks its own edges. If we have a dense matrix and we are searching for an edge, we must loop through each edge from the target node to see if the edge exists. In the matrix representation, we can access that edge directly.

Info

If the proportion of edges to the maximum number of edges is greater than 1/64, then the matrix representation is better in terms of space.

UML - Graph Node

YouTube Video

In this representation, we will have an array of graph node objects. We will first cover the UML for the graph node objects and then discuss the graph functions and attributes.

List Graph UML for Nodes List Graph UML for Nodes

Attributes

  • item: the value that the node contains.
  • index: the index of the node.
  • edges: ordered pairs (e, w) where this node is the source, e is the target node index, and w is the weight of the edge as a double.

We will initialize a graph node with the given item and the given index. We initialize the edges attribute to be an empty list.

Getters

  • get item: Returns the graph node’s item.

  • get index: Returns the graph node’s index.

  • get edges: Returns the graph node’s edges.

  • get edge: From the source node, we will call the get edge function with the index of the target node as input. This will return the edge weight.

function GETEDGE(TARINDEX)
    for EDGE in nodes EDGES
        if the first element in EDGE is TARINDEX
            return the second element in EDGE
    return infinity 

Edge Functions

Info

Working with the edges in our graph becomes slightly more complicated in the list representation. Previously, we were able to go right to the entry in our adjacency matrix and update it. Since each node keeps track of its own edges in no particular order, we must loop through each entry of the edges attribute to find a potential edge.

  • add edge: From the source node, we will call the add edge function with the target node as input as well as the weight. First, we will attempt to remove the edge. We need to do this as we do not want duplicate edges in our graph. Then we will add the ordered pair to the edges attribute.
function ADDEDGE(TARINDEX, WEIGHT)
    call REMOVEEDGE(TARINDEX) on this node
    append (TARINDEX, WEIGHT) to this nodes EDGES 
  • remove edge: From the source node, we will call the remove edge function with the target node as input. This will return true if it was successful and false if not.
function REMOVEEDGE(TARINDEX)
    for EDGE in nodes EDGES
        if the first element in EDGE is TARINDEX
            remove EDGE from EDGES
            return true
    return false 

UML - Graph

YouTube Video

List Graph UML List Graph UML

Attributes

  • nodes: This will keep track of the nodes which are in our graph as well as the node values. The nodes can have any type of value such as numbers, characters, and even other data structures.
  • size: This will keep track of the number of nodes that are active in our graph.

Upon initialization, we will initialize nodes to be an empty array with dimension capacity and size to be zero as we start with no actual nodes.

Getters

  • get nodes: returns a list of the nodes with their respective indexes. This will be the same logic from our matrix graph.
function GETNODES()
    LIST = []
    for NODE in NODES
        if NODE has a VALUE
            append (VALUE, INDEX) to LIST
    return LIST
  • get edges: returns a list of the edges in the format (source, target, weight).
function GETEDGES()
    LIST = []
    for NODE in NODES
        if NODE is not empty
            for EDGE in NODE EDGES
                TAR = first entry of EDGE
                WEIGHT = second entry of EDGE
                append (NODE,TAR,WEIGHT) to LIST
    return LIST
  • get node: returns the node with the given index. If the index is within the possible range, then we return the value of that node. This will be the same logic from our matrix graph.

  • find node: returns the index of the given node. We iterate through our nodes and if we find that value, then we return the index. Otherwise, return -1. This will be the same logic from our matrix graph.

  • get edge: returns the weight of the edge between the given indexes of the source node and target node. If one or both of the indexes are out of range, then we should return infinity. From the source node object, we will call the graph node get edge function on the target index.

function GETEDGE(SRC,TAR)
    if SRC and TAR are between 0 and capacity
        SRCNODE = the node at index SRC of the NODES attribute
        WEIGHT = call the graph node GETEDGE from SRCNODE on TAR
        return WEIGHT
    else
        return infinity
  • get capacity: returns the maximum number of nodes we are allowed to have. Upon initialization, we will have a fixed number of possible nodes in our node array. We can simply return the size of this array. This will be the same logic from our matrix graph.

  • get size: returns the size attribute. This will be the same logic from our matrix graph.

  • get number of edges: returns the number of edges currently in the graph.

function NUMBEROFEDGES()
    COUNT = 0
    for NODE in NODES
        if NODE is not empty
            for EDGE in NODE EDGES
                increment COUNT by one
    return COUNT
  • get neighbors: returns the neighbors of the given node. We will access our row adjacency matrix that corresponds to the node and return the indexes and values of those entries which are not infinity.
function GETNEIGHBORS(IDX)
    SRCNODE = the node at index IDX of the NODES attribute
    if SRCNODE is not empty
        return SRCNODE's edges 
    else
        return nothing
        

Node and Edge Functions

  • add node: will add a node to the graph with the given value if our graph still has room. Finding a location for the node will be the same procedure as the matrix graph. If we find an open spot to add the node, we will instantiate a new graph node and insert it into the nodes attribute.
function ADDNODE(VALUE)
    IDX = -1
    for NODE in NODES
        if NODE is VALUE 
            return NODE's index
        if NODE has no entry and IDX is -1
            IDX = NODE's index
    if IDX is not -1
        NEWNODE = graph node with VALUE and IDX for input
        add NEWNODE to NODES at position IDX
        increment SIZE
    return IDX
  • remove node: will remove a node to the graph with the given value if our graph has the node. We will set the node to be empty. When we set the node to be empty, we clear all of the outgoing edges, so we just need to loop through the other nodes removing any possible incoming edges.
function REMOVENODE(IDX)
    if IDX is in the range of our indexes 
        if NODES at position IDX is not empty
            set NODES at IDX to be empty
            decrement SIZE by one
            for NODE in NODES
                if NODE has no entry
                    from NODE call the graph node REMOVEEDGE function on IDX
            return true
        else
            return false
    else
        return false 
  • add edge: will add an edge with the given weight which goes from the source node to the target node
function ADDEDGE(SRC, TAR, WEIGHT)
    if SOURCE and TARGET are both in the range of our node indexes
        SRCNODE = the node at index SRC of the NODES attribute
        if SRCNODE is not empty
            from SRCNODE call the graph node ADDEDGE with TAR and WEIGHT as input
            return true 
        else
            return false
    else
        return false
  • remove edge: will remove the edge which goes from the source node to the target node
function REMOVEEDGE(SOURCE, TARGET)
    if SOURCE and TARGET are both in the range of our node indexes
        SRCNODE = the node at index SRC of the NODES attribute
        if SRCNODE has no entry
            RET = SRCNODE call the graph node REMOVEEDGE with TAR as input
            return RET 
        else
            return false
    else
        return false
  • add undirected edge: will add two edges with the given weight between the two given nodes
function ADDUNDIRECTEDEDGE(NODE1, NODE2, WEIGHT)
    RES = ADDEDGE(NODE1, NODE2, WEIGHT)
    RES = RES and ADDEDGE(NODE2, NODE1, WEIGHT)
    return RES
  • remove undirected edge: will remove two edges between the two given nodes.
function REMOVEUNDIRECTEDEDGE(NODE1, NODE2)
    RES = REMOVEEDGE(NODE1, NODE2)
    RES = RES and REMOVEEDGE(NODE2, NODE1)
    return RES

Summary

Example 1 Example 1

In this module, we introduced a new way to store the graph data structure. Thus, we now have two ways to work with graphs, in lists and in matrices:

List Representation

List Representation for Example 1 List Representation for Example 1

Matrix Representation

Matrix Representation for Example 1 Matrix Representation for Example 1

While these methods show the same information, there are cases when one way may be more desirable than the other.

We discussed how a sparse graph is better suited for a list representation and a dense graph is better suited for a matrix representation. We also touched on how working with the edges in a list representation can add complexity to our edge functions. If we are needing to access edge weights or update edges frequently, a matrix representation would be a good choice – especially if we have a lot of nodes.

Welcome!

This page is the main page for Graphs: Searching and Traversing

Subsections of Graphs: Searching and Traversing

Introduction

YouTube Video

In the previous modules, we have introduced graphs and two implementations. This module will cover the traversals through graphs as well as path search techniques.

Motivation

As we have discussed previously, graphs can have many applications. Based on that, there are many things that we may want to infer from graphs. For example, if we have a graph that depicts a railroad or electrical network, we could determine what maximum flow of the network. The standard approach for this task is the Ford-Fulkerson Algorithm . In short, given a graph with edge weights that represent capacities the algorithm will determine the maximum flow throughout the graph.

From the matrix graph module, we used the following distribution network as an example. Sample Distribution Network Sample Distribution Network

Conceptually, we would want to determine the maximum number of units that could leave the distribution center without having excess laying around stores. Using the maximum flow algorithm, we would determine that the maximum number of units would be 15.

Max Flow Max Flow

The driving force in the Ford-Fulkerson algorithm, as well as other maximum flow algorithms, is the ability to find a path from a source to a target. Specifically, these algorithms use breadth first and depth first searches to discover possible paths.

Searches

To get to introducing the searches, we will first discuss the basis of them. Those are the depth first traversal and the breadth first traversal. We will outline the premise of these traversals and then discuss how we can modify their algorithms for various tasks, such as path searches.

We can perform these traversals on any type of graph. Conceptually, it will help to have a tree-like structure in mind to differentiate between depth first and breadth first.

Breadth VS Depth Breadth VS Depth

Depth First

YouTube Video

Video Slides

First we will discuss Depth First Traversal. We can define the depth first traversal in two ways, iteratively or recursively. For this course, we will define it iteratively.

In the iterative algorithm, we will initialize an empty stack and an empty set. The stack will determine which node we search next and the set will track which nodes we have already searched.

Info

Recall that a stack is a ‘Last In First Out’ (LIFO) structure. Based on this, the depth first traversal will traverse a nodes descendants before its siblings.

To do the traversal, we must pick a starting node; this can be an arbitrary node in our graph. If we were doing the traversal on a tree, we would typically select the root at a starting point. We start a while loop to go through the stack which we will be pushing and popping from. We get the top element of the stack, if the node has not been visited yet then we will add it to the set to note that we have now visited it. Then we get the neighbors of the node and put them onto the stack and continue the process until the stack is empty.

DFS Example GIF DFS Example GIF

function DEPTHFIRST(GRAPH,SRC)
    STACK = empty array
    DISCOVERED = empty set
    append SRC to STACK
    while STACK is not empty
        CURR = top of the stack
        if CURR not in DISCOVERED
            add CURR to DISCOVERED
            NEIGHS = neighbors of CURR
            for EDGE in NEIGHS
                NODE = first entry in EDGE
                append NODE to STACK

Since the order of the neighbors is not guaranteed, the traversal on the same graph with the same starting node can find nodes in different orders.

Breadth First

YouTube Video

Video Slides

We can also perform a breadth first traversal either iteratively or recursively. As with the depth first traversal, we will define it iteratively.

In the iterative algorithm, we initialize an empty queue and an empty set. Like depth first traversal, the set will track which nodes we have discovered. We now use a queue to track which node we will search next.

Info

Recall that a queue is a ‘First In First Out’ (FIFO) structure. Based on this, the breadth first traversal will traverse a nodes siblings before its descendants.

Again, we must pick a starting node; this can be an arbitrary node in our graph. We add the starting node to our queue and the set of discovered nodes. We start a while loop to go through the queue which we will be enqueue and dequeue from. We get the first element of the queue, then get the neighbors of the current node. We loop through each edge adding the neighbor to the discovered set and the queue if it has not already been discovered. We continue this process until the queue is empty.

BFS Example GIF BFS Example GIF

function BREADTHFIRST(GRAPH,SRC)
    QUEUE = empty queue
    DISCOVERED = empty set
    add SRC to DISCOVERED
    add SRC to QUEUE
    while QUEUE is not empty
        CURR = first element in QUEUE
        NEIGHS = neighbors of CURR
        for EDGE in NEIGHS
            NODE = first entry in EDGE
            if NODE is not in DISCOVERED
                add NODE to DISCOVERED
                append NODE to QUEUE
Info

It is important to remember in these implementations that a stack is used for depth first and a queue is used for a breadth first. The stack, being a LIFO structure, will proceed with the newest entry which will put us farther away from the source. The queue, being a FIFO structure, will proceed with oldest entry which will focus the algorithm more on the adjacent nodes. If we were to use say a queue for a depth first search, we would be traversing neighbors before descendants.

Limitations

YouTube Video

When introducing graphs, we discussed how the components of a graph didn’t have to all be connected. If our goal is to visit each node, like in the searches, then we will need to perform the search from every node.

For example, the graph below has two separate components. Lets walk through which nodes we will discover by calling the traversals from each node.

Disconnected Graph Disconnected Graph

Start Visited (in alphabetical order)
A {A, D, H}
B {B, E, H, I}
C {C}
D {D}
E {E, H, I}
F {C, F}
G {C, G}
H {H}
I {I}
J {C, F, G, J}

In this example, we would need to call either traversal on nodes A, B and J in order to visit all of the nodes.

Finding a Path

YouTube Video

An important application for these traversals is the ability to find a path between two nodes. This has many applications in railroad networks as well as electrical wiring. With some modifications to the traversals, we can determine if electricity can flow from a source to a target. We will modify depth first and breadth first traversals in similar ways.

Info

There are three cases that can happen when we search for a path between nodes:

  • No Path: will return nothing
  • One Path: will return the path
  • Multiple Paths: will return A path

With these searches, we are not guaranteed to return the same path if there are multiple paths.

We will call these Depth First Search (DFS) and Breadth First Search (BFS). In both traversals, we have added the following extra lines: 4, 9-16, and 22 through the end.

First, we have the addition of PARENT_MAP which will be a dictionary to keep track of how we get from one node to another. We will use the convention of having the key be the child and the value be the parent. While we use the terms child and parent, this is not exclusive to trees.

The ending portion starting at line 22, will add entries to our dictionary. If we haven’t already found an edge to NODE, then we will add the edge that we are currently on.

The other addition is the block of code from line 9 to 16. We will enter this if block if the node that we are currently at is the target. This means that we have finally found a path from the source node to the target node. The process in this segment of code will backtrack through the path and build the path.

Depth First Search (DFS)

1. function DEPTHFIRSTSEARCH(GRAPH,SRC,TAR)
2.     STACK = empty array
3.     DISCOVERED = empty set
4.     PARENT_MAP = empty dictionary
5.     append SRC to STACK
6.     while STACK is not empty
7.         CURR = top of the stack
8.         if CURR not in DISCOVERED
9.             if CURR is TAR
10.                 PATH = empty array
11.                 TRACE = TAR
12.                 while TRACE is not SRC
13.                     append TRACE to PATH
14.                     set TRACE equal to PARENT_MAP[TRACE]
15.                 reverse the order of PATH
16.                 return PATH
17.            add CURR to DISCOVERED
18.            NEIGHS = neighbors of CURR
19.            for EDGE in NEIGHS
20.                NODE = first entry in EDGE
21.                append NODE to STACK
22.                if PARENT_MAP does not have key NODE
23.                    in the PARENT_MAP dictionary set key NODE with value CURR
24.    return nothing

DFS Example

DFS Example GIF DFS Example GIF

Breadth First Search (BFS)

1. function BREADTHFIRSTSEARCH(GRAPH,SRC,TAR)
2.     QUEUE = empty queue
3.     DISCOVERED = empty set
4.     PARENT_MAP = empty dictionary
5.     add SRC to DISCOVERED
6.     add SRC to QUEUE
7.     while QUEUE is not empty
8.         CURR = first element in QUEUE
9.         if CURR is TAR 
10.            PATH = empty list 
11.            TRACE = TAR
12.            while TRACE is not SRC
13.                    append TRACE to PATH
14.                    set TRACE equal to PARENT_MAP[TRACE]
15.                reverse the order of PATH
16.                return PATH
17.        NEIGHS = neighbors of CURR
18.        for EDGE in NEIGHS
19.            NODE = first entry in EDGE
20.            if NODE is not in DISCOVERED
21.                add NODE to DISCOVERED
22.                if PARENT_MAP does not have key NODE
23.                    in the PARENT_MAP dictionary set key NODE with value CURR
24.                append NODE to QUEUE
25.    return nothing

BFS Example

BFS Example GIF BFS Example GIF

In Practice

Traveling

Finding a path in a graph is a very common application in many fields. One application that we benefit from in our day to day lives is traveling. Programs like Google Maps calculate various paths from point A to point B.

Google Map Google Map ^[google.com/maps]

In the context of graph data structures, we can think of each intersection as a node and each road as an edge. Google Maps, however, tracks more features of edges than we have discussed. Not only do they track the distance between intersections, they also track time, tolls, construction, road surface and much more. In the next module, we will discuss more details of how we can find the shortest path.

Map Coloring

Another application of the general searches is coloring maps. The premise is that we don’t want two adjacent territories to have the coloring. These territories could be states, like in the United States map below, counties, provinces, countries, and much more.

US Map US Map ^[https://commons.wikimedia.org/wiki/File:Map_of_USA_showing_state_names.png]

The following was generated for this course using the breadth first search and MyMatrixGraph class that we have implemented in this course. To create the visual rendering, the Python library NetworkX^[https://networkx.github.io/] was used. In this rendering, the starting node was Utah. If we were to start from say Alabama or Florida, we would not have a valid four coloring scheme once we got to Nevada. Since Hawaii and Alaska have no land border with any of the states, they can be any color.

Color Generation Color Generation

Welcome!

This page is the main page for Graphs: Minimum Spanning Trees

Subsections of Graphs: Minimum Spanning Trees

Introduction

YouTube Video

We will continue to work with graph algorithms in this module, specifically with finding minimum spanning trees (MST). MSTs have many real world applications such as:

  • Electrical wiring,
  • Distribution networks,
  • Telecommunication networks, and
  • Network routing

Suppose we were building an apartment complex and wanted to determine the most cost-effective wiring schema. Below, we have the possible construction costs for wiring apartment to apartment. Wiring vertically adjacent apartments is cheaper than wiring horizontally adjacent units and those closest to the power closet have lower costs as well. Possible Wiring Possible Wiring

To find the best possible solution, we would find the MST. The final wiring schema may look something like the figure below. Possible Wiring Possible Wiring

Determining a MST can result in lower costs and time used in many applications, especially logistics. To properly define a minimum spanning tree, we will first introduce the concept of a spanning tree.

Spanning Trees

A spanning tree for a graph is a subset of the graphs edges such that each node is visited once, no cycles are present, and there are no disconnected components.

Let’s look at this graph as an example. We have five nodes and seven edges.

Graph Graph

Below, we have valid examples of spanning trees. In each of the examples, we visit each node and there are no cycles. Recall that a cycle is a path in which the starting node and ending node are the same.

Spanning Trees Spanning Trees

Info

To be a spanning tree of a graph, it must:

  • span the graph, meaning all nodes must be visited, and
  • be a tree, meaning there are no cycles and no disconnected components.

Further, we can imagine selecting a node in a spanning tree as the root and letting gravity take effect. This gives us a visual motivation as to why they are called spanning trees. In these examples, we have selected node A for the root for each of the spanning trees above.

Spanning Trees as Trees Spanning Trees as Trees

Counterexamples

Below, we have invalid examples of spanning trees. In the left column, the examples are where all of the nodes are not connected in the same component. In the right column, the examples contain cycles. For example in the top right, we have the cycle B->C->D->E->B

Not Spanning Trees Not Spanning Trees

Minimum Spanning Trees

YouTube Video

Now that we have an understanding of general spanning trees, we will introduce the concept of minimum spanning trees. First let’s introduce the concept of the cost of a tree.

The cost that is associated with a tree, is the sum of its edges weights. Let’s look at this spanning tree which is from the previous page. The cost associated with this spanning tree is: 2+6+10+14=32. Graph Graph

Minimum Spanning Trees (MST)

A minimum spanning tree is a spanning tree that has the smallest cost. Recall the graph from the previous page.

Graph Graph

Below on the left is a minimum spanning tree for the graph above. On the right is an example of a spanning tree, though it does not have the minimum cost.

Minimum Spanning Tree Minimum Spanning Tree

In this small example, it is rather straightforward to find the minimum spanning tree. We can use a bit of trial and error to determine if we have the minimum spanning tree or not. However, once the graphs start to get more nodes and more edges it quickly becomes more complicated.

Graph Graph

There are two algorithms that we will introduce to give us a methodical way of finding the minimum spanning tree. The first that we will look at is Kruskal’s algorithm and then we will look at Prim’s algorithm.

Kruskal

YouTube Video

As graphs get larger, it is important to go about finding the MST in a methodical way. In the mid 1950’s, there was a desire to form an algorithmic approach for solving the ’traveling salesperson’ problem^[We will describe this problem in a future section of this module]. Joseph Kruskal first published this algorithm in 1956 in the Proceedings of the American Mathematical Society^[https://www.ams.org/journals/proc/1956-007-01/S0002-9939-1956-0078686-7/S0002-9939-1956-0078686-7.pdf]. The algorithms prior to this were, as Kruskal said, “unnecessarily elaborate” thus the need for a more succinct algorithm arose.

Algorithm

In his original work, Kruskal outlined three different yet similar algorithms to finding a minimum spanning tree. The Kruskal Algorithm that we use is as follows:

  1. Start with only the nodes of the graph and an empty set for the edges
  2. Order the edges based on weight
  3. Make each node their own set
  4. Go through the edges in ascending order
  5. If nodes u and v are connected by the edge and they are not in the same set yet, then join the two sets and add the edge to your set of edges

Starting Graph

Kruskal Example Start Kruskal Example Start

Kruskal Example GIF Kruskal Example GIF

Resulting MST

Kruskal Example Result Kruskal Example Result

Pseudocode

function KRUSKAL(GRAPH)
    MST = GRAPH without the edges attribute(s)
    ALLSETS = an empty list which will contain the sets
    for NODE in GRAPH NODES
        SET = a set with element NODE
        add SET to ALLSETS
    EDGES = list of GRAPH's edges
    SORTEDEDGES = EDGES sorted by edge weight, smallest to largest
    for EDGE in SORTEDEDGES
        SRC = source node of EDGE
        TAR = target node of EDGE
        SRCSET = the set from SETS in which SRC is contained
        TARSET = the set form SETS in which TAR is contained
        if SRCSET not equal TARSET
            UNIONSET = SRCSET union TARSET
            add UNIONSET to ALLSETS
            remove SRCSET from ALLSETS
            remove TARSET from ALLSETS
            add EDGE to MST as undirected edge
    return MST

Prim

YouTube Video

The history of Prim's Algorithm is not as straight forward as Kruskal’s. While we often call it Prim's Algorithm, it was originally developed in 1930 by Vojtěch Jarník. Robert Prim later rediscovered and republished this algorithm in 1957, one year after Kruskals. To add to the naming confusion, Edsger Dijkstra also published this work again in 1959. Because of this, the algorithm can go by many names: Jarkík's Algorithm, Jarník-Prim's Algorithm, Prim-Dijkstra's Algorithm, and DJP Algorithm.

Prim cited “large-scale communication” as the motivation for this algorithm, specifically the “Bell System leased-line”^[R.C. Prim, May 8, 1957 Shortest Connection Networks And Some Generalizations https://archive.org/details/bstj36-6-1389]. Leased lines were used primarily in a commercial setting which connected business offices that were geographically distant (IE in different cities or even states). Companies would want all offices to be connected but wanted to avoid having to lay an excessive amount of wire. Below is a figure which Prim used to motivate the need for the algorithm. This image depicts the minimum spanning tree which connect each of the US continental state capitals along with Washington D.C.

Prim’s Motivation Prim’s Motivation

Algorithm

The basis of the algorithm is to start with only the nodes of the graph, then we do the following

  1. Choose a random node
  2. Grow your tree by one edge, selecting the smallest edge to connect to a node that is not yet in the tree. Repeat until all the nodes have been visited

Starting Graph

Prim Example Start Prim Example Start

Prim Example GIF Prim Example GIF

Resulting MST

Prim Example Result Prim Example Result

Info

Uniqueness

You may have noticed that the minimum spanning tree that resulted from Kruskal’s algorithm differed from Prim’s algorithm. We have displaying them both below for reference.

Kruskal Prim
Kruskal Example Result Kruskal Example Result Prim Example Result Prim Example Result

While these are different, they are both valid. The trees both have cost 16. The MST of a graph will be unique, meaning there is only one, if none of the edges of the graph have the same weight.

Pseudocode

function PRIM(GRAPH, START)
    MST = GRAPH without the edges attribute(s)
    VISITED = empty set
    add START to VISITED
    AVAILEDGES = list of edges where START is the source
    sort AVAILEDGES
    while VISITED is not all of the nodes
        SMLEDGE = smallest edge in AVAILEDGES
        SRC = source of SMLEDGE
        TAR = target of SMLEDGE
        if TAR not in VISITED
            add SMLEDGE to MST as undirected edge
            add TAR to VISITED
            add the edges where TAR is the source to AVAILEDGES
        remove SMLEDGE from AVAILEDGES
        sort AVAILEDGES
    return MST

Traveling Salesperson

YouTube Video

YouTube Video

While we won’t outline algorithms suited for solving the traveling salesperson problem (TSP), we will outline the premise of the problem. This problem was first posed in 1832, almost a two centuries ago, and is still quite prevalent. It is applicable to traveling routes, distribution networks, computer architecture and much more. The TSP is a seminal problem that has motivated many research breakthroughs, including Kruskals algorithm!

The motivation of the TSP is this: given a set of locations, what is the shortest path such that we can visit each location and end back where started?

Suppose we wanted to take a roadtrip with friends to every state capital in the continental US as well as Washington D.C. To save money and time, we would want to minimize the distance that we travel. Since we are taking a roadtrip, we would want to avoid frivolous driving. For example, if we start in Sacremento, CA we would not want to end the trip in Boston, MA. The trip should start and end at the same location for efficiency.

The figure below shows the shortest trip that visits each state capital and Washington D.C. once. In this example, we can start where ever we like and will end up where we started. Visit each state capital Visit each state capital ^[PatriciaNeri, August 2018 https://communities.sas.com/t5/SAS-Communities-Library/What-is-the-shortest-tour-that-visits-only-once-the-48/ta-p/490231]

In this problem, it is easy to get overwhelmed by all of the possibilities. Since there are 49 cities to visit, there are over 6.2*10^60 possibilities. For reference, 10^12 is equivalent to one trillion! Thus, we need an algorithmic approach to solve this problem as opposed to a brute force method.

Iv-priority-queues

Subsections of Iv-priority-queues

Welcome!

This page is the main page for Heaps and Priority Queues

Subsections of Heaps and Priority Queues

Introduction

YouTube Video

The next data structure we will cover is heaps. The heaps discussed in this course are not to be confused with heaps which refer to garbage collection in certain coding languages. Heaps are good for situations were we will need to frequently access and update the highest (or lowest) priority item in a set. For example, heaps are a good data structure to use in Prim’s algorithm. In Prim’s algorithm, we repeatedly got the smallest edge, removed the smallest edge, and then added to and sorted the list of edges.

Heap Properties

A heap is an array which we can view as an unsorted binary tree. This tree must have the following properties:

  1. Each node has at most two children.
  2. If there are nodes in level i of the tree, then level i-1 is full. Below we have an example of how this property has been broken. Level two is not full but there are nodes on level three.
  3. The nodes of the last level are as far left as possible. Below we have an example of how this property has been broken.

Info

As a consequence of the above properties, the following is true as well: Only one node can have one child, all other nodes will have zero or two children. Try to construct a counterexample to see what we mean!

Types of Heaps

There are two main types of heaps, the max-heap and the min-heap. Depending on the element we want to access we may use one or the other.

A max-heap is a heap such that the parent node is greater than or equal to the children. For example, if we are using a heap to track work flow,we would want to use a max-heap. In this case, the highest priority element will always be the root of the tree.

A min-heap is a heap such that the parent node is less than or equal to the children. This is the opposite of the max-heap. The root of this heap will be the item with the lowest priority. A min-heap may feel unnatural at first, however, this is ideal for greedy algorithms such as Prim’s algorithm. We are frequently getting the smallest edge.

Node Relationships

YouTube Video

Heaps can be viewed in two forms: as a tree or as an array. We will use the array style in code but we can have the tree structure in the back of our mind to help understand the order of the data. Here is an example of the heap as a tree on the left and the heap as an array on the right. The root of the heap will always be the first element. Then we can base the numbering of the following nodes from left to right and top to bottom. For example, the left child of the root will be the second entry and the right child will be the third.

Accessing Indexes

For full functionality of our heap, we want to be able to easily determine the parent of a node as well as the children of a node.

Info

Critical Thinking

Using just the array, how can we determine the parent of a node? In the example above, how can we determine the parent of the node with value 18?

We can formulate the relationships between parent and children nodes mathematically. For a node at index i, we can say that the left child of i will be at index 2i and the right child will be at 2i+1. Similarly, we can say that the parent of node i will be at index floor(i/2).

Info

The function floor(x) like in floor(i/2) will round decimal values down to the next whole number. Some examples:

  • floor(3.2)=3
  • floor(1.9999)=1
  • floor(4)=4

Node Parent Left Child Right Child
i floor(i/2) 2i 2i + 1
1 N/A 2*1=2 2*1+1=3
2 floor(2/2)=1 2*2=4 2*2+1=5
3 floor(3/2)=1 2*3=6 2*3+1=7
4 floor(4/2)=2 2*4=8 2*4+1=9
5 floor(5/2)=2 2*5=10 2*5+1=11
Try it!

Consider the following example and try to work some out for yourself.

For example, if we ask for the parent of the node with value 27, our answer would be the node with value 35 The node with value 27 has index 5. Thus, the parent of that node will have index floor(5/2)=2. Node 35 is at index two, as such, node 35 is the parent of node 27.

Left child of the node with value 24:The node with value three. Node 24 has index 4, so the left child will be at index 2*4=8. That corresponds to the node with value 3.
Right child of the node with value 44:The node with value twelve. Node 44 has index 3, so the right child will be at index (2*3)+1=7. That corresponds to the node with value 12.
Parent of the node with value 36:The node with value forty-four. Node 36 has index 6, so the parent will be at index floor(6/2)=3. That corresponds to the node with value 44.
Info

In a heap where n is the size of heap, the elements floor(n/2)+1 through n will always be leaves. If we assume that we have just the 12 elements in this example, then based on this formula, elements 7 through 12 must be leaves. We can verify this in the tree representation!

Priority Queues

YouTube Video

A natural implementation of heaps is priority queues.

A priority queue is a data structure which contains elements and each element has an associated key value. The key for an element corresponds to its importance. In real world applications, these can be used for prioritizing work tickets, emails, and much more.

We can use a heap to organize this data for us.

As with heaps, we can have min-priority queues and max-priority queues. For the applications listed above, a max-priority queue is the most intuitive choice. For this course however, we will focus more on min-priority queues which will give us better functionality for greedy algorithms, like Prim’s algorithm.

Prim’s Revisited

For the minimum spanning tree algorithms, using a min-priority queue helps the performance of the algorithms. Recall Prim’s algorithm, shown below. Each time we visited a new node, we would add the outgoing edges to the list of available edges, remove the smallest edge, and sort the list.

function PRIM(GRAPH, SRC)
    MST = GRAPH without the edges attribute(s)
    VISITED = empty set
    add SRC to VISITED
    AVAILEDGES = list of edges where SRC is the source
    sort AVAILEDGES
    while VISITED is not all of the nodes
        SMLEDGE = smallest edge in AVAILEDGES
        SRC = source of SMLEDGE
        TAR = target of SMLEDGE
        if TAR not in VISITED
            add SMLEDGE to MST as undirected edge
            add TAR to VISITED
            add the edges where TAR is the source to AVAILEDGES
        remove SMLEDGE from AVAILEDGES
        sort AVAILEDGES
    return MST

If we implement Prim’s algorithm with min-priority queue, we don’t have to worry about sorting the edges every time we add or remove one.

function PRIM(GRAPH, SRC)
    MST = GRAPH without the edges attribute(s)
    VISITED = empty set
    add SRC to VISITED
    AVAILEDGES = min-PQ of edges where SRC is the source
    while VISITED is not all of the nodes
        SMLEDGE = smallest edge in AVAILEDGES
        SRC = source of SMLEDGE
        TAR = target of SMLEDGE
        if TAR not in VISITED
            add SMLEDGE to MST as undirected edge
            add TAR to VISITED
            add the edges where TAR is the source to AVAILEDGES
        remove SMLEDGE from AVAILEDGES
    return MST

Functionality

YouTube Video

Attributes

The priority queue will have a single attribute which will be the array to represent the priority queue as a heap. Elements of this array will be ordered pairs where the first entry is the priority and the second entry is the node item.

Info

Since arrays start at index zero, we will set the first element equal to a null value and the element at index 1 will be the start of our priority queue.

Size

Similar to graphs, heaps will have a size which will be the number of elements currently in our heap.

Push

The PUSHDOWN and PUSHUP functions will help us to maintain the heap structure. In their own ways, described below, they will correct the ordering of nodes. Primarily making it such that the parent is always smaller then its children. Both of these will be recursive functions.

  • PUSHUP: This function takes an index as input and then determines if the element at that index has a lower priority than its parent. This will ‘raise’ the element up through heap and move it closer to the front of the array.
    • Base Case: There are two base cases for PUSHUP. The first is if the element has greater priority than its parent, then we do nothing as the structure is correct. The second is if the index of the parent node is 0, then we do nothing as we have reached the root of the heap.
    • Recursive Case: When we are not at the root and the child has lower priority than the parent, then we will swap the parent element and the child element in the priority queue’s array. Then we will call the PUSHUP function with the parent index as input.
  • PUSHDOWN: Similar to PUSHUP this function will take an index as input but will now determine if the node has higher priority than one of its children. This will ’trickle’ the element down through the heap and move it closer to the end of the array.
    • Base Case: There are two cases for PUSHDOWN. The first is if the the parent has lower priority than both of its children, then we do nothing as the structure is correct. The second is if the index of both children are out of the range of our array, then we do nothing as we have reached a leaf.
    • Recursive Case: When we are not at a leaf and at least one child has lower priority than the parent, then we will recurse. We will break this down into further cases for clarity.
      • Parent has a Right Child: If the priority of the left child is less than the priority of the right child, then we check the priority of the left child compared to the parent. If the left child has lower priority than the parent, then we swap the parent and the left child in the array and recurse on the left child. If the priority of the left child is not less than the priority of the right child, then we check the priority of the right child compared to the parent. If the right child has lower priority than the parent, then we swap the parent and the right child in the array and recurse on the right child.
      • Parent only has a Left Child: If the priority of the left child is less than the parent’s priority, then swap the left child and the parent in the array and then recurse on the left child.

Remove Min

This will remove the lowest priority element from our priority queue. If our priority queue only has one element, then we will just remove the element. However, if there is more than the single element, we will need to do some maintenance to make sure the lowest priority element is the root again. To do this, we will take the last element of the priority queue and make it the root. Then we will use the PUSHDOWN function to reorder the nodes.

Heapify

The HEAPIFY function will allow us to translate our data into a heap. It will take as input a list of priorities and a list of items. Suppose in the figure below, we are calling Prim’s function from node 1. Thus, we want to start our heap with the outgoing edges of node 1. We will input the list of priorities, which in this case are the edge weights, and the list of the items. For this application, we have made the items ordered pairs where the first entry is the source node and the second is the target.

Since we are working primarily with min-priority queues, we will define HEAPIFY in those terms. Though, we could have an equivalent function for a max-priority queue.

For input, the function takes RANKS which is the array representation of our priority values and ITEMS which is the array representation of the items.

function HEAPIFY(RANKS, ITEMS)
    if RANKS and ITEMS are the same size
        SIZE = length of ITEMS
        loop INDEX starting at 1 to SIZE
            I_RANK = value at INDEX in RANKS
            I_ITEM = item at INDEX of ITEMS 
            append (I_RANK, I_ITEM) to priority queues array
        QSIZE = length of our PQ - 1
        LASTPARENT = floor(QSIZE/2) + 1
        loop NODE starting at LASTPARENT down to 1
            PUSHDOWN(NODE)
    else
        error

Insert

The INSERT function is similar to HEAPIFY but now we are just wanting to insert a single element. As such, INSERT will take the priority of the element and the element as input. We will append the ordered pair of (priority, element) to our priority queue array and then call the PUSHUP function on the last element of the array.

Heap Decrease Key

The DECREASEKEY function allows us to update the priority of an element. It takes as input the array index for heap entry to update as well as the priority we will be giving the element. We first check if that the new priority is in fact less than the original. Then we update the priority and then push the element up.

DECREASEKEY(IDX, PRIORITY)
    ELEMENT = entry in array at IDX
    if PRIORITY > ELEMENT[0]
        error
    ELEMENT[0] = PRIORITY 
    PUSHUP(IDX)
   

Dijkstras

YouTube Video

Video Slides

A good application of priority queues is finding the shortest path in a graph. A common algorithm for this is Dijkstra’s algorithm.

Edsger Dijkstra was a Dutch computer scientist who researched many fields. He is credited for his work in physics, programming, software engineering, and as a systems scientist. His motivation for this algorithm in particular was to be able to find the shortest path between two cities.

Info

“What is the shortest way to travel from Rotterdam to Groningen, in general: from given city to given city? It is the algorithm for the shortest path, which I designed in about twenty minutes. One morning I was shopping in Amsterdam with my young fiancée, and tired, we sat down on the café terrace to drink a cup of coffee and I was just thinking about whether I could do this, and I then designed the algorithm for the shortest path.” - Edsger Dijkstra, Communications of the ACM 53 (8), 2001.

His original algorithm was defined for a path between two specific cities. Since its publication, modifications have been made to the algorithm to find the shortest path to every node given a source node.

DIJKSTRAS(GRAPH, SRC)
    SIZE = size of GRAPH
    DISTS = array with length equal to SIZE
    PREVIOUS = array with length equal to SIZE
    set all of the entries in PREVIOUS to none
    set all of the entries in DISTS to infinity 

    DISTS[SRC] = 0 
    PQ = min-priority queue

    loop IDX starting at 0 up to SIZE
        insert (DISTS[IDX],IDX) into PQ

    while PQ is not empty
        MIN = REMOVE-MIN from PQ
        for NODE in neighbors of MIN
            WEIGHT = graph weight between MIN and NODE
            CALC = DISTS[MIN] + WEIGHT
            if CALC < DISTS[NODE]
                DISTS[NODE] = CALC
                PREVIOUS[NODE] = MIN
                PQIDX = index of NODE in PQ
                PQ decrease-key (PQIDX, CALC)
    return DISTS and PREVIOUS

Animated demo of Dijkstras Algorithm Animated demo of Dijkstras Algorithm ^[Shiyu Ji, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0 , via Wikimedia Commons, https://upload.wikimedia.org/wikipedia/commons/e/e4/DijkstraDemo.gif]

Aside from just finding routes for us to travel, Dijkstra’s algorithm can accommodate for any application that can have an abstraction to finding the shortest path. For example, the following animation shows how a robot could utilize Dijkstra’s algorithm to find the shortest path with an obstacle in the way. In this example, each node could represent one square foot of floor space and the edges would represent those spaces that are adjacent. In this scenario, we would most likely not have an associated edge weight. If the robot were traversing on a rugged terrain, then we could have the weights represent the difficultly of passing through the terrain from one space to the other. Robot Path Mapping Robot Path Mapping ^[Subh83, CC BY 3.0 https://creativecommons.org/licenses/by/3.0 , via Wikimedia Commons, https://commons.wikimedia.org/wiki/File:Dijkstras_progress_animation.gif]

Another practical abstraction is in network routing. In this simplified abstraction, nodes would be routers or switches and the edges would be the physical links between them. The edge weights in this case would be the cost of sending a packet from one router to the next. Dijkstra’s algorithm is actively used in protocols such as Intermediate System to Intermediate System (IS-IS) and Open Shortest Path First (OSPF).

V-requirements-analyses

Subsections of V-requirements-analyses

Welcome!

This page is the main page for Performance

Subsections of Performance

Module Outline

YouTube Video

In this module, we will reintroduce the data structures that we have seen and implemented throughout CC315 as well as CC310. We will discuss the running time for various operations as well as space requirements for each structure.

You may recall that in CC310, we did a similar comparison. We will use most of the same operations from that module so we can draw comparisons between the structures in CC310 and CC315.

  • Insert – inserting a specific element into the structure, either in sorted order, or at the appropriate place as required by the definition of the data structure.
  • Access – accessing a desired element. For general data structures, this is the process of accessing an element by its index or key in the structure. For stacks and queues, this was the process of accessing the next element to be returned.
  • Find – this is the process of finding a specific element in the data structure, usually by iterating through it or using a search method if the structure is ordered in some way.
  • Delete – this is the process of deleting a specific element in the case of a general-purpose structure or deleting and returning the next element in the case of a stack or queue. For more advanced structures, this may also require the structure to be reorganized a bit.

We will also also discuss the memory required for each structure.

Trees

YouTube Video

There are three types of trees to consider: generic trees, tries, and binary trees.

General Trees

Tree Tree

  • Insert: In general, inserting an element into a tree by making it a child of an existing tree is a constant time operation. However, this depends on us already having access to the parent tree we’d like to add the item to, which may require searching for that element in the tree as discussed below, which runs in linear time based on the size of the tree. As we saw in our projects, one way we can simplify this is using a hash table to help us keep track of the tree nodes while we build the tree, so that we can more quickly access a desired tree (in near constant time) and add a new child to that tree.

  • Access: Again, this is a bit complex, as it requires us to make some assumptions about what information is available to us. Typically, we can only access the root element of a tree, which can be done in constant time. If we’d like to access any of the other nodes in the tree, we’ll have to traverse the tree to find them, either by searching for them if we don’t know the path to the item (which is linear time based on the size of the tree), or directly walking the tree in the case that we do know the path (which is comparable to the length of the path, or linear based on the height of the tree).

  • Find: Finding a particular node in a generic tree is a linear time operation based on the number of nodes in the tree. We simply must do a full traversal of the tree until we find the element we are searching for, typically either by performing a preorder or postorder traversal. This is very similar to simply looking at each element in a linear data structure.

  • Delete: Removing a child from an existing tree is an interesting operation. If we’ve already located the tree node that is the parent of the element we’d like to remove, and we know which child we’d like to remove, then the operation would be a linear time operation related to the number of children of that node. This is because we may have to iterate through the children of that tree node in order to find the one we’d like to remove. Recall that trees generally store the children of each node in a linked list or an array, so the time it takes to remove a child is similar to the time it takes to remove an element from those structures. Again, if we have to search for either the parent or the tree node we’d like to remove from that parent, we’ll have to take that into account.

  • Memory: In terms of memory usage, a generic tree uses memory that is on the order of the number of nodes in the tree. So, as the number of nodes in the tree is doubled, the amount of memory it uses doubles as well.

Tries

Trie 1 Trie 1

Tries improve on the structure of trees in one important way - instead of using a generic list to store each child, they typically use a statically sized array, where each index in the array directly corresponds to one of the possible children that the node could have. In the case of a trie that represents words, each tree node may have an array of 26 possible children, one for each letter of the alphabet. In our implementation, we chose to use lists instead to simplify things a bit, but for this analysis we’ll assume that the children of a trie node can be accessed in constant time through the use of these static arrays.

In the analysis below, we’ll assume that we are dealing with words in a trie, not individual nodes. We’re also going to be always starting with the root node of the overall trie data structure.

  • Insert: To insert a new word in a trie, the process will run in linear time based on the length of the word. So, to insert a 10 character word in a trie, it should take on the order of 10 steps. This is because we can find each individual node for each character in constant time since we are using arrays as described above, and we’ll need to do that once for each character, so the overall time is linear based on the number of characters. If the path doesn’t exist, we’ll have to create new tree nodes to fill it in, but if the path does exist, it could be as simple as setting the boolean value at the correct node to indicate that it is indeed a word.

  • Access: Similarly to determine if a particular word is in the trie, it will also run in linear time based on the length of the word. We simply must traverse through the path in the tree, and at each node it is a constant time operation to find the correct child and continue onward. So, in total this is a linear time operation.

  • Find: Find is pretty much the same as access, since we can’t just directly jump to a particular node in the tree. So, this is also in linear time based on the length of the word.

  • Delete: Once again, deleting a word simply involves finding it, which runs in linear time. Once it is deleted, we may go through and prune branches that are unused, which is also a linear time operation as we work back upwards in the trie. So, overall the whole operation runs in linear time based on the length of the word.

In summary, pretty much every operation performed on a trie is related to the length of the word that is provided as input. Pretty nifty!

  • Memory: Analyzing the memory usage of a trie is a bit more interesting. Traditionally, we say that the memory consumed by a trie is on the order of $n * m$, where $n$ is the number of words stored in the trie, and $m$ is the length of the longest word in the trie. We come to this value because there could be $n$ words that do not share any prefixes, and if they are all the same length $n$ then we’ll need that many nodes to represent the whole trie. However, in practice the actual memory usage may be somewhat less because of the shared prefixes and shorter words being part of longer words, but remember that we are concerned with the worst case memory usage for a trie, which is $n * m$.

Binary Trees

Tree 1 Tree 1

A binary tree is a tree that is limited to having only two children of each node, and the children and parents are sorted such that every element to the left of the node is less than or equal to its value, and all elements to the right of the node are greater than or equal to its value. Because of this, they perform a little differently than traditional trees.

One major concept related to binary trees is whether the tree is “balanced” or not. A balanced tree has children that differ in height by no more than 1. If the tree is perfectly balanced, meaning all subtrees are balanced, then there are some interesting relationships that develop.

Most notably, the overall height $h$ of the tree is no larger than $log_2(n)$, where $n$ is the number of elements in the tree.

We didn’t cover the algorithms to balance a binary tree in this course since they can be a bit complex, but for this analysis we’ll discuss the performance of binary trees when they are perfectly balanced as well as when they are unbalanced.

  • Insert: To insert a new element in a binary tree, we may have to descend all the way to the bottom of the tree, so the operation is linear based on the height of the tree. If the tree is a perfectly balanced binary tree, then it is on the order of $log_2(n)$. Otherwise, it is linear based on the height, which in the worst case of a completely unbalanced tree could be the number of nodes itself. Once you’ve added the new item, a perfectly balance tree may need to rebalance itself, but that operation is not any more costly than the insert operation.

  • Access: Similarly, to access a particular element in a binary tree, we’ll have to descend through the tree until we find the element we are looking for. At each level, we know exactly which child to check, so it is once again related to the height of the tree. If the tree is a perfectly balanced binary tree, then it is on the order of $log_2(n)$. Otherwise, it is linear based on the height, which in the worst case of a completely unbalanced tree could be the number of nodes itself.

  • Find: Once again, find in this instance is similar to access.

  • Delete: Finally, deleting an element from a binary tree involves finding the element, which will be linear based on the height of the tree. Once the element is removed, then a perfectly balanced tree will need to rebalance itself, which could also take the same amount of time. So, in both cases, it runs on the order of $h$ time, which in the worst case is the total number of nodes $n$ on an unbalanced tree, or $log_2(n)$ on a perfectly balanced tree.

So, most operations involving a perfectly balanced binary tree run in $log_2(n)$ time, which is very efficient when compared to a generic tree. However, if the tree is not balanced, then we cannot make any assumptions about the height of the tree and each operation could require $n$ time, where $n$ is the number of nodes in the tree.

  • Memory: The memory usage of a binary tree is directly correlated with the number of nodes, so we typically say that it is on the order of $n$ memory usage.

Graphs

YouTube Video

There are two types of graphs that we’ve covered in this class: list graphs and matrix graphs. Graphs are slightly different than other data structures, because we may want to find or access both nodes and edges in the graph. So, we’ll need to analyze the performance of graphs with respect to both nodes and edges.

Matrix Graph

Matrix Representation Example 2 Matrix Representation Example 2 Matrix Representation Example 3 Matrix Representation Example 3

Recall that a matrix graph uses an array to store the nodes, and a two-dimensional array to store the edges.

  • Insert Node: Inserting a node is a linear time operation. To insert node, we looped through the nodes attribute and put the node in the first open index. Thus, it is linear with respect to the number of nodes.

  • Access Node: Likewise, given the index of a node, we can get it’s value in constant time by accessing the array.

  • Find Node: To find the index of a node when we are given its value, we must iterate through the array of nodes, which will be a linear time operation based on the number of nodes in the graph.

  • Delete Node: Finally, to remove a node from a graph we can simply set its value in the array of nodes to null. However, we may also need to go through the list of edges and make sure there are no edges to or from that node, so typically this operation runs on the order of the number of nodes in the graph since we must check each one.

For the operations relating to edges below, we’ll assume that we already know the indices of the two nodes we are connecting. If we don’t know those, we’ll have to use the find node process above first.

  • Insert Edge: To insert an edge into the graph we simply update the element in the two-dimensional array, which can be done in constant time.

  • Access Edge: Likewise, to access an edge between two nodes, we simply access the element in the two-dimensional array, which is a constant time operation.

  • Find Neighbors: Instead of finding a particular edge, we’ll say that we want to find all of the neighboring nodes that can be accessed from a given node. In this case, we’ll need to iterate through one row of the two-dimensional array, so the whole process runs on the order of linear time based on the number of nodes in the graph.

  • Delete Edge: To remove an edge, we simply find it in the two-dimensional array and set its value to infinity, which can be done in constant time.

So, for most matrix graph operations, we can do nearly everything in either constant time or, at worst, linear time based on the number of nodes in the graph.

  • Memory: Here is where things get interesting. Generally, we say that a matrix graph consumes $n^2$ memory, where $n$ is the number of nodes in the graph. This is because the two-dimensional array of edges is the vast majority of the memory consumption in a matrix graph. So, if the number of nodes is doubled, the memory usage will quadruple.

List Graph

Example 1 Example 1 List Representation for Example 1 List Representation for Example 1

Recall that a list graph uses an array to store the nodes, and then each node stores a list of edges that start from that node.

  • Insert Node: Inserting a node is a linear time operation. To insert node, we looped through the nodes attribute and put the node in the first open index. Thus, it is linear with respect to the number of nodes.

  • Access Node: Likewise, given the index of a node, we can get it’s value in constant time by accessing the array.

  • Find Node: To find the index of a node when we are given its value, we must iterate through the array of nodes, which will be a linear time operation based on the number of nodes in the graph.

  • Delete Node: Finally, to remove a node from a graph we can simply set its value in the array of nodes to null. However, we may also need to go through each other node and check to make sure it isn’t in the list of edges. So typically this operation runs on the order of the number of nodes in the graph since we must check each one.

So far, a list graph seems to be pretty similar to a matrix graph in terms of performance. The real difference comes with how we handle edges, as well see next.

For the operations relating to edges below, we’ll assume that we already know the indices of the two nodes we are connecting. If we don’t know those, we’ll have to use the find node process above first.

  • Insert Edge: To insert an edge into the graph, we must get the source node from the nodes array and then add an element to the list of edges. Assuming that the edges are stored in a linked list, this is a linear time operation in terms of the number of nodes since we may have to iterate through the list of edges to make sure this edge doesn’t already exist and need updated. In the worst case, there may be $n$ edges here, so it is a linear operation.

  • Access Edge: To access an edge between two nodes, we first find the source node in the list of nodes, which is a constant time operation. Then, we’ll have to iterate through the list of edges, which is at worst linear time based on the size of the graph, since there could be $n$ outgoing edges from this node. So, overall the operation runs on the order of linear time based on the number of nodes in the graph.

  • Find Neighbors: Instead of finding a particular edge, we’ll say that we want to find all of the neighboring nodes that can be accessed from a given node. In this case, we can just find the source node in the array of nodes, which is a constant time operation. Then, we can simply return the list of edges, which is also constant time. So, this operation is very efficient!

  • Delete Edge: To remove an edge, we find the source node and iterate through the list of edges until we find the one to remove. So, this runs in linear time based on the number of nodes in the graph.

So, for most list graph operations, we can also do nearly everything in either constant time or, at worst, linear time based on the number of nodes in the graph. The only real difference comes in how we handle edges, where some operations are a bit slower, but getting a list of all the neighbors of a node is actually a little quicker!

  • Memory: For a list graph, we typically say the memory usage is on the order of $n + e$, where $n$ is the number of nodes in the graph and $e$ is the number of edges in the graph.
Tip

Dense vs. Sparse Graphs

Let’s analyze the memory usage of matrix and list graphs when dealing with dense and sparse graphs. This is the real key difference between the two data structures.

Dense Graph

A dense graph is a graph that has a large number of edges compared to the maximum number of edges possible. More specifically, the maximum number of edges a graph can have is $n^2$, so we would say a dense graph has a value for $e$ that is close to $n^2$. Because of this, the memory usage of a matrix graph ($n^2$) is actually a bit more efficient than a list graph ($n + n^2$) because it doesn’t have the extra overhead of maintaining a list structure for each node.

Sparse Graph

A sparse graph is a graph that has a small number of edges compared to the maximum number of edges possible. So, here we would say that the value of $e$ is much smaller than $n^2$, though it may still be larger than $n$ (otherwise each node would only have one edge coming from it, and this would be a linked list). In that case, we see that $n + e$ is much smaller than $n^2$, and a list graph is much more efficient. If you think about it, in a matrix graph a large number of the entries in the two-dimensional array would be set to infinity and unused, but they still take up memory. Those unused edges wouldn’t exist in a list graph, making it much more memory efficient.

Priority Queues

YouTube Video

Let’s look at the performance of priority queues next. These structures are based on a heap, which has some unique characteristics related to the heap properties that it must maintain.

Recall that a heap is an array which we can view as an unsorted binary tree. This tree must have the following properties:

  1. Each node has at most two children.
  2. If there are nodes in level i of the tree, then level i-1 is full.
  3. The nodes of the last level are as far left as possible.
  • Insert To insert a new element in a priority queue, we place it at the end and then push it upwards until it is in the correct place. Because the heap property creates a perfectly balanced tree, at most it will have to perform $log_2(n)$ or $h$ operations. So, we say that insert runs on the order of $log_2(n)$ where $n$ is the number of elements in the heap.

  • Access Minimum The most common operation for accessing elements in the priority queue is to access the minimum element. Since it should always be the first element in the array due to the heap properties, this is clearly a constant time operation.

  • Find Element To find an item in a priority queue, we must simply iterate through the array that stores the heap, which is a linear time operation based on the number of elements in the heap.

  • Remove Minimum To remove the smallest element, we swap it with the last element and then remove it, then push the top element down into place. Similar to the push up operation, at most it will perform $log_2(n)$ or $h$ operations. So, we say that remove minimum runs on the order of $log_2(n)$ where $n$ is the number of elements in the heap.

  • Heapify This is the most interesting operation of a heap. When we use heapify, we add a large number of elements to the heap and then sort it exactly once by working from the bottom to the top and pushing down each element into place. On the surface, it appears that this should run in the order $n * log_2(n)$ time, since each push down operation takes $log_2(n)$ time, and we have to do that on the order of $n$ times to get each element in place. However, using a bit of mathematical analysis, it is possible to prove that this operation actually runs in linear time $n$ based on the number of elements. The work to actually prove this is a bit beyond the scope of this course, but this StackOverflow discussion is a great explanation of how it works.

  • Memory: In terms of memory usage, a priority queue uses memory that is on the order of the number of elements in the priority queue.

Tip

Heapify and Heap Sort

Why is it important that heapify runs in linear time? That is because we can use heapify and remove minimum to sort data, which is a sorting algorithm known as heap sort.

We already saw that heapify runs in linear time based on the number of nodes, and each remove minimum operation runs in $log_2(n)$ time. To remove all the elements of the heap, we would do that $n$ times, so the overall time would be $n * log_2(n)$ time. If you recall, that is the same performance as merge sort and quicksort!

315 Structure Performance Summary

This page will be devoted to summarizing our performance discussions. Below, we have included a graph for a frame of reference for the various functions.

Graph Graph

Generic Trees

In the following, $n$ denotes the number of nodes in the tree.

  • Insert: 1 if we have the parent but $n$ if we have to find the parent
  • Access: 1 if we want to access the root but $n$ otherwise
  • Find: $n$
  • Delete: $n$ if we have to find the parent
  • Memory: $n$

Tries

In the following, $m$ denotes the length of a word and $n$ denotes the number of words in the trie.

  • Insert: $m$
  • Access: $m$
  • Find: $m$
  • Delete: $m$
  • Memory: $n\times m$

Binary Trees

In the following, $n$ denotes the number of nodes in the tree.

  • Insert: $log_2(n)$ when balanced but $n$ otherwise
  • Access: $log_2(n)$ when balanced but $n$ otherwise
  • Find: $log_2(n)$ when balanced but $n$ otherwise
  • Delete: $log_2(n)$ when balanced but $n$ otherwise
  • Memory: $n$

Matrix Graph

In the following, $n$ denotes the number of nodes in the graph.

  • Insert Node: $n$
  • Access Node: 1
  • Find Node: $n$
  • Delete Node: $n$
  • Insert Edge: 1
  • Access Edge: 1
  • Find Neighbors: $n$
  • Delete Edge: 1
  • Memory: $n^2$

List Graph

In the following, $n$ denotes the number of nodes in the graph and $e$ denotes the number of edges.

  • Insert Node: $n$
  • Access Node: 1
  • Find Node: $n$
  • Delete Node: $n$
  • Insert Edge: $n$
  • Access Edge: $n$
  • Find Neighbors: 1
  • Delete Edge: $n$
  • Memory: $n+e$

Priority Queue

In the following, $n$ denotes the number of elements in the priority queue.

  • Insert: $log_2(n)$
  • Access Minimum: 1
  • Find: $n$
  • Remove Minimum: $log_2(n)$
  • Heapify: $n$
  • Memory: $n$

Tree Algorithms

We will now discuss the performance of the algorithms that we discussed in this course. When examining the performance of an algorithm we will look at the time and the space that it will require.

  • Time: To analyze the time of an algorithm, we will look at the number of operations required to complete the algorithm.
  • Space: To analyze the space requirement of an algorithm, we will look at the various variables within the algorithm. To calculate this, we can add up the space requirements for each variable within the algorithm

Preorder and Postorder

function PREORDER(RESULT)
    append ITEM to RESULT
    FOR CHILD in CHILDREN
           CHILD.PREORDER(RESULT)
end function
function POSTORDER(RESULT)
   FOR CHILD in CHILDREN
           CHILD.POSTORDER(RESULT)
   append ITEM to RESULT
end function
  • Time: The time required for preorder or postorder traversals will be linear with respect to the number of nodes. In a single iteration, we will execute the appending once and we will execute the for loop for each child.
  • Space: Since RESULT is a variable defined and stored outside of the algorithm, it does not factor into our space requirement. Then we must account for the variables CHILD and CHILDREN. In any given iteration, CHILD will be constant and CHILDREN will have size equal to the number of children for the node we are currently at. In total, this would give us a space requirement that is linear with respect to the number of nodes.

Inorder

function INORDER(RESULT)
    LEFTCHILD.INORDER(RESULT)
    append ITEM to RESULT
    RIGHTCHILD.INORDER(RESULT)
end function
  • Time: For any given call to the inorder traversal, it will execute in constant time. We must factor in that we have a recursive function and as such, we will call the inorder traversal for each node. Thus our time requirement will be linear with respect to the number of nodes.
  • Space: Since RESULT is a variable defined and stored outside of the algorithm, it does not factor into our space requirement. Then we must account for the variable ITEM. This will have constant space and thus, the space requirement for the inorder traversal is constant.

Graph Algorithms

Path Searches

1. function DEPTHFIRSTSEARCH(GRAPH,SRC,TAR)
2.     STACK = empty array
3.     DISCOVERED = empty set
4.     PARENT_MAP = empty dictionary
5.     append SRC to STACK
6.     while STACK is not empty
7.         CURR = top of the stack
8.         if CURR not in DISCOVERED
9.             if CURR is TAR
10.                 PATH = empty array
11.                 TRACE = TAR
12.                 while TRACE is not SRC
13.                     append TRACE to PATH
14.                     set TRACE equal to PARENT_MAP[TRACE]
15.                 reverse the order of PATH
16.                 return PATH
17.            add CURR to DISCOVERED
18.            NEIGHS = neighbors of CURR
19.            for EDGE in NEIGHS
20.                NODE = first entry in EDGE
21.                append NODE to STACK
22.                if PARENT_MAP does not have key NODE
23.                    in the PARENT_MAP dictionary set key NODE with value CURR
24.    return nothing
  • Time: The time analysis for the depth first search can be a bit complex. Lines 1 through 5 would execute in near constant time. When we start the while loop on line 6, it is more difficult to analyze how many times this can execute as STACK can contain duplicates. In the case that we have a sparse graph, this would be bound by the number of nodes. For a dense graph however, the number of executions would be bound by the number of edges. The code within the while loop would be bound by the number of nodes because of the check that we have not already discovered the node in line 8. If we haven’t discovered it, we would take either the logic of lines 8 through 16 or lines 17 through 23 but never both in the same iteration. Both of these blocks are bound by the number of nodes in our graph. Thus the worst case time requirement would be $n^2$.
  • Space: Depending on the density of our graph, the space required will be linear with respect to the number of nodes or edges. This is due to the fact that STACK can contain duplicate nodes. If we have a sparse graph then it will be bound by the number of nodes. If we have a dense graph then the space is bound by the number of edges.
    • STACK: linear with respect to the number of edges
    • DISCOVERED: linear with respect to the number of nodes
    • PARENT_MAP: linear with respect to the number of nodes
    • CURR: 1
    • PATH: linear with respect to the number of nodes
    • TRACE: 1
    • NEIGHS: linear with respect to the number of neighbors
    • EDGE: 1
    • NODE: 1
1. function BREADTHFIRSTSEARCH(GRAPH,SRC,TAR)
2.     QUEUE = empty queue
3.     DISCOVERED = empty set
4.     PARENT_MAP = empty dictionary
5.     add SRC to DISCOVERED
6.     add SRC to QUEUE
7.     while QUEUE is not empty
8.         CURR = first element in QUEUE
9.         if CURR is TAR 
10.            PATH = empty list 
11.            TRACE = TAR
12.            while TRACE is not SRC
13.                    append TRACE to PATH
14.                    set TRACE equal to PARENT_MAP[TRACE]
15.                reverse the order of PATH
16.                return PATH
17.        NEIGHS = neighbors of CURR
18.        for EDGE in NEIGHS
19.            NODE = first entry in EDGE
20.            if NODE is not in DISCOVERED
21.                add NODE to DISCOVERED
22.                if PARENT_MAP does not have key NODE
23.                    in the PARENT_MAP dictionary set key NODE with value CURR
24.                append NODE to QUEUE
25.    return nothing
  • Time: The run time expected for breadth first search is a bit more straight forward as QUEUE will never have duplicates. Lines 1 through 6 will all execute in constant time. The while loop will occur $n$ times where $n$ is the number of nodes. Based on the logic, either 9-16 will execute or 17-24 will execute. Both of these are bound by the number of nodes in terms of time. Each iteration of the while loop will take $n$ time and we do the while loop $n$ times; thus the running time will be $n^2$.
  • Space: The space required for BFS will be linear with respect to the number of nodes. We have 5 variables which have size bound by the number of nodes. If the number of nodes doubles, then we expect the amount of space to roughly double.
    • QUEUE: linear with respect to the number of nodes
    • DISCOVERED: linear with respect to the number of nodes
    • PARENT_MAP: linear with respect to the number of nodes
    • CURR: 1
    • PATH: linear with respect to the number of nodes
    • TRACE: 1
    • NEIGHS: linear with respect to the number of nodes
    • EDGE: 1
    • NODE: 1

MSTs

1. function KRUSKAL(GRAPH)
2.     MST = GRAPH without the edges attribute(s)
3.     ALLSETS = an empty list which will contain the sets
4.     for NODE in GRAPH NODES
5.         SET = a set with element NODE
6.         add SET to ALLSETS
7.     EDGES = list of GRAPH's edges
8.     SORTEDEDGES = EDGES sorted by edge weight, smallest to largest
9.     for EDGE in SORTEDEDGES
10.        SRC = source node of EDGE
11.        TAR = target node of EDGE
12.        SRCSET = the set from SETS in which SRC is contained
13.        TARSET = the set form SETS in which TAR is contained
14.        if SRCSET not equal TARSET
15.            UNIONSET = SRCSET union TARSET
16.            add UNIONSET to ALLSETS
17.            remove SRCSET from ALLSETS
18.            remove TARSET from ALLSETS
19.            add EDGE to MST as undirected edge
20.    return MST
  • Time: The time to initialize MST would be linear with respect to the number of nodes. Regardless of the graph implementation, inserting nodes is constant time and we would do it for the number of nodes in GRAPH. Lines 4-6 would take linear time with respect to the number of nodes. Then lines 9-19 would take linear time with respect to the number of edges as it would execute $e$ times and each operation can be done in constant time, except for searching through the sets and performing set operations, which require $log_2(n)$ time. Thus, Kruskal’s algorithm will take time on the order of $e \times log_2(n)$ in the worst case.

  • Space:The required space for Kruskal’s algorithm is dependent on the implementation of the MST. A matrix graph would require $n^2$ space and a list graph we would require $n+e$ space.

    • MST: matrix graph $n^2$ or list graph $n+e$
    • ALLSETS: linear with respect to the number of nodes
    • NODE: 1
    • GRAPH NODES: linear with respect to the number of nodes
    • SET: 1
    • EDGES: linear with respect to the number of edges
    • SORTEDEDGES: linear with respect to the number of edges
    • SRC: 1
    • TAR: 1
    • SRCSET: 1
    • TARSET: 1
    • UNIONSET: 1
1. function PRIM(GRAPH, START)
2.     MST = GRAPH without the edges attribute(s)
3.     VISITED = empty set
4.     add START to VISITED
5.     AVAILEDGES = list of edges where START is the source
6.     sort AVAILEDGES
7.     while VISITED is not all of the nodes
8.         SMLEDGE = smallest edge in AVAILEDGES
9.         SRC = source of SMLEDGE
10.        TAR = target of SMLEDGE
11.        if TAR not in VISITED
12.            add SMLEDGE to MST as undirected edge
13.            add TAR to VISITED
14.            add the edges where TAR is the source to AVAILEDGES
15.        remove SMLEDGE from AVAILEDGES
16.        sort AVAILEDGES
17.    return MST
  • Time without Priority Queue: The time to initialize MST would be linear with respect to the number of nodes. Regardless of the graph implementation, inserting nodes is constant time and we would do it for the number of nodes in GRAPH. With a matrix graph, setting up AVAILEDGES would take linear time with respect to the number of nodes. With a list graph, this would happen in constant time. Then, we need to get the smallest edge from the AVAILEDGES list, which would be a linear time operation based on the number of edges, and we must do that once for up to each edge in the graph. So, the worst case running time for Prim’s algorithm is $e^2$. (Our implementation is actually a bit slower than this since we sort the list of available edges each time, but that is technically not necessary - our implementation is closer to $e^2 \times log_2(e)$!)
  • Time with Priority Queue: We do get an improvement when we choose to implement this with a priority queue. For the most part, the performance is the same. Using a priority queue, heapify would optimize the sorting to happen in linear time with respect to the number of elements. In that case, we can reduce the total running time to on the order of $e \times log_2(n)$, which is the same as Kruskal’s algorithm.
  • Space: The required space for Prim’s algorithm is also dependent on the implementation of the MST. A matrix graph would require $n^2$ space and a list graph we would require $n+e$ space.
    • MST: matrix graph $n^2$ or list graph $n+e$
    • VISITED: linear with respect to the number of nodes
    • AVAILEDGES: linear with respect to the number of edges
    • SMLEDGE: 1
    • SRC: 1
    • TAR: 1

Shortest Path

1. DIJKSTRAS(GRAPH, SRC)
2.    SIZE = size of GRAPH
3.    DISTS = array with length equal to SIZE
4.    PREVIOUS = array with length equal to SIZE
5.    set all of the entries in PREVIOUS to none
6.    set all of the entries in DISTS to infinity 

7.    DISTS[SRC] = 0 
8.    PQ = min-priority queue

9.    loop IDX starting at 0 up to SIZE
10.        insert (DISTS[IDX],IDX) into PQ

11.    while PQ is not empty
12.        MIN = REMOVE-MIN from PQ
13.        for NODE in neighbors of MIN
14.            WEIGHT = graph weight between MIN and NODE
15.            CALC = DISTS[MIN] + WEIGHT
16.            if CALC < DISTS[NODE]
17.                DISTS[NODE] = CALC
18.                PREVIOUS[NODE] = MIN
19.                PQIDX = index of NODE in PQ
20.                PQ decrease-key (PQIDX, CALC)
21.    return DISTS and PREVIOUS
  • Time: The time required for Dijkstra’s algorithm will be $n^2$ in the worst case. We would expect lines 1 through 8 to take constant time. The for loop on line 9 would be bound by the number of nodes. We would expect the for loop on line 13 to be bound by the number of nodes. This for loop will execute each time PQ is not empty (line 11), which is bound by the number of nodes. Thus, the block of code starting at 11 will take $n^2$ time to run in the worst case. This means that if we double the number of nodes, then the running time will be quadrupled. The worst case for Dijkstra’s algorithm is characterized by being a very dense graph, meaning each node has a lot of neighbors. If the graph is sparse and our priority queue is efficient, we could expect this running time to be more along the lines of $(n + e) \times log_2(n)$, where $e$ is the number of edges.
  • Space: The required space for Dijkstra’s algorithm will be linear with respect to the number of nodes. We have 4 variables which have linear size with respect to the number of nodes. We say that this is linear because if we were to double the number of nodes, we would roughly double the space requirement.
    • SIZE: 1
    • DISTS: linear with respect to the number of nodes
    • PREVIOUS: linear with respect to the number of nodes
    • PQ: linear with respect to the number of nodes
    • IDX: 1
    • MIN: 1
    • NODE: 1
    • NEIGHBORS: linear with respect to the number of nodes
    • WEIGHT: 1
    • CALC: 1
    • PQIDX: 1

310 Structure Performance Summary

Stacks

A stack is a data structure with two main operations that are simple in concept. One is the push operation that lets you put data into the data structure and the other is the pop operation that lets you get data out of the structure.

A stack is what we call a Last In First Out (LIFO) data structure. That means that when we pop a piece of data off the stack, we get the last piece of data we put on the stack.

Queues

A queue data structure organizes data in a First In, First Out (FIFO) order: the first piece of data put into the queue is the first piece of data available to remove from the queue.

Lists

A list is a data structure that holds a sequence of data, such as the shopping list shown below. Each list has a head item and a tail item, with all other items placed linearly between the head and the tail.

Sets

A set is a collection of elements that are usually related to each other.

Hash Tables

A hash table is an unordered collection of key-value pairs, where each key is unique.

The following table compares the best- and worst-case processing time for many common data structures and operations, expressed in terms of $N$, the number of elements in the structure.

Data Structure Insert Best Insert Worst Access Best Access Worst Find Best Find Worst Delete Best Delete Worst
Unsorted Array $1$ $1$ $1$ $1$ $N$ $N$ $N$ $N$
Sorted Array $\text{lg}(N)$ $N$ $1$ $1$ $\text{lg}(N)$ $\text{lg}(N)$ $\text{lg}(N)$ $N$
Array Stack (LIFO) $1$ $1$ $1$ $1$ $N$ $N$ $1$ $1$
Array Queue (FIFO) $1$ $1$ $1$ $1$ $N$ $N$ $1$ $1$
Unsorted Linked List $1$ $1$ $N$ $N$ $N$ $N$ $N$ $N$
Sorted Linked List $N$ $N$ $N$ $N$ $N$ $N$ $N$ $N$
Linked List Stack (LIFO) $1$ $1$ $1$ $1$ $N$ $N$ $1$ $1$
Linked List Queue (FIFO) $1$ $1$ $1$ $1$ $N$ $N$ $1$ $1$
Hash Table $1$ $N$ $1$ $N$ $N$ $N$ $1$ $N$

Welcome!

This page is the main page for Requirements Analysis

Subsections of Requirements Analysis

Module Outline

YouTube Video

In this module, we will discuss how we can determine which data structure we should choose when storing real-world data. We’ll typically make this decision based on the characteristics of the data we would like to store. In general, the data structures we’ve learned often work together well and complement each other, as we saw in Dijkstra’s algorithm. In some cases, one structure can be utilized to implement another, such as using a linked list to implement a stack or a queue.

When selecting a data structure for a particular data set, there are a few considerations we should keep in mind.

Does it Work?

The first consideration is: does this data structure work for our data? To answer this, we would want to the characteristics in the data and then determine if our chosen data structure:

  • maintains relationships between elements?
  • translates well to the real world data?
  • efficiently stores the data without extra overhead?
  • can store all aspects of the data clearly and concisely?

Will Someone Else Understand?

Once we have chosen a possible data structure, we also must consider if the choice clearly makes sense to another person who reviews our work. Some questions we may want to ask:

  • is this a natural choice?
  • if I look back at this 6 months from now, would I understand it?
  • do I need to provide extra comments or documentation to explain this choice?

Does it Perform Well?

Finally, we should always consider the performance of our program when selecting a data structure. In some cases, the most straightforward data structure may not always be the most efficient, so at times there is a trade off between performance and simplicity. Consider the hash table structure - it may seem like a more complicated array or linked list. But, if we need to find a specific item in the structure, a hash table is typically much faster due to the use of hashes. That increase in complexity pays off in faster performance. So, when analyzing data structures for performance, we typically look at these things:

  • performance time for tasks
  • space requirements
  • complexity of the code

We will address this particular facet in the next chapter. For now, we will focus on the first two portions.

Types of Data

YouTube Video

The first step toward determining which data structure to use relies heavily on the format of the data. By looking at the data and trying to understand what it contains and how it works, we can get a better idea of what data structure would work best to represent the data. Here, we will discuss some main types of data that we could encounter.

Unique

Unique data is the type of data where each element in a group must be unique - there can be no duplicates. However, in many cases the data itself doesn’t have a sequence or fixed ordering, we simply care if an element is present in the group or not. Some examples of unique data:

  • car license plates
  • personnel ID badge numbers

This data is best stored in the set data structure, usually built using a hash table as we learned previously.

However, many times this data may be included along with other data (a person’s ID number is part of the person’s record), so sets are rarely used in practice. Instead, we can simply modify other data structures to enforce the uniqueness property we desire on those attributes. In addition, we see this concept come up again in relational databases, where particular attributes can be assigned a property to enforce the rule that each element must be unique.

Linear

Linear data refers to data that has a sequence or fixed ordering. This can include

  • a series of characters
  • a list of steps
  • a log of actions taken over time

This data is typically stored in the array or linked list data structure, depending on how we intend to use it and what algorithms will be performed on the data.

We can think of this type of data as precisely one after another. This means that from one element there will be exactly one “next” element.

Finally, we can also adapt these data structures to represent FIFO (first-in, first-out) and LIFO (last-in, first-out) data structures using queues and stacks. As we saw previously, most implementations of queues and stacks are simply adaptations of existing linear data structures, usually the linked list.

Associative

Associative data is data where a key is associated with a value, usually in order to quickly find and retrieve the value based on its key. Some examples are:

  • student records associated with a student ID
  • tax forms stored with a social security number

We typically use a hash table to store this data in our programs.

However, in many cases this data could also be stored in a relational database just as easily, and in most industry programs that may also be a good choice here. Since we haven’t yet learned about relational databases, we’ll just continue to use hash tables for now.

Hierarchical

Hierarchical data refers to data where elements can be seen as above, below, or on the same level as other elements. This can include

  • corporate ladders
  • lineage
  • file systems

In contrast to linear data, hierarchical data can have multiple elements following a single element. In hierarchical data, each element has exactly one element prior to it. We typically use some form of a tree data structure to store hierarchical data.

Relational

Relational data refers to data where elements are ‘close to’ or ‘far from’ other elements. We can also think of this as ‘more similar’ and ’less similar’ as well. This can include

  • cities
  • computer components
  • social media posts
  • graphics

In a relational data set, any element can be ‘related’ to any other element. We typically use a graph data structure to store relational data.

Prioritized

The last type of data we will discuss is prioritized data. Here, we want to store a data element along with a priority, and then be able to quickly retrieve the element with the highest or lowest priority, depending on use. This could include:

  • process switching inside of a computer operating system
  • delivery tasks to perform
  • repairs needed on a device

For this data, we typically use an implementation of a priority queue based on a heap. It is important to remember that the heap does not store the data in sorted order - otherwise we would just use a linear data structure for this. Instead, it guarantees that the next element is always stored at the front of the structure for easy access, and it includes a process to quickly determine the new next element. We’ll learn a bit more about why this is so helpful in the next chapter covering performance.

Info

It is important to note that when we are given a set of requirements for a project, the developer may not use the words to classify the types of data. Based on what a user tells us, we want to be able to infer what kind of shape the data could take.

Trees

In general, trees are good for hierarchical data. While trees can be used for linear data, it is inefficient to implement them in that way (from a certain point of view, a linked list is simply a tree where each node can only have one child). When data points have many predecessors, trees cannot be used. Thus, trees are not suitable for most relational data.

Trees

To recap, we defined a tree as having the following structure:

  • There must be a single root,
  • Each child node has a single parent node,
  • It must be fully connected (no disjoint parts), and
  • There can be no cycles (no loops).

In this course, we used trees to represent family lineage and biologic classifications. Once we have the data in a tree, we can perform both preorder and postorder traversals to see all of the data, and we can use the structure of the tree to determine if two elements are related as ancestor and descendant or not.

We added constraints to trees to give us special types of trees, discussed below.

Tries

A trie is a type of tree with some special characteristics, which are:

  • Each node contains a single character, with the root node containing an empty character by convention
  • By starting at the root and traversing parent to children relationships, we can build user-defined words using those characters, and
  • Each node has a boolean property to indicate if it is the end of a word.

Tries are best suited for data sets in which elements have similar prefixes. In this course we focused on tries to represent words in a language, which is the most common use of tries. We used tries for an auto-complete style application.

Trie example: words starting with &lsquo;car&rsquo; Trie example: words starting with &lsquo;car&rsquo;

Tries are a very efficient way of storing dense hierarchical data, since each node in the trie only stores a single portion of the overall data. They also allow quickly looking up if a particular element or path exists - usually much quicker than looking through a linear list of words.

Binary Trees

A binary tree is a type of tree with some special characteristics, which are:

  • Each node has at most 2 children (nodes can have 0, 1, or 2 children), and
  • unlike general trees, the children in a binary tree are not an unordered set. The children must be ordered such that:
    • all of the descendants in the left tree are less than the parent’s value, and
    • all of the descendants in the right tree are greater than the parent’s value

Binary trees are the ideal structure to use when we have a data set with a well defined ordering. Once we have the data stored in a binary tree, we can also do an inorder traversal, which will access the elements in the tree in sorted order. We will discuss in the next chapter why we might want to choose a binary tree based on its performance.

Graphs

Graphs are a good data structure for relational data. This would include data in which elements can have some sort of similarity or distance defined between those elements. This measure of similarity between elements can be defined as realistically or abstractly as needed for the data set. The distance can be as simple as listing neighbors or adjacent elements.

Graphs

Graphs are multidimensional data structures that can represent many different types of data using nodes and edges. We can have graphs that are weighted and/or directed and we have introduced two ways we can represent graphs:

Matrix Graphs

The first implementation of graphs that we looked at were matrix graphs. In this implementation, we had an array for the nodes and a two dimensional array for all of the possible edges.

List Graphs

The second implementation of graphs were list graphs. For this implementation, we had a single array of graph node objects where the graph node objects tracked their own edges.

Recall that we discussed sparse and dense graphs. Matrix graphs are better for dense graphs since a majority of the elements in the two dimensional array of edges will be filled. A great example of a dense graph would be relationships in a small community, where each person is connected to each other person in some way.

List graphs are better for sparse graphs, since each node only needs to store the outgoing edges it is connected to. This eliminates a large amount of the overhead that would be present in a matrix graph if there were thousands of nodes and each node was only connected to a few other nodes. A great example of a sparse graph would be a larger social network such as Facebook. Facebook has over a billion users, but each user has on average only a few hundred connections. So, it is much easier to store a list of those few hundred connections instead of a two dimensional matrix that has over one quintillion ($10^{18}$) elements.

In the next chapter, we will discuss the specific implications of using one or the other. However, in our requirement analysis it is important to take this into consideration. If we have relational data where many elements are considered to be connected to many other elements, then a matrix graph will be preferred. If the elements of our data set are infrequently connected, then a list graph is the better choice.

Priority Queues

The last structure we covered were priority queues. On their own, these are good for hierarchical data. We discussed using priority queues for ticketing systems where the priority is the cost or urgency. In the project, we utilized priority queues in conjunction with Dijkstra’s algorithm.

Priority Queues

A priority queue is a data structure which contains elements and each element has an associated priority value. The priority for an element corresponds to its importance. For this course, we implemented priority queues using heaps.

A key point of priority queues is that the priority for a value can change. This is reflected by nodes moving up (via push up) or down (via push down) through the priority queue.

In contrast, a tree has a generally fixed order. Consider the file tree as a conceptual example, it is not practical for a parent folder to switch places with a child folder.

Examples

YouTube Video

In real world applications, it won’t always be a straightforward choice to use one structure over another. Users may come to us with unclear ideas of what they are looking for and we will need to be able to infer what structure is best suited for their needs based on what we can learn from them. Typically, those describing applications to us may not be familiar with the nomenclature we use as programmers. Instead, we have to look for clues about how the data is structured and used to help us choose the most appropriate data structures

Below we have some examples with possible solutions. We say possible solutions because there may be other ways that we could implement things. It is important that no matter what structure or algorithm we use, we should always document why we chose our answer. If someone else were to join on the project at a later time, it is important that they understand our reasoning.

Costume Shop

A manager at a costume shop has requested that we do some data management for their online catalog. They say:

  • Users should be able to refine their search first based on who the costumes are for (adults, children, dogs), then on the part of the costume (tops, bottoms, shoes, props) and then display those options for the user.
  • I would like if we could recommend costumes to customers based on their searches.

Take a moment to think on how you might go about this. Remember, there can be multiple ways of solving these problems. We need to be sure that we can articulate and justify our answers.

This is a situation where we could potentially use all three data structures!

First, we could use a tree to represent the search refinement. Thinking along the lines of a data structure, each category and subsequently each product will have exactly one parent. While something like ‘wigs’ shows up in all three categories, you wouldn’t want dog wigs showing up in a search for adult wigs. Thus, there is a unique ancestry for each category and product.

Tree for Search Tree for Search

In this scenario we had a fixed ordering of our hierarchy. If the manager wanted users to be able to sort by costume part and then who it was for, our tree would not hold up.

The second portion that was requested was a recommendation system. We could implement this in two parts. The first would be a graph in which the nodes are the products and they are connected based on the similarity of the products.

For example, a purple childrens wig will be very similar to a childrens blue wig but it would be very different from an adults clown shoes. Thus, our graph would have a heavily weighted connection between the childrens purple and blue wig and there would be an edge with a very small weight between the purple wig and adult clown shoes. We can obtain this information from the manager. Since each product will be connected to every other product, a matrix graph would be best suited here.

Then once we have the graph, we could implement a priority queue. The priority queue would be built based off of the users search. When a user searches for a product, we would refer to our graph and get other similar products and enqueue them. The priority would be similarity to the searched products and the item would be the recommended product. As they continue to search, we would dequeue an element if the user selects it and we would change the priority of elements based on commonalities in the searches.

Video Game

A friend approaches you about their idea for a video game and ask how they could use different data structures to produce the game. They tell us about their game which is a sandbox game with no defined goals. Users can do tasks at their own chosen speed and there is no real “completion” of the game. They say:

  • I want users to be able to see the tasks that are available to them and what the payout of the task is. It would be really awesome if we could display around 4 tasks that would earn them the most points.
  • I have a set of “shortcuts” that users can use to streamline basic functions like switching tools. It would make sense that all of the shortcuts for switching tools start with the same button clicks. So switching to a shovel would be similar to switching to a hammer.
  • I have this world in mind that is a lot like our world in terms of terrain and what not. I want players to be able to dig in the dirt and then move that dirt as they like! They could build dirt piles or houses from the trees.

Take a moment to think on how you might go about this. Remember, there can be multiple ways of solving these problems. We need to be sure that we can articulate and justify our answers.

Again, this is a situation where we could potentially use three data structures!

We can use a priority queue to suggest tasks for the players to do. In this priority queue, the priority would be the payout and the item would be the task itself. As tasks get completed, we would dequeue them.

We can use a trie to represent the set of shortcuts. Below is a small sample of how we can implement our trie.

Trie for Shortcuts Trie for Shortcuts

Since our friend mentioned that similar tasks should have similar combinations, a trie will fit well. These key combinations will have similar prefixes, so we can save ourselves space to store them by using a trie.

Finally, for the world layout we could use a graph. Similar to the maze project or weather station project, we can have nodes represent points on a plot and the edges will represent connections. The nodes will now have three coordinates, (x,y,z), rather than two, (x,y) or (latitude, longitude) and they will have an associated type (dirt, tree, rock, etc.). Graph for world blocks Graph for world blocks

Two nodes will be connected if they directly adjacent. Players can harvest cubes such as soil or limestone, and it would be removed from the world. We would utilize our remove node function to reflect this kind of action. Similarly, players can build up the world in spaces that allow, such as the dirt pile in an open area, and for that we can use our add node function as well as the appropriate add edge functions.

In our implementation of a graph, it would be better to use a list graph. Each block will be connected to at most six other blocks. Cubes neighbors Cubes neighbors

Z-instructor-resources

Subsections of Z-instructor-resources