In this blog post, we’ll show you how to use two technologies together: the generative AI functionality in Google Cloud Vertex AI, an ML development platform, and Neo4j, a graph database. Together, they can be used to build and interact with knowledge graphs.
The code underlying this blog post is available here.
Why should you use generative AI to build knowledge graphs?
Enterprises struggle to extract value from vast amounts of data. Structured data comes in many formats with well-defined APIs. Unstructured data contained in documents, engineering drawings, case sheets, and financial reports can be more difficult to integrate into a comprehensive knowledge management system.
Neo4j can be used to build a knowledge graph from structured and unstructured sources. By modeling that data as a graph, we can uncover insights that would not otherwise be available. Graph data, however, can be huge and messy to deal with. Generative AI on Google Cloud makes it easier to build a knowledge graph in Neo4j and then interact with it using natural language.
The architecture diagram below shows how Google Cloud and Neo4j work together to build and interact with knowledge graphs.
The diagram shows two data flows:
- Knowledge extraction – On the left side of the diagram, blue arrows show data flowing from structured and unstructured sources into Vertex AI. Generative AI extracts entities and relationships from that data, which are then converted to Neo4j Cypher queries and run against the Neo4j database to populate the knowledge graph. This work was traditionally done manually with handcrafted rules. Using generative AI eliminates much of the manual work of data cleansing and consolidation.
- Knowledge consumption – On the right side of the diagram, green arrows show applications that consume the knowledge graph. They present natural language interfaces to users. Vertex AI generative AI converts that natural language to Neo4j Cypher, which is run against the Neo4j database. This allows non-technical users to interact more closely with the database than was possible without generative AI.
We’re seeing this architecture come up again and again across verticals. Some examples include:
- Healthcare – Modeling the patient journey for multiple sclerosis to improve patient outcomes
- Manufacturing – Using generative AI to collect a bill of materials that extends across domains, something that wasn’t tractable with previous manual approaches
- Oil and gas – Building a knowledge base with extracts from technical documents that users without a data science background can interact with. This enables them to more quickly educate themselves and answer questions about the business.
Now that we have a high-level picture of where this technology can be used, let’s focus on a particular example.
Dataset and architecture
In this example we’re going to use the generative AI functionality in Vertex AI to parse resumes. Resumes have information like name, jobs and skills. We’re going to build a knowledge graph from those entities that shows what jobs and skills people have and share with one another.
The architecture to do this is a specific version of the architecture we saw above.
In this case, we have just one data source, rather than many. The data all comes from unstructured text in the resumes.
Once we’ve built the knowledge graph, we’ll use a Gradio application to interact with it using natural language.
Knowledge extraction
To begin, let’s decide on the schema to be used within Neo4j. Neo4j is a schema-flexible database: you can bring in new data with its own schema, connect it to what already exists, or iteratively modify the existing schema as the use case evolves.
Here is a schema that represents the resume data set:
To transfer unstructured data to Neo4j, we must first extract the entities and relationships. This is where generative AI foundation models like Google’s PaLM 2 can help. Using prompt engineering, the PaLM 2 model can extract relevant data in the format of our choice. In our Talent Finder chatbot example, we chain multiple prompts using PaLM 2’s “text-bison” model, each extracting specific entities and relationships from the input resume text. Because each prompt asks for only a subset of the entities, chaining prompts can help us avoid token limit errors.
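The chaining idea can be sketched in a few lines of Python. This is a minimal illustration, not the code from the repository: `extract_with_chain`, the `PLACEHOLDER` constant, and the `generate` callable are all hypothetical names. `generate` stands in for whatever function sends a prompt to text-bison and returns its text response, and each template is assumed to contain a `<<RESUME TEXT>>` placeholder.

```python
import json
from typing import Callable, Dict, List

# Assumed placeholder token inside each prompt template (hypothetical convention).
PLACEHOLDER = "<<RESUME TEXT>>"


def extract_with_chain(
    resume_text: str,
    prompt_templates: List[str],
    generate: Callable[[str], str],
) -> Dict[str, list]:
    """Run each focused prompt on the same resume and merge the JSON results."""
    merged: Dict[str, list] = {"entities": [], "relationships": []}
    seen_ids = set()
    for template in prompt_templates:
        prompt = template.replace(PLACEHOLDER, resume_text)
        raw = generate(prompt)  # hypothetical model call returning JSON text
        try:
            result = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip a malformed response instead of failing the chain
        for entity in result.get("entities", []):
            entity_id = entity.get("id")
            if entity_id is None or entity_id in seen_ids:
                continue  # drop entities without an id and duplicates
            seen_ids.add(entity_id)
            merged["entities"].append(entity)
        for relationship in result.get("relationships", []):
            if relationship not in merged["relationships"]:
                merged["relationships"].append(relationship)
    return merged
```

Keeping the model call behind a plain callable means the merging logic can be exercised with a stub before any real model is wired in.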
The prompt below can be used to extract position and company information as JSON from a resume text:
```text
From the Curriculum Vitae text for a job aspirant, extract Entities & relationships strictly as instructed below:
1. First, look for these Entity types in the text and generate them as comma-separated format.
   `id` property of each entity must be alphanumeric and must be unique among the entities. You will be referring to this property to define the relationship between entities. NEVER create new entity types that aren't mentioned below. You will have to generate as many entities as needed as per the types below:
    Entity Types:
    label:'Position',id:string,title:string,location:string,startDate:string,endDate:string,url:string //Position Node
    label:'Company',id:string,name:string //Company Node
2. Next generate each relationships as triples of head, relationship and tail. To refer to the head and tail entity, use their respective `id` property. NEVER create new Relationship types that aren't mentioned below:
    Relationship definition:
    position|AT_COMPANY|company //Ensure this is a string in the generated output
3. If you cannot find any information on the entities & relationships above, it is okay to return empty value. DO NOT create fictitious data
4. Do NOT create duplicate entities.
5. Restrict yourself to extract only Position and Company information. No Education or Skill information should be focussed.


Example Output JSON:
{"entities": [{"label":"Position","id":"position1","title":"Software Engineer","location":"Singapore","startDate":"2021-01-01","endDate":"present"},{"label":"Position","id":"position2","title":"Senior Software Engineer","location":"Mars","startDate":"2020-01-01","endDate":"2020-12-31"},{"label":"Company","id":"company1","name":"Neo4j Singapore Pte Ltd"},{"label":"Company","id":"company2","name":"Neo4j Mars Inc"}],"relationships": ["position1|AT_COMPANY|company1","position2|AT_COMPANY|company2"]}


Question: <<RESUME TEXT>>
Answer:
```
The output from the text-bison model is then:
```json
{
  "entities": [
    {
      "label": "Position",
      "id": "position1",
      "title": "Systems Administrator",
      "location": "Key West, FL",
      "startDate": "September 1998",
      "endDate": "September 2008"
    },
    {
      "label": "Company",
      "id": "company1",
      "name": "Keys Federal Credit Union"
    },
    {
      "label": "Position",
      "id": "position2",
      "title": "Systems Administrator",
      "location": "Key West, FL",
      "startDate": "November 2008",
      "endDate": "May 2013"
    },
    {
      "label": "Company",
      "id": "company2",
      "name": "SAIC"
    },
    {
      "label": "Position",
      "id": "position3",
      "title": "Systems Administrator",
      "location": "Key West, FL",
      "startDate": "May 2013",
      "endDate": "March 2015"
    },
    {
      "label": "Company",
      "id": "company3",
      "name": "DMI Mobile Solutions"
    },
    {
      "label": "Position",
      "id": "position4",
      "title": "Systems Administrator",
      "location": "Key West, FL",
      "startDate": "April 2015",
      "endDate": "November 2018"
    },
    {
      "label": "Company",
      "id": "company4",
      "name": "Criterion Systems"
    }
  ],
  "relationships": [
    "position1|AT_COMPANY|company1",
    "position2|AT_COMPANY|company2",
    "position3|AT_COMPANY|company3",
    "position4|AT_COMPANY|company4",
    "person1|HAS_POSITION|position1",
    "person1|HAS_POSITION|position2",
    "person1|HAS_POSITION|position3",
    "person1|HAS_POSITION|position4"
  ]
}
```
The text-bison model was able to understand the text and return the information in the JSON format we asked for. Let’s take a look at how this looks in Neo4j Browser.
The screenshot above shows the knowledge graph that we built, now stored in Neo4j Graph Database.
We’ve now used Vertex AI generative AI to extract entities and relationships from our unstructured resume data. We wrote those into Neo4j using Cypher queries created by Vertex AI generative AI. These steps would previously have been very manual. Generative AI helps automate them, saving time and effort.
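To make the loading step concrete, here is one possible way to turn the extracted JSON into parameterized Cypher statements in Python. It is a sketch under stated assumptions rather than the repository's implementation: `to_cypher` and the two allow-lists are hypothetical names, with the labels and relationship types taken from the schema used in this post. Note that the sample output above contains `HAS_POSITION` triples pointing at `person1`, which is not among the entities returned by that prompt, so the sketch skips relationships whose endpoints were not emitted in the same batch.

```python
from typing import Any, Dict, List, Tuple

# Labels and relationship types cannot be passed as Cypher parameters, so they
# are checked against an allow-list before being placed in the query text.
ALLOWED_LABELS = {"Person", "Position", "Company", "Skill", "Education"}
ALLOWED_REL_TYPES = {"HAS_POSITION", "AT_COMPANY", "HAS_SKILL", "HAS_EDUCATION"}


def to_cypher(extraction: Dict[str, Any]) -> List[Tuple[str, Dict[str, Any]]]:
    """Return (query, parameters) pairs for the entities and relationships."""
    statements: List[Tuple[str, Dict[str, Any]]] = []
    known_ids = set()
    for entity in extraction.get("entities", []):
        label = entity.get("label")
        entity_id = entity.get("id")
        if label not in ALLOWED_LABELS or not entity_id:
            continue  # ignore unknown labels and entities without an id
        props = {k: v for k, v in entity.items() if k not in ("label", "id")}
        statements.append((
            f"MERGE (n:{label} {{id: $id}}) SET n += $props",
            {"id": entity_id, "props": props},
        ))
        known_ids.add(entity_id)
    for triple in extraction.get("relationships", []):
        parts = triple.split("|") if isinstance(triple, str) else []
        if len(parts) != 3:
            continue  # not a head|TYPE|tail triple
        head, rel_type, tail = parts
        if rel_type not in ALLOWED_REL_TYPES:
            continue
        if head not in known_ids or tail not in known_ids:
            continue  # dangling reference, e.g. person1 in the sample output
        statements.append((
            f"MATCH (h {{id: $head}}), (t {{id: $tail}}) MERGE (h)-[:{rel_type}]->(t)",
            {"head": head, "tail": tail},
        ))
    return statements
```

Each pair could then be passed to a Neo4j session, for example `session.run(query, params)` with the official Python driver. Identifiers such as `position1` are only unique within a single model response, so a real loader would likely need to namespace them per resume before merging many resumes into one graph.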
Knowledge consumption
Now that we’ve built our knowledge graph, we can start to consume data from it. Cypher is Neo4j’s query language. To build a chatbot, we have to convert natural language input, in this case English, into Cypher. Models like PaLM 2 are capable of doing this. The base model produces good results, but to achieve better accuracy, we can use two additional techniques:
- Prompt Engineering – Provide a few sample questions with their expected Cypher in the model input to steer it toward the desired output. We can also try chain-of-thought prompting to teach the model how to arrive at a certain Cypher output.
- Adapter Tuning (Parameter Efficient Fine Tuning) – We can also adapter tune the model using sample data. The weights generated this way will stay within your tenant.
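As an illustration of the prompt engineering option, the snippet below assembles a few-shot prompt from instructions, a schema, and example question and Cypher pairs. It is a minimal sketch: `build_few_shot_prompt` is a hypothetical helper, and the example pair in the usage note is made up for illustration against the schema in this post, not taken from the project's tuning data.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    instructions: str,
    schema: str,
    examples: List[Tuple[str, str]],
    question: str,
) -> str:
    """Assemble instructions, schema, worked examples and the new question."""
    parts = [instructions.strip(), "Schema:", schema.strip(), ""]
    for example_question, example_cypher in examples:
        # Each worked example shows the model the expected input/output shape.
        parts.append(f"Question: {example_question}")
        parts.append(f"Answer: {example_cypher}")
        parts.append("")
    # The final question is left unanswered for the model to complete.
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n".join(parts)
```

A call might look like `build_few_shot_prompt(instructions, schema, [("Who has worked at Neo4j?", "MATCH (p:Person)-[:HAS_POSITION]->(:Position)-[:AT_COMPANY]->(c:Company) WHERE toLower(c.name) CONTAINS 'neo4j' RETURN p")], user_question)`, where `instructions`, `schema`, and `user_question` are strings you supply.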
The data flow in this case is then:
With a tuned model, we can use a simple prompt to turn text-bison into a Cypher expert:
```text
Context:
You are an expert Neo4j Cypher translator who understands the question in english and convert to Cypher strictly based on the Neo4j Schema provided and the instructions below:
1. Use the Neo4j schema to generate cypher compatible ONLY for Neo4j Version 5
2. Do not use EXISTS, SIZE keywords in the Cypher. Use alias when using the WITH keyword
3. Use only Nodes and relationships mentioned in the schema while generating the response
4. Reply ONLY in Cypher
5. Always do a case-insensitive and fuzzy search for any properties related search. Eg: to search for a Company name use `toLower(c.name) contains 'neo4j'`
6. Candidate node is synonymous to Person
7. Always use aliases to refer properties in the query
Now, use this Neo4j schema and Reply ONLY in Cypher when it makes sense.
Schema:
Nodes:
    label:'Person',id:string,role:string,description:string //Person Node
    label:'Position',id:string,title:string,location:string,startDate:string,endDate:string,url:string //Position Node
    label:'Company',id:string,name:string //Company Node
    label:'Skill',id:string,name:string,level:string //Skill Node
    label:'Education',id:string,degree:string,university:string,graduation_date:string,score:string,url:string //Education Node
Relationships:
    (:Person)-[:HAS_POSITION]->(:Position)
    (:Position)-[:AT_COMPANY]->(:Company)
    (:Person)-[:HAS_SKILL]->(:Skill)
    (:Person)-[:HAS_EDUCATION]->(:Education)


Question: I have to fill 10 Front end roles. Who are all I have based on ideal skill sets for a front end role?
Answer:
```
Vertex AI generative AI responds with this Cypher:
```cypher
MATCH (p:Person)-[:HAS_SKILL]->(s:Skill)
WHERE (toLower(s.name) CONTAINS 'html' OR toLower(s.name) CONTAINS 'css' OR toLower(s.name) CONTAINS 'javascript' OR toLower(s.name) CONTAINS 'react' OR toLower(s.name) CONTAINS 'angular')
RETURN p LIMIT 10
```
As shown above, when the question referred to a “Front end” role, the model was able to infer the important front end skills to consider, then generate the Cypher based on them.
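Put together, the consumption flow is: take the user's question, fill it into the prompt, ask the model for Cypher, and run that Cypher against the database. The sketch below shows that loop in Python with both external calls injected as plain callables. Everything here is an assumption for illustration: `answer_question`, `is_read_only`, and the `<<QUESTION>>` placeholder are hypothetical, and the keyword check is a deliberately coarse guard added for this sketch (it can reject a harmless query that merely mentions a word like `set`, and it does not cover procedures that write).

```python
import re
from typing import Any, Callable, List

# Coarse guard: a question-answering chatbot should only need read queries.
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP)\b", re.IGNORECASE
)


def is_read_only(cypher: str) -> bool:
    """Return True when no write keyword appears anywhere in the query text."""
    return WRITE_CLAUSES.search(cypher) is None


def answer_question(
    question: str,
    prompt_template: str,
    generate: Callable[[str], str],
    run_query: Callable[[str], List[Any]],
) -> List[Any]:
    """Translate a natural language question to Cypher and execute it."""
    # <<QUESTION>> is an assumed placeholder in the template.
    prompt = prompt_template.replace("<<QUESTION>>", question)
    cypher = generate(prompt).strip()  # hypothetical model call
    if not cypher or not is_read_only(cypher):
        raise ValueError(f"Refusing to run generated query: {cypher!r}")
    return run_query(cypher)  # hypothetical database call
```

In an application, `generate` would wrap the tuned text-bison model and `run_query` would wrap a Neo4j session, while the Gradio interface would simply call `answer_question` with the user's input.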
Summary
In this blog post, we walked through a two-part data flow:
- Knowledge Extraction – Taking entities and relationships from our resume data and building a knowledge graph from them.
- Knowledge Consumption – Enabling a user to ask questions of that knowledge graph using natural language.
In each case, it was the unique combination of generative AI capabilities in Google Cloud Vertex AI and Neo4j that made this possible. The approach here automates and simplifies what was previously a very manual process. This opens up applying the knowledge graph approach to a class of problems where it was not previously feasible.
Next steps
We hope you found this blog post interesting and want to learn more. The example we’ve worked through is here. We hope you fork it and modify it to meet your needs. Pull requests are always welcome!
If you have any questions, please reach out to [email protected]