Using Piccolo XML parser for parsing MEDLINE biomedical documents in XML format

Although the JDOM XML parser is easy to understand and use, it is suitable only for small XML files, because JDOM loads the whole XML structure into memory. To deal with large XML files, we need another kind of XML parser, and one we can use is the Piccolo XML parser. In this blog post, I would like to explain how I implemented Java programs that use Piccolo to parse a large XML file. There is a very good three-part article about XML parsers on JavaWorld. I learned how to use Piccolo from this article and applied it to my specific task; I also reuse some Java classes from it. Links to the article are below.

Part 1: http://www.javaworld.com/javaworld/jw-02-2002/jw-0208-xmljava.html

Part 2: http://www.javaworld.com/javaworld/jw-03-2002/jw-0329-xmljava2.html

Part 3: http://www.javaworld.com/javaworld/jw-04-2002/jw-0426-xmljava3.html

For reference, I downloaded the Piccolo Java library from http://piccolo.sourceforge.net/. The version that I use is Piccolo 1.04 with Sun JDK 1.6.0_02. Please note that what I explain below might not be the right way to do it; it is just the way that works for me.

The XML file that I use in this post is a list of PubMed articles. It can be obtained by going to the NCBI PubMed web site, performing a literature search on any desired topic, and then saving the results in XML format. I’ve attached an example of the XML file to this post: mrtextminer_pubmedxmlexample_piccolo.pdf

From the attached example file, we can see that:

  • The first line, <?xml version="1.0"?>, is the XML declaration rather than an element. It is optional, so we can delete it.
  • The top-level element is <PubmedArticleSet>. This element serves as the root element of all PubMed articles. Its immediate child element is <PubmedArticle>.
  • Each <PubmedArticle> element contains the information for one PubMed biomedical article, such as its identification number, title and abstract.

The idea of using the Piccolo XML parser is that we do not have to handle every element. The parser streams through the file, and we choose to collect data only from the XML elements that hold the information we want. For example, for each PubMed article, I want to get its PMID, title, abstract and MeSH terms.
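To make the idea concrete, here is a minimal sketch of a handler that reacts only to <PMID> and ignores every other element. It is not the code from my project: the class name PmidCollector and its parse() helper are made up for illustration, and it uses the SAX parser that ships with the JDK (through JAXP) so that it runs without extra libraries. The handler methods are plain SAX callbacks, so the same handler should work with any SAX parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: it reacts only to <PMID> and ignores everything else.
class PmidCollector extends DefaultHandler {
    final List<String> pmids = new ArrayList<String>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        text.setLength(0); // start collecting text for the element that just opened
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("PMID".equals(qName)) {
            pmids.add(text.toString().trim());
        }
    }

    // Parses an XML string and returns every PMID found, in document order.
    static List<String> parse(String xml) {
        PmidCollector handler = new PmidCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return handler.pmids;
    }
}
```

Note that a flat handler like this picks up every <PMID> element wherever it appears in the document, with no notion of which parent element it belongs to. That is one reason to think about the element structure before writing the real program.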

1) The first step is to construct an XML substructure from the full PubMed XML structure. You do not have to write a program to construct this substructure; it just serves as a plan for writing the program that extracts the data. The substructure contains only the necessary XML elements. By “necessary XML elements”, I mean the XML elements that contain the information I want, such as <PMID> and <ArticleTitle>, and the XML elements that I need to walk through to reach them, such as <Article>. In other words, the substructure includes the <Article> element because I have to traverse it before I can get to the <ArticleTitle> element. After constructing the XML substructure, I write my Java programs based on it instead of on the original full XML structure.

<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID></PMID>
      <Article>
        <ArticleTitle></ArticleTitle>
        <Abstract>
          <AbstractText></AbstractText>
        </Abstract>
      </Article>
      <MeshHeadingList>
        <MeshHeading>
          <DescriptorName></DescriptorName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName></DescriptorName>
        </MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    ………
  </PubmedArticle>
</PubmedArticleSet>

2) Secondly, I created three Java classes, StructureHandler, StructureHandlerBase and StructuredDocumentHandler, by copying the exact source code from the JavaWorld links above. I’ve attached those three Java classes here: structureddocumenthandler.pdf, structurehandlerbase.pdf and structurehandler.pdf.

StructureHandler is an interface with five methods:

  • public void startElement(String lname, Attributes attrs);
  • public void endElement(String lname, String content);
  • public StructureHandler startChild(String lname, Attributes attrs);
  • public void endDirectChild(String lname, String content);
  • public void endStructureChild(String lname, StructureHandler handler);

The StructureHandlerBase class contains an empty implementation of the StructureHandler interface. The StructuredDocumentHandler class extends the org.xml.sax.helpers.DefaultHandler.

Let’s take a minute to understand those five methods above.

  • In an XML document, every element has a start element (start tag) and an end element (end tag). For example, <PMID> is the start element, and </PMID> is the end element.
  • In addition, a start element can carry one or more attributes. For instance, in the <Article PubModel="Print"> element, PubModel is an attribute of the Article start element, and “Print” is the value of the PubModel attribute.
  • The content of an XML element is placed between the start element and the end element. For example, in <PMID>17952859</PMID>, 17952859 is the content of the <PMID></PMID> XML element.
  • An XML element can contain zero or more child elements. A direct child is an immediate child element. For example, <DescriptorName></DescriptorName> is a direct child element of the <MeshHeading> element. Now you can guess: a structured child is a child XML element that itself contains one or more XML elements. For example, <MedlineCitation></MedlineCitation> is a structured child of <PubmedArticle></PubmedArticle>. It is also a direct child of the <PubmedArticle></PubmedArticle> element.

Now, let us look at the startElement() and endElement() methods. When an XML parser traverses an element, it reaches the start element first. For example, for the <PMID>17952859</PMID> element, the XML parser reports the start element <PMID> first, together with its attributes if there are any, then the content 17952859, and finally the end element </PMID>.

startElement() has two parameters, lname and attrs. The lname parameter is the name of the current XML element; for <PMID>, lname is “PMID”. The attrs parameter holds the list of attributes of the current XML element. The <PMID> element in this example has no attributes, but an element such as <Article PubModel="Print"> does. Attributes are delivered together with the start element, so if we want to extract attribute values, we need to do it in the startElement() method.
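As an illustration of reading an attribute at the start element, here is a small sketch that picks up the PubModel attribute of <Article>. Please note the assumptions: it uses the standard four-argument startElement() of SAX's DefaultHandler rather than the two-argument method of StructureHandler, and the class name PubModelReader and its read() helper are invented for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: remembers the PubModel attribute of <Article>.
class PubModelReader extends DefaultHandler {
    String pubModel; // stays null if no PubModel attribute is seen

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("Article".equals(qName)) {
            // getValue() returns null when the attribute is absent
            pubModel = attrs.getValue("PubModel");
        }
    }

    // Parses an XML string and returns the PubModel value, or null if there is none.
    static String read(String xml) {
        PubModelReader handler = new PubModelReader();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return handler.pubModel;
    }
}
```

The attribute value has to be saved in a field here, because by the time the end element is reached the Attributes object is no longer passed in.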

Similarly, endElement() has two parameters, lname and content. lname is the name of the current XML element, and content is the content of that element. Therefore, if we want to extract the content of an XML element, we need to do it in the endElement() method. For example, to get the PMID content, I can implement endElement() as:

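private String myPMID;

public void endElement(String lname, String content) {
    // endElement() returns void in StructureHandler, so keep the content in a field
    if ("PMID".equals(lname)) {
        myPMID = content;
    }
}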

What if an XML element does not contain any content but has a number of child elements, such as the <MedlineCitation> or <Article> elements above? Then we need to look at the other three methods, startChild(), endDirectChild() and endStructureChild().
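The sketch below shows one way these three methods could work together to produce one tab-delimited row (PMID and title) per article. It is only an illustration under my own assumptions, not the JavaWorld source: the interface and base class are re-declared so the example is self-contained, SketchDispatcher is a simplified stand-in for StructuredDocumentHandler, RecordHandler and ArticleSetHandler are hypothetical names, and the rule that startChild() returns null for a direct child is an assumption of this sketch.

```java
import java.io.StringReader;
import java.util.*;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Stand-ins written for this sketch; NOT the JavaWorld source code.
interface StructureHandler {
    void startElement(String lname, Attributes attrs);
    void endElement(String lname, String content);
    StructureHandler startChild(String lname, Attributes attrs); // assumption: null = direct child
    void endDirectChild(String lname, String content);
    void endStructureChild(String lname, StructureHandler handler);
}

class StructureHandlerBase implements StructureHandler {
    public void startElement(String lname, Attributes attrs) { }
    public void endElement(String lname, String content) { }
    public StructureHandler startChild(String lname, Attributes attrs) { return null; }
    public void endDirectChild(String lname, String content) { }
    public void endStructureChild(String lname, StructureHandler handler) { }
}

// Hypothetical dispatcher that turns SAX events into StructureHandler calls.
class SketchDispatcher extends DefaultHandler {
    private final Deque<StructureHandler> handlers = new ArrayDeque<StructureHandler>();
    private final Deque<Boolean> owned = new ArrayDeque<Boolean>(); // true = element got a handler
    private final StringBuilder text = new StringBuilder();
    private final StructureHandler root;

    SketchDispatcher(StructureHandler root) { this.root = root; }

    @Override
    public void startElement(String uri, String local, String qName, Attributes attrs) {
        text.setLength(0);
        StructureHandler parent = handlers.peek();
        // The document element gets the root handler; only immediate children are offered.
        StructureHandler child = (parent == null) ? root
                : owned.peek() ? parent.startChild(qName, attrs) : null;
        if (child != null) child.startElement(qName, attrs);
        handlers.push(child != null ? child : parent);
        owned.push(child != null);
    }

    @Override
    public void characters(char[] ch, int start, int length) { text.append(ch, start, length); }

    @Override
    public void endElement(String uri, String local, String qName) {
        String content = text.toString().trim();
        text.setLength(0);
        StructureHandler h = handlers.pop();
        boolean hadHandler = owned.pop();
        StructureHandler parent = handlers.peek();
        if (hadHandler) {
            h.endElement(qName, content);
            if (parent != null) parent.endStructureChild(qName, h);
        } else if (owned.peek()) {
            h.endDirectChild(qName, content); // h is the enclosing handler here
        }
    }
}

// Collects PMID and title for one <PubmedArticle>.
class RecordHandler extends StructureHandlerBase {
    String pmid = "", title = "";
    @Override
    public StructureHandler startChild(String lname, Attributes attrs) {
        // Walk through the container elements with this same handler.
        return (lname.equals("MedlineCitation") || lname.equals("Article")) ? this : null;
    }
    @Override
    public void endDirectChild(String lname, String content) {
        if (lname.equals("PMID")) pmid = content;
        else if (lname.equals("ArticleTitle")) title = content;
    }
}

// Root handler for <PubmedArticleSet>: builds one tab-delimited row per article.
class ArticleSetHandler extends StructureHandlerBase {
    final List<String> rows = new ArrayList<String>();
    @Override
    public StructureHandler startChild(String lname, Attributes attrs) {
        return lname.equals("PubmedArticle") ? new RecordHandler() : null;
    }
    @Override
    public void endStructureChild(String lname, StructureHandler handler) {
        RecordHandler r = (RecordHandler) handler; // only RecordHandler children are created above
        rows.add(r.pmid + "\t" + r.title);
    }
    static List<String> parse(String xml) {
        ArticleSetHandler root = new ArticleSetHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new SketchDispatcher(root));
        } catch (Exception e) { throw new RuntimeException(e); }
        return root.rows;
    }
}
```

In this sketch, a <PMID> is reported through endDirectChild() only when it is an immediate child of an element that the handler chose to walk through, so a <PMID> nested inside some other, ignored element is skipped. The abstract and MeSH terms could be collected the same way by also walking through <Abstract>, <MeshHeadingList> and <MeshHeading>.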

Update (March 27, 2009): Below I attach the nine Java programs that I used to extract data from the MeSH descriptor file.
concepthandler.java
descriptorhandler
mycontenthandler
mydriver
semantictypehandler
structureddocumenthandler
structurehandler
structurehandlerbase
termhandler

To test the program, put the Java files above into the same directory. Download the MeSH descriptor XML file (from http://www.nlm.nih.gov/mesh/) to your computer. Then, change the project directory and data directory in MyDriver.java. Compile all the Java files and run the MyDriver class.


7 Responses to Using Piccolo XML parser for parsing MEDLINE biomedical documents in XML format

  1. Leandro says:

    Hello, MrTextMiner. I’ve come to your blog looking for exactly this: a fast parser for large Medline XML files. Have you come to use Piccolo? Have you explained the process somewhere else? I’m really interested 🙂

    Thank you very much in advance.

  2. Clement says:

    Hi,
    I am also very interested in your XML parsing of PubMed XML files. We just got a lease of PubMed at Stanford, and the first parser I quickly wrote to extract the same information as you (+ keywords) using DOM takes hours!! (one day for 10000 MedlineCitation elements!!)
    Do you have some results (speed of your system) or do you have some other references that may be useful for me… ?
    Thank you.

  3. mrtextminer says:

    Clement,
    I did not check the speed of my Java Medline parser using Piccolo at the time I worked on this Java parser. Now, I am working on a different project. However, based on Piccolo XML parser web site at http://piccolo.sourceforge.net/bench.html, Piccolo outperforms other XML parsers in speed.

    The reason I use the Piccolo XML parser is a limitation of my machine (only 1GB of RAM). When I use JDOM to extract data from a large XML file, I always get an out-of-memory error because JDOM stores the whole document in memory.

  4. Clement says:

    Also, do you have a comparison of your approach with :

    http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=524480

    Clement

  5. mrtextminer says:

    No, I do not have any comparison between my approach and methods implemented in the linked literature. However, it seems that we use the same XML-parsing technique, SAX instead of DOM.

  6. anonymous says:

    Have you tried vtd-xml?
    http://vtd-xml.sf.net

  7. anonymous says:

    Hi,
    I was looking for exactly something like this. I downloaded the example for MeSH terms and ran the code, and it worked great. However, I don’t see the example (“MyContentHandler”) for the PubMed parser. I am trying to parse PubMed data (500 publications in one file) into a tab-delimited file with PMID, title, abstract and MeSH terms.
    If you have this code, can you please provide it?

    Thanks
